Loading reference genome takes a lot of time

huishenlab / biscuit

BISulfite-seq CUI Toolkit

Other

16 stars 7 forks source link

[M::bwa_idx_load_from_disk] read 0 ALT contigs [M::process] read 7419966 sequences (1000000194 bp)... [M::process] read 7432338 sequences (1000000207 bp)... [M::mem_pestat] # candidate unique pairs: 380317 [M::mem_pestat] (25, 50, 75) percentile: (93, 314, 418)

for SAMPLE in `cat ./sample.txt` do #FL1=$PWD/rawData/${SAMPLE}_R1.fastq.gz #FL2=$PWD/rawData/${SAMPLE}_R2.fastq.gz #trim_galore -j 100 -a "AGATCGGAAGAGC" -a2 "AGATCGGAAGAGC" --paired --retain_unpaired --2colour 20 \ # --clip_r1 12 --clip_r2 12 --output_dir trimmed_fastq ${FL1} ${FL2} # Step 1 echo $SAMPLE FL1=${TRIMDIR}/${SAMPLE}_R1_val_1.fq.gz FL2=${TRIMDIR}/${SAMPLE}_R2_val_2.fq.gz # mapping biscuit align -b 1 -@ 100 -R "@RG\tID:proper_pair\tSM:${SAMPLE}" ${GDIR}/hg38_r109.fa ${FL1} ${FL2} \ | samtools sort -@ 100 -n -o ${BAMOUT}/${SAMPLE}_paired_sorted.bam -O BAM - dupsifter --stats-output --remove-dups ${BAMOUT}/${SAMPLE}_dupsifter_report.txt \ -o ${BAMOUT}/${SAMPLE}_paired_sorted_dedup.bam ${GDIR}/hg38_r109.fa ${BAMOUT}/${SAMPLE}_paired_sorted.bam done

Hi @demis001,

Thanks for reaching out. Did your alignments finish over the weekend? Based on the output you sent, the alignment had started already (the [M::process] lines are printed after the reference has been loaded and reads are starting to be processed).

As for runtime, one issue that may affect how long the alignments take is the quality of the input data. For example, samples with high fractions of poly-G reads from the NextSeq/NovaSeq sequencers will take significantly longer than a comparable number of reads with a low fraction of poly-G reads.

With your data being piped straight to samtools sort, I'm not surprised that output wasn't seen for some time as samtools was collecting reads to start sorting, particularly with using all threads for both biscuit and samtools. If you need to rerun your alignments, I would probably suggest giving biscuit most of the threads available (say 85 or 90 in your case) and giving the remaining threads to samtools. This will ensure the two tools aren't competing with each other for resources.

On a side note, based on what you sent, the --stats-output and --remove-dups options in your dupsifter command will need to be switched prior to running dupsifter (i.e., dupsifter --remove-dups --stats-output dupsifter.stats.txt ...). You may have already encountered this issue, but I wanted to make you aware of this if you hadn't gotten to deduplicating yet.

Best, Jacob

huishenlab / biscuit

Loading reference genome takes a lot of time #56