huishenlab / biscuit

BISulfite-seq CUI Toolkit
Other
16 stars 7 forks source link

Loading reference genome takes a lot of time #56

Open demis001 opened 2 months ago

demis001 commented 2 months ago

Describe the bug I am trying it for the first time. Can you tell me why loading a reference genome takes so long?

BISCUIT version Program: BISCUIT (BISulfite-seq CUI Toolkit) Version: 1.5.0

Minimally Reproducible Example

  1. Reference genome: hg38
    
    [M::bwa_idx_load_from_disk] read 0 ALT contigs
    [M::process] read 7419966 sequences (1000000194 bp)...
    [M::process] read 7432338 sequences (1000000207 bp)...
    [M::mem_pestat] # candidate unique pairs: 380317
    [M::mem_pestat] (25, 50, 75) percentile: (93, 314, 418)

It has been performing the aforementioned process for the last four hours, still no data in the output folder

  1. Paired-end data
for SAMPLE  in `cat ./sample.txt`
 do
     #FL1=$PWD/rawData/${SAMPLE}_R1.fastq.gz
     #FL2=$PWD/rawData/${SAMPLE}_R2.fastq.gz
     #trim_galore -j 100  -a "AGATCGGAAGAGC" -a2 "AGATCGGAAGAGC" --paired --retain_unpaired --2colour 20 \
     #    --clip_r1 12 --clip_r2 12 --output_dir trimmed_fastq ${FL1} ${FL2}

     # Step 1
     echo $SAMPLE
     FL1=${TRIMDIR}/${SAMPLE}_R1_val_1.fq.gz
     FL2=${TRIMDIR}/${SAMPLE}_R2_val_2.fq.gz
     #  mapping
     biscuit align -b 1 -@ 100 -R  "@RG\tID:proper_pair\tSM:${SAMPLE}" ${GDIR}/hg38_r109.fa ${FL1} ${FL2} \
         | samtools sort  -@ 100 -n   -o ${BAMOUT}/${SAMPLE}_paired_sorted.bam -O BAM -
     dupsifter --stats-output  --remove-dups ${BAMOUT}/${SAMPLE}_dupsifter_report.txt \
         -o ${BAMOUT}/${SAMPLE}_paired_sorted_dedup.bam ${GDIR}/hg38_r109.fa  ${BAMOUT}/${SAMPLE}_paired_sorted.bam
 done

System info:

Ubuntu 24.04 CPU: 100 RAM: 1.5 TB DISK: ALL SSD

I am running 12 paired-end libraries (the adapter trimming was completed in less than an hour for all these 12 paired-end libraries on this system.

@demis001

jamorrison commented 2 months ago

Hi @demis001,

Thanks for reaching out. Did your alignments finish over the weekend? Based on the output you sent, the alignment had started already (the [M::process] lines are printed after the reference has been loaded and reads are starting to be processed).

As for runtime, one issue that may affect how long the alignments take is the quality of the input data. For example, samples with high fractions of poly-G reads from the NextSeq/NovaSeq sequencers will take significantly longer than a comparable number of reads with a low fraction of poly-G reads.

With your data being piped straight to samtools sort, I'm not surprised that output wasn't seen for some time as samtools was collecting reads to start sorting, particularly with using all threads for both biscuit and samtools. If you need to rerun your alignments, I would probably suggest giving biscuit most of the threads available (say 85 or 90 in your case) and giving the remaining threads to samtools. This will ensure the two tools aren't competing with each other for resources.

On a side note, based on what you sent, the --stats-output and --remove-dups options in your dupsifter command will need to be switched prior to running dupsifter (i.e., dupsifter --remove-dups --stats-output dupsifter.stats.txt ...). You may have already encountered this issue, but I wanted to make you aware of this if you hadn't gotten to deduplicating yet.

Best, Jacob