single-cell-genetics / cellsnp-lite

Efficient genotyping bi-allelic SNPs on single cells
https://cellsnp-lite.readthedocs.io
Apache License 2.0
124 stars 11 forks

19 of 20 parallelized workers always finish far before the whole job is done #88

Open mimi3421 opened 1 year ago

mimi3421 commented 1 year ago

First of all, this is very nice work. Thanks to the authors.

I'm analyzing 5' VDJ-enriched single-cell data from the 10x pipeline, using version 1.2.1 because micromamba has some dependency problems with the newest 1.2.3 (lacking C++11 support?).

The problem is that after 19 of the parallel workers finish their work in about 1 hour, worker 7 is always left running and takes another 2~3 hours to finish. I checked the temp VCF files generated by workers 6 and 8 and found that the genomic region allocated to worker 7 spans chromosomes 5 and 6, which may be a region of biased enrichment in sequencing depth. I'm not familiar with C++, but judging from the Python version, the work seems to be allocated once at the beginning, by regions of the reference file. Would it be possible to allocate the jobs by chunks of the BAM file, since all reads are aligned in coordinate order, to avoid this situation?

The bash line I use to run the job is as follows:

cellsnp-lite -s possorted_genome_bam.bam -b filtered.barcodes.tsv.gz -O /tmp/test -R genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf.gz -p 20 --minMAF 0.1 --minCOUNT 20 --gzip 1>/tmp/test/log.log 2>&1
hxj5 commented 1 year ago

Hi, thanks for the feedback. Certain thread(s) can indeed get stuck for a long time when the read depth is very high. Unfortunately, it is difficult to change the framework of cellsnp-lite to allocate jobs by chunks of the BAM file, as htslib (the low-level library that cellsnp-lite depends on to perform pileup) does not support it yet, as far as I know.

hxj5 commented 1 year ago

To address this issue, we are considering two strategies:

1) Split the SNP list (mode 1) or chromosome regions (mode 2) into smaller batches and push the batches into the thread pool. However, this could add overhead from the initialization work (e.g., preparing the mplp structure) each time a thread is reused.

2) Implement a max-depth option to avoid the huge time and memory usage at high-read-depth regions. We have an alpha version of this strategy in v1.2.3, in which a thread stops pileup and moves on to the next SNP once the read count of the current SNP exceeds max-depth, but a better implementation is needed, e.g., with reservoir sampling.
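To illustrate strategy 1: the idea is to replace one large static region per thread with a shared pool of small batches, so that a thread finishing a cheap batch immediately picks up the next one, and a single high-depth region delays only its own small batch. A minimal sketch in Python (the real cellsnp-lite scheduler is in C; `pileup_batch` is a hypothetical placeholder, not an actual cellsnp-lite function):

```python
from concurrent.futures import ThreadPoolExecutor

def pileup_batch(snps):
    # Placeholder for the real per-SNP pileup work;
    # here it just reports how many SNPs were "processed".
    return len(snps)

def run_batched(snp_list, n_threads=20, batch_size=500):
    """Split the SNP list into small batches and let a thread pool
    schedule them dynamically, instead of one fixed chunk per thread."""
    batches = [snp_list[i:i + batch_size]
               for i in range(0, len(snp_list), batch_size)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # Idle threads keep pulling batches from the queue, so one
        # slow (high-depth) batch no longer stalls the whole run.
        return sum(pool.map(pileup_batch, batches))
```

The trade-off mentioned above is visible here: each batch pays the per-task setup cost again, which is why the batch size must balance load-balancing granularity against initialization overhead.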

We may try to implement these two strategies (or some others if available) in the future. Thanks for your good question.
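For reference, reservoir sampling (strategy 2) would keep an unbiased random subset of at most max-depth reads, rather than simply stopping pileup once the cap is hit. A minimal sketch of Algorithm R in Python (illustrative only; the actual cellsnp-lite pileup code is in C):

```python
import random

def reservoir_sample(reads, max_depth, rng=random):
    """Keep an unbiased random sample of at most `max_depth` reads
    from a stream, using O(max_depth) memory (Algorithm R)."""
    reservoir = []
    for i, read in enumerate(reads):
        if i < max_depth:
            reservoir.append(read)
        else:
            # Replace a kept read with probability max_depth/(i+1),
            # so every read seen so far survives with equal probability.
            j = rng.randrange(i + 1)
            if j < max_depth:
                reservoir[j] = read
    return reservoir
```

Unlike stopping early, every read overlapping the SNP has the same chance of being counted, so allele fractions are not biased toward reads that happen to come first in the pileup.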

wjzwjz5 commented 5 months ago

Hi, I think my problem is somewhat similar to this. The command I use to run my job is shown below:

singularity exec Demuxafy.sif cellsnp-lite -p 40 --minMAF 0.1 --minCOUNT 20 --gzip -s possorted_genome_bam.bam -b barcodes.tsv -O output -R merged.vcf.gz

INFO: Converting SIF file to temporary sandbox...
[I::main] start time: 2024-01-17 17:25:21
[I::main] loading the VCF file for given SNPs ...
[I::main] fetching 68674511 candidate variants ...
[I::main] mode 1a: fetch given SNPs in 62747 single cells.
[I::csp_fetch_core][Thread-27] 2.00% SNPs processed.
...
[I::csp_fetch_core][Thread-24] 72.00% SNPs processed.

Then the process gets stuck for days. I tried submitting another BAM file with its VCF counterpart, but it got stuck at the same point:

[I::csp_fetch_core][Thread-24] 72.00% SNPs processed