zUMIs: A fast and flexible pipeline to process RNA sequencing data with UMIs

Error with Correcting UB tags when working with large sample sizes #388

kvn95ss commented 5 months ago

Describe the bug
Getting the following error:

Correcting UB tags...
[1] "5.4e+08 Reads per chunk"
[1] "2024-01-28 15:27:24 CET"
[1] "Here are the detected subsampling options:"
[1] "Automatic downsampling"
[1] "Working on barcode chunk 1 out of 2"
[1] "Processing 403 barcodes in this chunk..."
[1] "Working on barcode chunk 2 out of 2"
[1] "Processing 265 barcodes in this chunk..."
Error in alldt[[i]][[1]] <- rbind(alldt[[i]][[1]], newdt[[i]][[1]]) :
  more elements supplied than there are to replace
Calls: bindList
In addition: Warning messages:
1: In parallel::mclapply(mapList, function(tt) { :
  all scheduled cores encountered errors in user code
2: In parallel::mclapply(mapList, function(tt) { :
  all scheduled cores encountered errors in user code
Execution halted
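
From the two warnings, it looks like every forked worker failed, so the `bindList` error is probably just a symptom: when all scheduled cores hit an error, `parallel::mclapply()` returns `try-error` objects in place of results, and the subsequent `rbind` over that list breaks. A minimal illustration (my own sketch, not zUMIs code; assumes a Unix-alike since `mclapply` forks):

```r
# a minimal sketch (not zUMIs code): when all forked workers fail,
# parallel::mclapply() returns 'try-error' objects instead of results,
# so any downstream rbind() over them fails with a confusing message
res <- parallel::mclapply(1:4, function(i) stop("worker ", i, " failed"))
bad <- vapply(res, inherits, logical(1), what = "try-error")
if (any(bad)) {
  # surface the first underlying worker error instead of the rbind symptom
  stop(conditionMessage(attr(res[[which(bad)[1]]], "condition")))
}
```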

Running this on Rackham, with single-end reads generated from SmartSeq3.

Some context: I used merge_demultiplexed_fastq.R to combine our ~600 samples, resulting in a 30 GB R1.fastq.gz file and a 5 GB index file. I modified the STAR alignment code to run as a single instance with 20 threads.

The generated filtered.Aligned.GeneTagged.sorted.bam had a few reads with negative positions, so I removed those reads from the BAM file and re-indexed it. The pipeline then proceeded until it produced the error above. A test with a small number of samples ran to completion and generated the full output.
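
For reference, the filtering step was roughly along these lines (a hedged sketch using Rsamtools; the input file name is from my run, but the exact filter is an approximation of what I did):

```r
library(Rsamtools)

# drop records whose 1-based POS field is missing or non-positive,
# then write and index a cleaned BAM (filterBam indexes by default);
# sketch only -- the destination file name is made up
keepValidPos <- FilterRules(list(
  validPos = function(x) !is.na(x$pos) & x$pos > 0
))
filterBam("filtered.Aligned.GeneTagged.sorted.bam",
          destination = "filtered.Aligned.GeneTagged.clean.bam",
          filter = keepValidPos,
          param = ScanBamParam(what = "pos"))
```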

For now, I am planning to split the input files into chunks, process them in batches of ~300 samples each, and then merge the generated count tables. Is that a viable option, or is it better to process the entire dataset together?

cohort.yaml.txt
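
If batching is the way to go, the merge I have in mind is along these lines (a hedged sketch: the batch file names and the $umicount$exon$all slot layout are assumptions about the zUMIs .rds output):

```r
library(Matrix)

# read per-batch zUMIs count matrices (assumed file names and slot layout)
batches <- lapply(c("batch1.dgecounts.rds", "batch2.dgecounts.rds"), readRDS)
mats <- lapply(batches, function(b) b$umicount$exon$all)

# align all batches to the union of gene IDs, then column-bind barcodes
genes <- Reduce(union, lapply(mats, rownames))
aligned <- lapply(mats, function(m) {
  out <- Matrix(0, nrow = length(genes), ncol = ncol(m), sparse = TRUE,
                dimnames = list(genes, colnames(m)))
  out[rownames(m), ] <- m
  out
})
merged <- do.call(cbind, aligned)
```

This assumes the cell barcodes (column names) don't collide across batches; with demultiplexed per-sample barcodes they shouldn't.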