zUMIs: A fast and flexible pipeline to process RNA sequencing data with UMIs

Error with Correcting UB tags when working with large sample sizes #388

kvn95ss commented 5 months ago

Describe the bug
Getting the following error:

Correcting UB tags...
[1] "5.4e+08 Reads per chunk"
[1] "2024-01-28 15:27:24 CET"
[1] "Here are the detected subsampling options:"
[1] "Automatic downsampling"
[1] "Working on barcode chunk 1 out of 2"
[1] "Processing 403 barcodes in this chunk..."
[1] "Working on barcode chunk 2 out of 2"
[1] "Processing 265 barcodes in this chunk..."
Error in alldt[[i]][[1]] <- rbind(alldt[[i]][[1]], newdt[[i]][[1]]) :
  more elements supplied than there are to replace
Calls: bindList
In addition: Warning messages:
1: In parallel::mclapply(mapList, function(tt) { :
  all scheduled cores encountered errors in user code
2: In parallel::mclapply(mapList, function(tt) { :
  all scheduled cores encountered errors in user code
Execution halted
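
From the two warnings, it looks like every forked worker failed, so the `bindList` error is probably just a symptom: when all scheduled cores hit an error, `parallel::mclapply()` returns `try-error` objects in place of results, and the subsequent `rbind` over that list breaks. A minimal illustration (my own sketch, not zUMIs code; assumes a Unix-alike since `mclapply` forks):

```r
# a minimal sketch (not zUMIs code): when all forked workers fail,
# parallel::mclapply() returns 'try-error' objects instead of results,
# so any downstream rbind() over them fails with a confusing message
res <- parallel::mclapply(1:4, function(i) stop("worker ", i, " failed"))
bad <- vapply(res, inherits, logical(1), what = "try-error")
if (any(bad)) {
  # surface the first underlying worker error instead of the rbind symptom
  stop(conditionMessage(attr(res[[which(bad)[1]]], "condition")))
}
```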

Running this on Rackham, with single-end reads generated from SmartSeq3.

Some context: I used merge_demultiplexed_fastq.R to combine our ~600 samples, resulting in a 30 GB R1.fastq.gz file and a 5 GB index file. I modified the STAR alignment code to run as a single instance with 20 threads.

The generated filtered.Aligned.GeneTagged.sorted.bam had a few reads with negative positions, so I removed those reads from the BAM file and re-indexed it. The pipeline then proceeded until it produced the error above. A test with a small number of samples ran to completion and generated the full output.
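
For reference, the filtering step was roughly along these lines (a hedged sketch using Rsamtools; the input file name is from my run, but the exact filter is an approximation of what I did):

```r
library(Rsamtools)

# drop records whose 1-based POS field is missing or non-positive,
# then write and index a cleaned BAM (filterBam indexes by default);
# sketch only -- the destination file name is made up
keepValidPos <- FilterRules(list(
  validPos = function(x) !is.na(x$pos) & x$pos > 0
))
filterBam("filtered.Aligned.GeneTagged.sorted.bam",
          destination = "filtered.Aligned.GeneTagged.clean.bam",
          filter = keepValidPos,
          param = ScanBamParam(what = "pos"))
```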

For now, I am planning to split the input files into chunks, process them in batches of ~300 samples each, and then merge the generated count tables. Is that a viable option, or is it better to process the entire dataset together?

cohort.yaml.txt
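
If batching is the way to go, the merge I have in mind is along these lines (a hedged sketch: the batch file names and the $umicount$exon$all slot layout are assumptions about the zUMIs .rds output):

```r
library(Matrix)

# read per-batch zUMIs count matrices (assumed file names and slot layout)
batches <- lapply(c("batch1.dgecounts.rds", "batch2.dgecounts.rds"), readRDS)
mats <- lapply(batches, function(b) b$umicount$exon$all)

# align all batches to the union of gene IDs, then column-bind barcodes
genes <- Reduce(union, lapply(mats, rownames))
aligned <- lapply(mats, function(m) {
  out <- Matrix(0, nrow = length(genes), ncol = ncol(m), sparse = TRUE,
                dimnames = list(genes, colnames(m)))
  out[rownames(m), ] <- m
  out
})
merged <- do.call(cbind, aligned)
```

This assumes the cell barcodes (column names) don't collide across batches; with demultiplexed per-sample barcodes they shouldn't.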