timoast / sinto

Tools for single-cell data processing
https://timoast.github.io/sinto/
MIT License

sinto filterbarcodes too many temp files #68

Closed: cathalgking closed this issue 3 months ago

cathalgking commented 4 months ago

I would like to split my BAM file based on a list of cell barcodes (supplied as a CSV file with one barcode per line). The 10x BAM file should be split on the CB tag into multiple BAM files, one per barcode in the CSV file. The example I am dealing with has 3,500 barcodes in the CSV file, so 3,500 BAM files should be returned. The maximum number of barcodes (and BAMs) this will need to handle is 5,000.

When using sinto filterbarcodes, a large number of temp files are created in the process. The HPC is not configured to handle that many files at once, so the program does not complete. The per-process and global limits on most HPCs do not seem able to handle that many temp files. For example, this one is set to: per-process = 1,024, global = 26 million.

Is there any way around this with sinto filterbarcodes?
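For reference, the per-process limit mentioned above can be inspected from Python before launching a job. This is only a minimal sketch for Unix-like systems: `fits_within_limit` and its `reserve` parameter are hypothetical names introduced here, and the assumption that every output file is held open at the same time is not confirmed anywhere in this thread.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) per-process limits on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def fits_within_limit(n_files, soft_limit, reserve=16):
    """Check whether n_files could be open at once under soft_limit.

    `reserve` is an assumed safety margin for stdin/stdout/stderr, the
    input BAM, and similar handles; it is illustrative, not measured.
    """
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return n_files + reserve <= soft_limit


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print(f"soft limit: {soft}, hard limit: {hard}")
    print("3,500 files fit:", fits_within_limit(3500, soft))
```

If the soft limit is lower than the hard limit, it can usually be raised for the current shell up to the hard limit; whether that is permitted depends on the cluster's configuration.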

timoast commented 3 months ago

The number of temp files should scale with the number of cores used, so if it's creating a problem you could try using fewer cores. If the per-process limit is 1,024 files on your HPC, that seems quite low, and I'm not sure how you would generate the required 3,500 files on that system other than by splitting the work into multiple jobs (e.g., filtering 1,024 cells at a time).
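Splitting into multiple jobs, as suggested above, could be sketched roughly as follows. The helper names (`chunk_barcodes`, `write_chunks`), the output file naming, and the default chunk size of 1,000 (chosen to leave some headroom under a 1,024 limit) are all assumptions made for illustration. Each line of the barcode file is copied unchanged, so whatever format the file already uses is preserved.

```python
from pathlib import Path


def chunk_barcodes(lines, chunk_size=1000):
    """Split barcode lines into consecutive chunks of at most chunk_size.

    Blank lines are dropped; every other line is kept as-is (stripped of
    surrounding whitespace), in its original order.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    rows = [line.strip() for line in lines if line.strip()]
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


def write_chunks(cells_file, out_dir, chunk_size=1000):
    """Write each chunk of cells_file to its own file and return the paths."""
    lines = Path(cells_file).read_text().splitlines()
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, chunk in enumerate(chunk_barcodes(lines, chunk_size)):
        # Hypothetical naming scheme: cells_part000.txt, cells_part001.txt, ...
        path = out_dir / f"cells_part{i:03d}.txt"
        path.write_text("\n".join(chunk) + "\n")
        paths.append(path)
    return paths
```

Each resulting file could then be given to its own sinto filterbarcodes run as the list of cells, so that no single process has to create more than `chunk_size` output BAM files.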

cathalgking commented 3 months ago

Ok. Do you mean by using the parameter -p NPROC, --nproc NPROC? Or in the SLURM header?

timoast commented 3 months ago

The -p parameter, not the SLURM header.

cathalgking commented 3 months ago

That worked, thanks @timoast.

cathalgking commented 2 weeks ago

@timoast I finally got around to testing this, and you're right that the number of temp files scales with the number of cores used when setting nproc. Can you explain why that happens? When I set -p 1, one temp file was made for each BAM file. When I set -p 5, between 5 and 10 temp files were created for each BAM file, but mostly just 5. Can you explain this variance?

Is there a way for it to output no temp files and just the BAM files?
