Issues splitting BAM files for every cell

mcgilldinglab / MATES

A Deep Learning-Based Model for Quantifying Transposable Elements in Single-Cell Sequencing Data

MIT License


Issues splitting BAM files for every cell #14

Open HannePuype opened 4 days ago

HannePuype commented 4 days ago

Hi! Thank you for developing this tool!

I am trying to run MATES with my data (the test data works) on an HPC environment. However, splitting my BAM files into a BAM file for each cell takes too long. I can only submit jobs on the HPC that last 72 hours, which is too short. How do you handle this with your data? Or am I missing something? I have tried using one sample to see if it works, so I used one thread as recommended.

Thank you for your help! Kind regards, Hanne

Szym29 commented 4 days ago

Hi @HannePuype ,

Thank you for your interest in using our tool! Are you using 10X data or Smart-seq data? If it's 10X, you can use more threads. For 10X data, one sample contains thousands of cells, and each thread processes one cell at a time rather than the whole sample, so you can use as many threads as you like. This could significantly reduce the running time. If you are using Smart-seq data, splitting shouldn't take that long, since one sample represents only one cell.
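To illustrate the one-thread-per-cell idea described above, here is a minimal sketch. It is not the MATES implementation: `process_cell`, `split_by_cell`, and the barcodes are hypothetical names made up for illustration, and the per-cell work is a placeholder rather than real BAM handling.

```python
from concurrent.futures import ThreadPoolExecutor


def process_cell(barcode):
    # Hypothetical placeholder for the per-cell work, e.g. writing the
    # reads tagged with this barcode to their own BAM file.
    return f"{barcode}.bam"


def split_by_cell(barcodes, threads_num):
    # Each worker picks up one cell at a time, so up to threads_num
    # cells are processed concurrently.
    with ThreadPoolExecutor(max_workers=threads_num) as pool:
        # pool.map returns results in the same order as the input barcodes.
        return list(pool.map(process_cell, barcodes))


outputs = split_by_cell(["AAAC", "AAAG", "AAAT"], threads_num=2)
print(outputs)
```

With this layout, the useful number of threads is bounded by the number of cells rather than the number of samples.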

Thanks, Yumin

HannePuype commented 4 days ago

Hi Yumin

Thank you for your quick answer!
I am using 10X data.
I am not sure whether using more threads will solve it (but correct me if I'm wrong), because in the MATES script _bamprocessor.py the batch size is computed with the following line:
batch_size = math.ceil(sample_count / threads_num)
So even when I use, for example, 6 threads with a single sample, the batch size is still rounded up to 1 (one sample), and the cells are split one at a time rather than in parallel.
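The arithmetic can be checked with a short sketch. Only the `batch_size` line comes from the script quoted above; the `batch_plan` helper and the way the batch count is derived are assumptions made for illustration.

```python
import math


def batch_plan(sample_count, threads_num):
    # Line quoted from _bamprocessor.py: samples are grouped into batches.
    batch_size = math.ceil(sample_count / threads_num)
    # Assumption for illustration: batches are filled to batch_size,
    # so this is the number of batches that can run side by side.
    batch_count = math.ceil(sample_count / batch_size)
    return batch_size, batch_count


one_sample = batch_plan(sample_count=1, threads_num=6)
twelve_samples = batch_plan(sample_count=12, threads_num=6)
print(one_sample, twelve_samples)
```

Under these assumptions, one sample with 6 threads gives a single batch of size 1, so the extra threads have nothing to do, while 12 samples with 6 threads give 6 batches of 2 samples each.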

Kind regards, Hanne

Szym29 commented 3 days ago

Hi @HannePuype ,

Thanks for raising this issue. We recently implemented a new function that should speed up BAM file splitting and coverage vector building. We are still testing it and will let you know as soon as it is ready.

Thanks, Yumin

Szym29 commented 2 days ago

Hi @HannePuype ,

Please check MATES v0.1.3. Could you test whether bam_processor.split_count_10X_data() speeds up the preprocessing for you? We will keep improving the efficiency.

Thanks, Yumin

HannePuype commented 16 hours ago

Hi Yumin

Great, thank you! I will test it out as soon as possible and will let you know how it goes!

Hanne