Open HannePuype opened 4 days ago
Hi @HannePuype ,
Thank you for your interest in using our tool! Are you using 10X data or Smart-seq data? If it's 10X, you can use more threads. For 10X data, one sample contains thousands of cells, and each thread processes one cell at a time rather than the whole sample, so you can use as many threads as you want. This could significantly improve running efficiency. If you are using Smart-seq data, it shouldn't take that long, since one sample represents only one cell.
Thanks, Yumin
Hi Yumin
Thank you for your quick answer!
I am using 10X data.
I am not sure whether using more threads will solve it (but correct me if I'm wrong), because in the MATES script _bamprocessor.py, the following line is used to set up the batches:
batch_size = math.ceil(sample_count / threads_num)
So even when I use e.g. 6 threads with a single sample, the batch size is still rounded up to 1 (one sample), and the splitting is done cell by cell and not in parallel.
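To make the arithmetic concrete, here is a minimal sketch. It only reuses the quoted `math.ceil` line; the helper `plan_batches` and the assumption that each batch is handed to one thread are hypothetical and are not taken from the MATES source.

```python
import math

def plan_batches(sample_count, threads_num):
    """Hypothetical helper (not part of MATES): group sample indices into batches."""
    # Same formula as the quoted line; max() guards against a zero step when there are no samples.
    batch_size = max(1, math.ceil(sample_count / threads_num))
    samples = list(range(sample_count))
    # Assumption: each batch would be processed by one thread.
    return [samples[i:i + batch_size] for i in range(0, sample_count, batch_size)]

# One sample, six threads: ceil(1 / 6) == 1, so there is a single batch holding the one sample.
print(plan_batches(1, 6))
# Twelve samples, six threads: ceil(12 / 6) == 2, so six batches of two samples each.
print(plan_batches(12, 6))
```

Under these assumptions, adding threads only helps when there are several samples; with one sample there is only one batch to distribute, whatever the thread count.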
Kind regards, Hanne
Hi @HannePuype ,
Thanks for raising this issue. We recently implemented a new function that could speed up BAM file splitting and coverage vector building. We are testing the new function and will let you know as soon as possible.
Thanks, Yumin
Hi @HannePuype ,
Please check out MATES v0.1.3. Can you check whether bam_processor.split_count_10X_data() helps speed up the preprocessing? We will continue to improve the efficiency.
Thanks, Yumin
Hi Yumin
Great, thank you! I will test it out as soon as possible and will let you know how it goes!
Hanne
Hi! Thank you for developing this tool!
I am trying to run MATES with my data (the test data works) in an HPC environment. However, splitting my BAM files into one BAM file per cell takes too long. Jobs on the HPC are limited to 72 hours, which is too short. How do you handle this with your data? Or am I missing something? I have tried using one sample to see if it works, so I used one thread as recommended.
Thank you for your help! Kind regards, Hanne