velocyto-team / velocyto.py

RNA velocity estimation in Python
http://velocyto.org/velocyto.py/
BSD 2-Clause "Simplified" License

Multi-threading suggestion #91

Open biounix opened 6 years ago

biounix commented 6 years ago

Hi again,

Related to my previous question #90, I was wondering if the following default approach is the most convenient for most users:

The sorting procedure uses samtools sort and is expected to be time-consuming; because of this, the procedure is performed in parallel by default.

and

..., because of the above-mentioned multithreaded call to samtools sort, running several instances of velocyto run might end up using up the memory and CPU of your system and possibly result in runtime errors.

According to my tests, the sorting step makes use of all the CPUs available. Wouldn't it make more sense to run all the steps using only one CPU by default and let the user decide whether it is appropriate to take advantage of multi-threading? I think that's the usual and safer approach. Otherwise, an unaware user can crash not only their own velocyto run but also the other processes running on the system. This is even more critical if the system is shared with other users.

The suggestion to

first call samtools sort -t CB -O BAM -o cellsorted_possorted_genome_bam.bam possorted_genome_bam.bam sequentially and only then running velocyto

can easily go unnoticed by many users and, in any case, having to run the sorting separately makes the pipeline cumbersome.
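To make the idea concrete, here is a minimal sketch of a single-threaded-by-default sort call. It only wraps the command quoted above; the function names, the `threads` and `mem_per_thread` parameters, and the `768M` value are my own illustrative assumptions, not velocyto's actual code. The `-@` (threads) and `-m` (memory per thread) flags are standard samtools sort options.

```python
import shlex
import subprocess
from typing import List


def build_sort_command(bam_in: str, bam_out: str,
                       threads: int = 1,
                       mem_per_thread: str = "768M") -> List[str]:
    """Build a `samtools sort` command that sorts by the CB tag.

    Hypothetical helper: defaults to one thread so that nothing is
    parallelised unless the user explicitly asks for it.
    """
    if threads < 1:
        raise ValueError("threads must be >= 1")
    return [
        "samtools", "sort",
        "-t", "CB",            # sort by cell-barcode tag, as in the quoted command
        "-O", "BAM",
        "-@", str(threads),    # thread count chosen by the user
        "-m", mem_per_thread,  # approximate memory cap per thread
        "-o", bam_out,
        bam_in,
    ]


def run_sort(bam_in: str, bam_out: str, threads: int = 1,
             mem_per_thread: str = "768M") -> None:
    """Run the sort; raises CalledProcessError if samtools fails."""
    subprocess.run(build_sort_command(bam_in, bam_out, threads, mem_per_thread),
                   check=True)


if __name__ == "__main__":
    # Only print the command here; running it requires samtools and a BAM file.
    print(shlex.join(build_sort_command(
        "possorted_genome_bam.bam", "cellsorted_possorted_genome_bam.bam")))
```

With something like this, a user on a shared machine gets a conservative run by default and can opt in with, for example, `threads=8` when the resources are really available.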

Thanks for the otherwise great tool,

gioelelm commented 6 years ago

According to my tests, the sorting step makes use of all the CPUs available. Wouldn't it make more sense to run all the steps using only one CPU by default and let the user decide whether it is appropriate to take advantage of multi-threading? I think that's the usual and safer approach. Otherwise, an unaware user can crash not only their own velocyto run but also the other processes running on the system. This is even more critical if the system is shared with other users.

It is a good suggestion, but how long would that typically take? Note that this is not a default samtools sort (e.g. by position or by chromosome, using the index) but a sort by tag. I remember from my initial testing that this does not scale linearly; it also has to create many more temporary files...

But I am not claiming I really optimized this! If you could provide the timings associated with your suggestion, and if those are reasonable, I will certainly make the change you propose.
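For collecting such timings, something along these lines might be enough. This is a rough sketch only: `time_command` and `compare_thread_counts` are hypothetical helper names, and the `samtools sort` invocation in the usage comment simply adds the standard `-@` thread flag to the command from the documentation.

```python
import subprocess
import sys
import time
from typing import Callable, Dict, Iterable, List


def time_command(cmd: List[str]) -> float:
    """Run a command to completion and return wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def compare_thread_counts(make_cmd: Callable[[int], List[str]],
                          thread_counts: Iterable[int]) -> Dict[int, float]:
    """Time the command produced by make_cmd(n) for each thread count n."""
    return {n: time_command(make_cmd(n)) for n in thread_counts}


if __name__ == "__main__":
    # Harmless stand-in command so the sketch runs anywhere. For a real
    # measurement, make_cmd could instead return, for example:
    # ["samtools", "sort", "-t", "CB", "-O", "BAM", "-@", str(n),
    #  "-o", "cellsorted_possorted_genome_bam.bam", "possorted_genome_bam.bam"]
    timings = compare_thread_counts(
        lambda n: [sys.executable, "-c", "pass"], [1, 2, 4])
    for n, seconds in timings.items():
        print(f"{n} thread(s): {seconds:.2f} s")
```

Wall-clock time alone will not show the memory or temporary-file pressure, so those would have to be observed separately.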

Thank you for the note!