sdhutchins closed this issue 6 years ago.
Hi Shaurita, sorry, do you mean parallel CPUs? It could be done if there's an actual application... What do you need it for?
Thanks Fabio
On November 26, 2017, "Shaurita D. Hutchins" wrote:
Have there been any discussions about providing the use of cores with this program?
@iosonofabio yes, in particular, I'm looking to improve the efficiency of htseq-count.
Could you tell me your application? I've parsed billions of sequencing reads in a matter of minutes, so it's unclear to me in what context htseq is too slow...
@iosonofabio I'm trying to use the htseq-count script on about 120 bam files (human), and it's been running for 300+ hours. I'm looking for a way to shorten this - perhaps use the htseq-count script with multiprocessing to send each bam file out, read it, then join the results into a final output file.
Thanks, OK, I see the point. The main reason htseq-count does not use multiple cores is that counting genes is a so-called trivially parallelizable (embarrassingly parallel) problem. In other words, there are easy workarounds to achieve the same speed boost without messing with concurrency.
In your specific case, here's what you should do:

1. If you're running the 120 bam files one after the other, run 120 parallel htseq-count processes instead, using one core for each of them. If you are using a cluster, it's gonna take care of the scheduling. If you are doing it on your local machine and it has, say, 6 cores, write a simple wrapper script that launches 6 instances of htseq-count in parallel and starts a new one whenever one is done.
2. If each of the 120 bam files takes 300+ hours on its own, split each into smaller bam files (say, 10 each), run the resulting 1200 htseq-count processes in parallel, and then join the count tables at the end.
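A minimal sketch of the wrapper from option 1, assuming htseq-count is on the PATH; the `bams/*.bam` glob, `annotation.gtf`, and the `counts` output directory are placeholders for your own paths:

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def htseq_cmd(bam, gtf):
    # Build one htseq-count invocation (-f bam: input is BAM, not SAM).
    return ["htseq-count", "-f", "bam", str(bam), str(gtf)]

def count_one(bam, gtf, out_dir):
    # Run htseq-count on a single BAM file, writing its count table to out_dir.
    out = Path(out_dir) / (Path(bam).stem + ".counts.txt")
    with open(out, "w") as fh:
        subprocess.run(htseq_cmd(bam, gtf), stdout=fh, check=True)
    return out

def count_all(bams, gtf, out_dir, workers=6):
    # Keep at most `workers` htseq-count processes running at once; as soon
    # as one finishes, the pool starts the next, until all BAMs are counted.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        args = (bams, [gtf] * len(bams), [out_dir] * len(bams))
        return list(pool.map(count_one, *args))

if __name__ == "__main__":
    import glob
    count_all(sorted(glob.glob("bams/*.bam")), "annotation.gtf", "counts")
```

On a cluster you would instead submit one htseq-count job per BAM and let the scheduler do what the pool does here.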
I do 1. all the time, and have done 2 in the past for similar reasons. Of course adding multicore support to htseq is an option, but I don't see it happening anytime soon as it's quite a mess to code and concurrent programming is fairly hard to debug.
@iosonofabio is there any concern about the counts changing with option 1?
I do it all the time; you just have to be careful when merging the counts into a single file at the end, but that goes without saying.
Thanks for your help. I ended up using method 1 and then using pandas to merge the counts files.
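For anyone landing here later, a sketch of that merge step with pandas; the assumption (not stated above) is that each per-sample output keeps htseq-count's two-column, headerless format and that sample names can be derived from the file names:

```python
from functools import reduce
from pathlib import Path

import pandas as pd

def merge_counts(paths):
    # Each htseq-count output file is two tab-separated columns with no
    # header: gene id and count. Name each count column after its sample.
    tables = []
    for p in paths:
        sample = Path(p).name.split(".")[0]  # e.g. "s1.counts.txt" -> "s1"
        tables.append(
            pd.read_csv(p, sep="\t", header=None, index_col=0,
                        names=["gene", sample])
        )
    merged = reduce(lambda a, b: a.join(b, how="outer"), tables)
    # Drop htseq-count's summary rows (__no_feature, __ambiguous, ...),
    # which would otherwise end up in the gene-by-sample matrix.
    return merged[~merged.index.str.startswith("__")]
```

An outer join keeps a gene even if it is missing from one table, which makes accidental mismatches between annotation versions visible as NaNs rather than silently dropping rows.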
good. closing
I still think it is a bit time-consuming compared with featureCounts.
I am following this Nature protocol, "Count-based differential expression analysis of RNA sequencing data using R and Bioconductor", step 13: "count reads using htseq-count". Seven .sam files take 3 hours, which is too long!
Please no necrobumping, this thread is closed. TL;DR: pull requests are very welcome...