sdhutchins closed this issue 6 years ago.
Hi Shaurita, sorry, do you mean parallel CPUs? It could be done if there's an actual application... What do you need it for?
Thanks Fabio
On November 26, 2017, "Shaurita D. Hutchins" wrote:
Have there been any discussions about providing the use of cores with this program?
@iosonofabio yes, in particular, I'm looking to improve the efficiency of htseq-count.
Could you tell me your application? I've parsed billions of sequencing reads in a matter of minutes, so it's unclear to me in what context htseq is too slow...
@iosonofabio I'm trying to use the htseq-count script on about 120 bam files (human), and it's been running for 300+ hours. I'm looking for a way to shorten this - perhaps use the htseq-count script with multiprocessing to send each bam file out, read it, then join the results into a final output file.
Thanks, OK, I see the point. The main reason htseq-count does not use multiple cores is that counting genes is a so-called trivially parallelizable (embarrassingly parallel) problem. In other words, there are easy workarounds to achieve the same speed boost without messing with concurrency.
In your specific case, here's what you should do:

1. If you're running the 120 bam files one after the other, run 120 parallel htseq-count processes instead, using one core for each of them. If you are using a cluster, it's gonna take care of the scheduling. If you are doing it on your local machine and it has, say, 6 cores, write a simple wrapper script that launches 6 instances of htseq-count in parallel and starts a new one whenever one is done.
2. If each of the 120 bam files takes 300+ hours on its own, split each into smaller bam files (say, 10 each), run the resulting 1200 htseq-count processes in parallel, and then join the count tables at the end.
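A minimal sketch of the wrapper from option 1, assuming htseq-count is on the PATH; the `bams/*.bam` glob, `annotation.gtf`, and the `counts` output directory are placeholders for your own paths:

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def htseq_cmd(bam, gtf):
    # Build one htseq-count invocation (-f bam: input is BAM, not SAM).
    return ["htseq-count", "-f", "bam", str(bam), str(gtf)]

def count_one(bam, gtf, out_dir):
    # Run htseq-count on a single BAM file, writing its count table to out_dir.
    out = Path(out_dir) / (Path(bam).stem + ".counts.txt")
    with open(out, "w") as fh:
        subprocess.run(htseq_cmd(bam, gtf), stdout=fh, check=True)
    return out

def count_all(bams, gtf, out_dir, workers=6):
    # Keep at most `workers` htseq-count processes running at once; as soon
    # as one finishes, the pool starts the next, until all BAMs are counted.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        args = (bams, [gtf] * len(bams), [out_dir] * len(bams))
        return list(pool.map(count_one, *args))

if __name__ == "__main__":
    import glob
    count_all(sorted(glob.glob("bams/*.bam")), "annotation.gtf", "counts")
```

On a cluster you would instead submit one htseq-count job per BAM and let the scheduler do what the pool does here.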
I do 1. all the time, and have done 2 in the past for similar reasons. Of course adding multicore support to htseq is an option, but I don't see it happening anytime soon as it's quite a mess to code and concurrent programming is fairly hard to debug.
@iosonofabio is there any concern about the counts changing with option 1?
I do it all the time; you just have to be careful when merging the counts into a single file at the end, but that goes without saying.
Thanks for your help. I ended up using method 1 and then using pandas to merge the counts files.
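For anyone landing here later, a sketch of that merge step with pandas; the assumption (not stated above) is that each per-sample output keeps htseq-count's two-column, headerless format and that sample names can be derived from the file names:

```python
from functools import reduce
from pathlib import Path

import pandas as pd

def merge_counts(paths):
    # Each htseq-count output file is two tab-separated columns with no
    # header: gene id and count. Name each count column after its sample.
    tables = []
    for p in paths:
        sample = Path(p).name.split(".")[0]  # e.g. "s1.counts.txt" -> "s1"
        tables.append(
            pd.read_csv(p, sep="\t", header=None, index_col=0,
                        names=["gene", sample])
        )
    merged = reduce(lambda a, b: a.join(b, how="outer"), tables)
    # Drop htseq-count's summary rows (__no_feature, __ambiguous, ...),
    # which would otherwise end up in the gene-by-sample matrix.
    return merged[~merged.index.str.startswith("__")]
```

An outer join keeps a gene even if it is missing from one table, which makes accidental mismatches between annotation versions visible as NaNs rather than silently dropping rows.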
good. closing
I still think it is a bit time-consuming compared with featureCounts.
I am following this Nature protocol, "Count-based differential expression analysis of RNA sequencing data using R and Bioconductor", step 13: "count reads using htseq-count". Seven .sam files take 3 hours, which is too long!
Please no necrobumping, this thread is closed. TL;DR: pull requests are very welcome...