nanoporetech / pinfish

Tools to annotate genomes using long read transcriptomics data

cluster_gff not multithreading #1

Closed noncodo closed 5 years ago

noncodo commented 5 years ago

I'm playing around with this pipeline, trying to implement it in a resource-optimised SGE pipeline (which I will gladly share once completed). It seems that cluster_gff is only using one thread for the clustering step. Monitoring the job shows it mostly running on 1 thread (~100% CPU), and I could see it go to ~120% when reading from or writing to NVMe drives. Could the parallelisation be implemented only for read/write operations? Apologies, I don't speak Go.

bsipos commented 5 years ago

Hi,

The input of cluster_gff is pretty unstructured (sorted GFF). Due to this there was no easy way to make clustering parallelised. I could maybe introduce a bit more parallelism, but I don't think that the speed gains would justify the increase in code complexity.

Clustering could be parallelised "manually", though, by partitioning the input data into non-overlapping "loci". However, in my experience the bottleneck in the whole pipeline is polishing rather than clustering, so I do not have plans to implement this.
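As a rough illustration of the "non-overlapping loci" idea, here is a minimal Python sketch (not pinfish code). It assumes the transcript spans have already been extracted from the sorted GFF as `(seqid, start, end)` tuples, ordered by sequence and start position; the name `partition_loci` is hypothetical, and strand is ignored for simplicity.

```python
from typing import Iterable, Iterator, List, Tuple

# (seqid, start, end) with 1-based inclusive coordinates, as in GFF.
Span = Tuple[str, int, int]


def partition_loci(spans: Iterable[Span]) -> Iterator[List[Span]]:
    """Group spans into loci: maximal runs of overlapping spans on one sequence.

    Assumes the input is sorted by (seqid, start). Hypothetical helper,
    not part of pinfish.
    """
    locus: List[Span] = []
    cur_seqid, cur_end = None, -1
    for seqid, start, end in spans:
        # A new locus starts on a new sequence, or when the span begins
        # past the furthest end seen so far in the current locus.
        if locus and (seqid != cur_seqid or start > cur_end):
            yield locus
            locus = []
            cur_end = -1
        locus.append((seqid, start, end))
        cur_seqid = seqid
        cur_end = max(cur_end, end)
    if locus:
        yield locus
```

Because no span in one locus overlaps a span in another, each locus (or batch of loci) could in principle be clustered independently and the results concatenated.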

noncodo commented 5 years ago

So, splitting input by chromosome might be a very HPC friendly way to speed up the clustering step then? I presume this would help polishing as well (e.g. massive SGE job array). So far, this step is taking 4+ days of single threaded compute for some PromethION data.

bsipos commented 5 years ago

Yes, splitting by chromosome would make sense and should be easy to achieve.
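A minimal sketch of such a split in Python, relying only on the fact that the first tab-separated column of a GFF line is the sequence ID. The function name `split_gff_by_seqid` and the `<seqid>.gff` output naming are assumptions made for illustration, and the sketch assumes sequence IDs are safe to use as file names.

```python
import os
from typing import Dict, TextIO


def split_gff_by_seqid(gff_path: str, out_dir: str) -> Dict[str, str]:
    """Write one GFF file per sequence ID and return {seqid: output path}.

    Hypothetical helper, not part of pinfish. Line order is preserved,
    so a sorted input yields sorted per-chromosome files.
    """
    os.makedirs(out_dir, exist_ok=True)
    handles: Dict[str, TextIO] = {}
    paths: Dict[str, str] = {}
    try:
        with open(gff_path) as src:
            for line in src:
                # Skip blank lines and header/comment lines.
                if not line.strip() or line.startswith("#"):
                    continue
                seqid = line.split("\t", 1)[0]
                if seqid not in handles:
                    paths[seqid] = os.path.join(out_dir, f"{seqid}.gff")
                    handles[seqid] = open(paths[seqid], "w")
                handles[seqid].write(line)
    finally:
        for handle in handles.values():
            handle.close()
    return paths
```

Each resulting file could then be handed to its own cluster_gff job, for example as one task of an SGE job array.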

noncodo commented 5 years ago

FYI this took 21 days to complete on a PromethION cDNA run

bsipos commented 5 years ago

Not blazing fast then :) Was it just one PromethION chip? Did you split the data by chromosome in the end? I do not routinely test pinfish on PromethION data but I have one around so might do an evaluation myself.

noncodo commented 5 years ago

One PromethION + one GridION run, about 50M reads. This was run with default parameters on the bulk GFF. However, I am keen to make a Sun Grid Engine fork that will speed up compute for those with access to an SGE job scheduler. If it takes 21 days to cluster GFFs, I can only imagine how long it will take to polish.

bsipos commented 5 years ago

It seems I really have to look into scaling. I have only tested on a ~10 million read dataset so far. One more question: was this direct RNA or cDNA? BTW, the bottleneck in polishing is the minimap2 alignment and the racon polishing. Both tools will use multiple cores, though racon polishing can take a lot of time for high-coverage transcripts. I could implement downsampling of coverage, which would speed things up, most likely without much loss in polished accuracy.
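To make the downsampling idea concrete, here is a minimal Python sketch, not an existing pinfish feature. It assumes the reads of one transcript cluster are available as a list of read IDs and simply caps that list at a fixed size by uniform random sampling; `downsample_reads`, `max_reads` and `seed` are hypothetical names.

```python
import random
from typing import List, Optional, Sequence


def downsample_reads(
    read_ids: Sequence[str], max_reads: int, seed: Optional[int] = 0
) -> List[str]:
    """Return at most max_reads read IDs, sampled uniformly without replacement.

    Hypothetical helper, not part of pinfish. A fixed seed keeps the
    selection reproducible between runs.
    """
    if max_reads < 1:
        raise ValueError("max_reads must be at least 1")
    if len(read_ids) <= max_reads:
        # Low-coverage clusters are passed through unchanged.
        return list(read_ids)
    rng = random.Random(seed)
    return rng.sample(list(read_ids), max_reads)
```

Only clusters above the cap are affected, so low-coverage transcripts would keep all of their reads while the most highly expressed ones stop dominating the polishing time.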