Even though it says one embedded worker, we actually have it set to 9; I just wanted to try changing it to one to check whether the behaviour was the same.
Original comment by rob234k...@gmail.com
on 9 Oct 2014 at 10:11
Original comment by Maxim.Sc...@gmail.com
on 9 Oct 2014 at 2:22
Hi Rob,
Did you manage to solve this problem? We are also facing the same issue: an InterProScan run on 3000 nucleotide transcripts keeps going without finishing. We ran it in CLUSTER mode on six nodes.
Thanks,
Reema
Original comment by reemasin...@gmail.com
on 5 Nov 2014 at 4:11
Hi Reema,
Nope. Their website says that for transcripts from a Trinity assembly you need to break the file up into smaller files (e.g. 3000 sequences each) and submit those. I found that using TransDecoder to get the protein sequences first, rather than letting InterProScan find the ORFs and filter them, was very much quicker, but I think the IDs then need correcting afterwards so that they match the original transcript IDs when importing into something like Blast2GO. I didn't pursue it any further for the moment, as we currently advise not to run it on a whole Trinity assembly but only on the smaller subsets identified as differentially expressed, if more than a BLAST annotation is wanted.
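For illustration only, correcting the IDs could look roughly like the sketch below. It assumes TransDecoder-style protein IDs such as TRINITY_DN1000_c0_g1_i1.p1 (the original transcript ID plus a ".pN" suffix) and that the first column of the InterProScan TSV output holds the protein ID; both would need checking against your actual files.

    # Rough sketch: rewrite protein IDs in an InterProScan TSV output so they
    # match the original Trinity transcript IDs, by stripping a TransDecoder-style
    # ".pN" suffix. The ID format and column layout are assumptions.
    import re
    import sys

    suffix = re.compile(r"\.p\d+$")

    with open(sys.argv[1]) as tsv_in, open(sys.argv[2], "w") as tsv_out:
        for line in tsv_in:
            fields = line.rstrip("\n").split("\t")
            fields[0] = suffix.sub("", fields[0])  # first column: protein accession
            tsv_out.write("\t".join(fields) + "\n")

Run as e.g. 'python fix_ids.py output.tsv output.corrected.tsv' (the script and file names are placeholders).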
Best wishes
Rob
Original comment by rob234k...@gmail.com
on 5 Nov 2014 at 4:23
Hi Rob, Hi Reema,
Sorry for not replying sooner.
One bottleneck when analysing large amounts of nucleotide sequences with InterProScan (I5) seems to be the ORF prediction step using EMBOSS getorf. I would say the way this step is integrated into InterProScan is not the most efficient: it does not split the input file into chunks and spawn multiple getorf jobs; instead it runs getorf against the entire input, and this happens on one worker only. So for this step it would not make a difference if you changed the settings file and increased the number of embedded workers.
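For reference, the setting meant here is the embedded-worker count in interproscan.properties, i.e. a line along these lines (the value shown is purely illustrative):

    worker.number.of.embedded.workers=6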
As we don't run nucleotide sequence analysis internally, we have no performance figures for this step, but as Rob already suggested, using your own ORF prediction speeds things up; you then have to do the mapping back to the original IDs yourself.
In addition, as already mentioned, chunking the file before even submitting it to I5 should help as well.
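As a rough illustration (chunk size and file naming are arbitrary choices), the input FASTA could be split into chunks with a small script along these lines:

    # Rough sketch: split a multi-FASTA file into chunks of N sequences so each
    # chunk can be submitted to I5 separately.
    import sys

    chunk_size = 3000
    out = None
    count = 0
    part = 0

    with open(sys.argv[1]) as fasta_in:
        for line in fasta_in:
            if line.startswith(">"):
                if count % chunk_size == 0:  # start a new chunk file
                    if out:
                        out.close()
                    part += 1
                    out = open("chunk_%03d.fasta" % part, "w")
                count += 1
            out.write(line)

    if out:
        out.close()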
Also, you can change the minimum length of the predicted ORFs in the InterProScan settings file; the property is called 'getorf.minsize'.
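For example, to only keep predicted ORFs of at least 50 nucleotides you would set something like the following (the value is just an illustration; choose what suits your data):

    getorf.minsize=50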
Best,
Maxim
Original comment by Maxim.Sc...@gmail.com
on 5 Nov 2014 at 4:53
To give you an example of InterProScan's run time: in-house we are able to annotate a complete Escherichia coli proteome (~3,000 protein sequences) on our farm (in CLUSTER mode) within ~3 hours.
Original comment by Maxim.Sc...@gmail.com
on 5 Nov 2014 at 5:02
Maxim, could you indicate which parameter settings in the interproscan.properties file were used to obtain the calculation time you mention for the E. coli proteome?
In other words, which parameters (grid.jobs.limit, worker.number.of.embedded.workers, master.maxconsumers, max.tier.depth, ...) should be changed to speed up the process?
Original comment by stefanie...@gmail.com
on 6 Nov 2014 at 9:08
Will post the parameter settings soon.
Original comment by Maxim.Sc...@gmail.com
on 6 Nov 2014 at 9:43
I have set up a page to document my CLUSTER mode benchmark runs, including the configuration and information about the run environment.
https://code.google.com/p/interproscan/wiki/ClusterModeBenchmarkRun
Original comment by Maxim.Sc...@gmail.com
on 14 Nov 2014 at 11:26
I might set up something similar for the STANDALONE mode. A description of how to improve performance in STANDALONE mode can be found here:
https://code.google.com/p/interproscan/wiki/ImprovingPerformance
Original comment by Maxim.Sc...@gmail.com
on 14 Nov 2014 at 11:32
Original comment by Maxim.Sc...@gmail.com
on 30 Jan 2015 at 2:23
Original issue reported on code.google.com by rob234k...@gmail.com
on 9 Oct 2014 at 10:08