reage / interproscan

Automatically exported from code.google.com/p/interproscan

[interhelp #28771] Interproscan speed problem with embedded workers on our system #54

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
We are having a problem with speed for InterProScan, and for inputs with many nucleotide 
sequences it just seems to fail or never finish.

We only get one thread on our system even though the machine has many CPUs and we set 9 
embedded workers, so it is extremely slow even when using our own lookup service, and 
when we try a nucleotide Trinity assembly it doesn't seem to finish at all. A 
14000-protein sequence file takes 2-3 days and works fine, but that still seems a little 
slow. We have not got cluster mode working yet, but we thought that simply assigning 
multiple embedded workers would speed things up? Setting the Java memory higher had no 
effect. The properties file and a potential problem (strace output) are shown below.

version: interproscan-5.7-48.0

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                160
On-line CPU(s) list:   0-159
Thread(s) per core:    2
Core(s) per socket:    10
CPU socket(s):         8
NUMA node(s):          8
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 47
Stepping:              2
CPU MHz:               2394.016
BogoMIPS:              4787.89
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K

Properties file:
# This is the InterProScan configuration file

##
## Temporary files and directory
##
# The text [UNIQUE], if present, will be replaced by a value unique to your running instance

# Temporary files used by the analyses will be placed in directories here:
temporary.file.directory.suffix=[UNIQUE]
temporary.file.directory=/home/usern/galaxy/tmpdir/${temporary.file.directory.suffix}

##
## H2 database
##
# The H2 database is copied by the standalone version of interproscan
i5.h2.database.original.location=work/template/interpro.zip
# LOCK_TIMEOUT: Sets the lock timeout (in milliseconds) for the current session
i5.database.connection.url=jdbc:h2:mem:interpro;LOCK_TIMEOUT=10000000

##
## binary paths
##
# Configure the version of perl to use when running member databases perl binaries
perl.command=perl

# Binary file locations
binary.hmmer3.path=bin/hmmer/hmmer3/3.1b1
binary.hmmer3.hmmscan.path=bin/hmmer/hmmer3/3.1b1/hmmscan
binary.hmmer3.hmmsearch.path=bin/hmmer/hmmer3/3.1b1/hmmsearch
binary.hmmer2.hmmsearch.path=bin/hmmer/hmmer2/2.3.2/hmmsearch
binary.hmmer2.hmmpfam.path=bin/hmmer/hmmer2/2.3.2/hmmpfam
binary.fingerprintscan.path=bin/prints/fingerPRINTScan
binary.coils.path=bin/coils/ncoils
domainfinder3.path=bin/gene3d/DomainFinder3
binary.prodom.2006.1.prodomblast3i.pl.path=bin/prodom/2006.1/ProDomBlast3i.pl
# Note: Correct prosite binary distribution for your platform can be downloaded: ftp://ftp.expasy.org/databases/prosite/tools/ps_scan/
binary.prosite.psscan.pl.path=bin/prosite/ps_scan.pl
binary.prosite.pfscan.path=bin/prosite/pfscan
binary.panther.path=bin/panther/7.0/pantherScore.pl
binary.panther.perl.lib.dir=bin/panther/7.0/lib
binary.superfamily.1.75.ass3.pl.path=bin/superfamily/1.75/ass3_single_threaded.pl
binary.pirsf.pl.path=bin/pirsf/2.85/pirsf.pl
binary.blastall.2.2.6.path=bin/blast/2.2.6/blastall
binary.blast.2.2.19.path=bin/blast/2.2.19
binary.getorf.path=bin/nucleotide/getorf
# Note: SignalP binary not distributed with InterProScan 5, please install separately e.g. in bin/signalp/4.0/signalp
binary.signalp.4.0.path=/home/apps/scripts/signalp
# Note: TMHMM binary not distributed with InterProScan 5, please install separately e.g. in bin/tmhmm/2.0c/decodeanhmm
binary.tmhmm.path=/home/apps/bin/decodeanhmm
# Note: Phobius binary not distributed with InterProScan 5, please install separately e.g. in bin/phobius/1.01/phobius.pl
binary.phobius.pl.path.1.01=/home/apps/scripts/phobius.pl

##
##  Member database model / data file locations (alphabetically sorted)
##
# Member database model / data file locations (alphabetically sorted)
coils.new_coil.mat.path.2.2=data/coils/2.2/new_coil.mat
gene3d.hmm.path.3.5.0=data/gene3d/3.5.0/gene3d_classified.hmm
gene3d.model2sf_map.path.3.5.0=data/gene3d/3.5.0/model_to_family_map.csv
hamap.profile.models.path.201311.27=data/hamap/201311.27/hamap.prf
# It is IMPORTANT to set this temporary directory to a directory on LOCAL disk -
# network IO will slow the panther analysis down considerably.
panther.temporary.file.directory=/tmp/
panther.models.dir.9.0=data/panther/9.0/model
Pfam-A.hmm.path.27.0=data/pfam/27.0/Pfam-A.hmm
Pfam-A.seed.path.27.0=data/pfam/27.0/Pfam-A.seed
Pfam-A.hmm.path.26.0=data/pfam/26.0/Pfam-A.hmm
Pfam-A.seed.path.26.0=data/pfam/26.0/Pfam-A.seed
Pfam-C.path.27.0=data/pfam/27.0/Pfam-C
#Version 2.84
pirsf.hmm.bin.path.2.84=data/pirsf/2.84/sf_hmm.bin
pirsf.hmm.subf.bin.path.2.84=data/pirsf/2.84/sf_hmm_subf.bin
pirsf.hmm.path.2.84=data/pirsf/2.84/sf_hmm
pirsf.hmm.subf.path.2.84=data/pirsf/2.84/sf_hmm_subf
pirsf.dat.path.2.84=data/pirsf/2.84/pirsf.dat
pirsf.sf.tb.path.2.84=data/pirsf/2.84/sf.tb
pirsf.sf.seq.path.2.84=data/pirsf/2.84/sf.seq

prints.kdat.path.42.0=data/prints/42.0/prints42_0.kdat
prints.pval.path.42.0=data/prints/42.0/prints.pval
prints.hierarchy.path.42.0=data/prints/42.0/FingerPRINTShierarchy.db
prodom.ipr.path.2006.1=data/prodom/2006.1/prodom.ipr
prosite.models.path.20.97=data/prosite/20.97/prosite.dat
prosite.evaluator.models.path.20.97=data/prosite/20.97/evaluator.dat
signalp.4.0.perl.library.dir=bin/signalp/4.0/lib
# Note: Smart overlapping and threshold files not distributed with InterProScan 5, please install separately e.g. in data/smart/6.2
smart.hmm.path.6.2=data/smart/6.2/smart.HMMs
smart.hmm.bin.path.6.2=data/smart/6.2/smart.HMMs.bin
smart.overlapping.path.6.2=
smart.threshold.path.6.2=
superfamily.hmm.path.3.0=data/superfamily/1.75/hmmlib_1.75
superfamily.self.hits.path.1.75=data/superfamily/1.75/self_hits.tab
superfamily.cla.path.1.75=data/superfamily/1.75/dir.cla.scop.txt_1.75
superfamily.model.tab.path.1.75=data/superfamily/1.75/model.tab
superfamily.pdbj95d.path.1.75=data/superfamily/1.75/pdbj95d
tigrfam.hmm.path.13.0=data/tigrfam/13.0/TIGRFAMs_13.0_HMM.LIB
# Note: TMHMM model files not distributed with InterProScan 5, please install separately e.g. in data/tmhmm/2.0/TMHMM2.0.model
tmhmm.model.path=data/tmhmm/2.0/TMHMM2.0.model

##
## cpu options for parallel processing
##

#hmmer cpu options for the different jobs
hmmer3.hmmsearch.cpu.switch.pfama=--cpu 8
hmmer3.hmmsearch.cpu.switch.tigrfam=--cpu 8
hmmer3.hmmsearch.cpu.switch.gene3d=--cpu 8
hmmer3.hmmsearch.cpu.switch.superfamily=--cpu 8

hmmer2.hmmpfam.cpu.switch.smart=--cpu 8
hmmer2.hmmpfam.cpu.switch.pirsf=--cpu 8

#blastall cpu options
blastall.cpu.switch.pirsf=-a 8

#panther binary cpu options (for blastall and hmmsearch)
panther.binary.cpu.switch=-c 8

#pirsf binary cpu options (for hmmscan)
pirsf.pl.binary.cpu.switch=-cpu 8

##
## max number of proteins per analysis batch
##
# These values control the maximum number of proteins put through
# an analysis in one go - different algorithms have different optimum values.
# Note that if you suffer from out of memory errors, reducing these values
# will almost certainly help, but may reduce the speed of analysis.
analysis.max.sequence.count.TMHMM=100
analysis.max.sequence.count.PANTHER=100
analysis.max.sequence.count.SMART=50
analysis.max.sequence.count.TIGRFAM_9=50
analysis.max.sequence.count.TIGRFAM_10=100
analysis.max.sequence.count.GENE3D=50
analysis.max.sequence.count.PRINTS=100
analysis.max.sequence.count.PROSITE_PROFILES=100
analysis.max.sequence.count.PROSITE_PATTERNS=100
analysis.max.sequence.count.PIRSF=50
analysis.max.sequence.count.PRODOM=100
analysis.max.sequence.count.SSF=50
analysis.max.sequence.count.HAMAP=100
analysis.max.sequence.count.PFAM_A=100
analysis.max.sequence.count.COILS=100
analysis.max.sequence.count.PHOBIUS=100
analysis.max.sequence.count.SIGNALP=100

##
##  General settings
##

# If multiple hosts are sharing the same file system, a delay may be required to
# avoid stale NFS handles
# nfs.delay.milliseconds=0

# Instructs I5 to completely clean up after itself - leave set to true.
delete.temporary.directory.on.completion=true

##
## Broker TCP Connection
##

# A list of TCP ports that should not be used for messaging. (Apart from this, only ports > 1024 and < 65535 will be used.)
tcp.port.exclusion.list=3879,3878,3881,3882

##
##  precalculated match lookup service
##
# By default, if the sequence already has matches available from the EBI, this service will look them
# up for you.  Note - at present it will always return all the available matches, ignoring any -appl options
# set on the command line.
precalculated.match.lookup.service.url=http://interproscan:8081

#proxy set up
precalculated.match.lookup.service.proxy.host=
precalculated.match.lookup.service.proxy.port=8081

##
## getorf configuration for nucleic acid sequences
##
# the following are roughly the times getorf takes to find sequences of open reading frames (ORFs) in n nucleotide sequences
#number of sequences -> approx. time it takes in our tests
#        600000 -> 10 minutes
#        3600000 -> 1 hour
#        7200000 -> 2 hours
#        43200000 -> 12 hours

# JOB: jobLoadNucleicAcidSequence
getorf.minsize=50

##
## Output format
##
# TRUE by default, which means all generated graphical output documents (only SVG at the moment) will be archived (using the Linux command tar).
# This simple switch allows you to switch the archive mode off (simply set it to FALSE).
archiveSVGOutput=true

##
## Master/Stand alone embedded workers
##

# Set the number of embedded workers to the number of processors that you would like to employ
# on the machine you are using to run InterProScan.
#number of embedded workers  a master process can have
number.of.embedded.workers=1
maxnumber.of.embedded.workers=1

##
## Distributed mode (Cluster mode)
##

#grid name
grid.name=sge
#grid.name=other-cluster

#Java Virtual Machine (JVM) maximum idle time for jobs.
#Default is 180 seconds, if not specified. When reached the worker will shutdown.
jvm.maximum.idle.time.seconds=180

#JVM maximum life time for workers.
#Default is 14400 seconds, if not specified. After this period has passed the worker will shutdown unless it is busy.
jvm.maximum.life.seconds=14400

#project name for this run  - use user.digest
user.digest=i5GridRun

#grid jobs limit : number of jobs you are allowed to run on the cluster
grid.jobs.limit=1000

#time between each bjobs or qstat command to check the status of jobs on the cluster
grid.check.interval.seconds=120

#allow master interproscan to run binaries ()
master.can.run.binaries=true

#deal with unknown step states
recover.unknown.step.state=false

#Grid submission commands (e.g. LSF bsub or SGE qsub) for starting remote workers
#commands the master uses to start new remote workers
grid.master.submit.command=qsub -cwd -V -b y -N i5t1worker
grid.master.submit.high.memory.command=qsub -cwd -V -b y -N i5t1hmworker

#commands a worker uses to start new remote workers
grid.worker.submit.command=qsub -cwd -V -b y -N i5t2worker
grid.worker.submit.high.memory.command=qsub -cwd -V -b y -N i5t2hmworker

# command to start a new worker (new jvm)
worker.command=java -Xms32m -Xmx2048m -jar interproscan-5.jar
# This may be identical to the worker.command argument above, however you may choose to select
# a machine with a much larger available memory, for use when a StepExecution fails.
worker.high.memory.command=java -Xms32m -Xmx2048m -jar interproscan-5.jar

#directory for any log files generated by InterProScan
log.dir=/home/usern/galaxy/tmpdir/${temporary.file.directory.suffix}/logs

# Set the number of embedded workers to the number of processors that you would like to employ
# on the node machine on which the worker will run.
#number of embedded workers in a remote worker
worker.number.of.embedded.workers=1
worker.maxnumber.of.embedded.workers=4

# max number of connections to the master
master.maxconsumers=64

#number of connections to the worker
worker.maxconsumers=32

#throttled network?
grid.throttle=true

# max number of jobs a tier 1 worker is allowed on its queue
worker.maxunfinished.jobs=64

#network tier depth
max.tier.depth=1

# Active MQ JMS broker temporary data directory
jms.broker.temp.directory=activemq-data/localhost/tmp_storage

POTENTIAL PROBLEM:
42411 futex(0x7f6860117c28, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
42410 futex(0x7f6860117c28, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
42411 <... futex resumed> )             = -1 EAGAIN (Resource temporarily unavailable)
42411 futex(0x7f6860117c28, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
42410 <... futex resumed> )             = 0
42403 futex(0x7f686000a154, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, {1412848310, 969585000}, ffffffff <unfinished ...>
42411 <... futex resumed> )             = 0
42411 futex(0x7f6860117c54, FUTEX_WAIT_PRIVATE, 1773, NULL <unfinished ...>
42410 futex(0x7f6860115254, FUTEX_WAIT_PRIVATE, 1715, NULL <unfinished ...>
42413 <... futex resumed> )             = -1 ETIMEDOUT (Connection timed out)
42413 futex(0x7f6860124c28, FUTEX_WAKE_PRIVATE, 1) = 0
42413 futex(0x7f6860124c54, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, {1412848310, 965379000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
42413 futex(0x7f6860124c28, FUTEX_WAKE_PRIVATE, 1) = 0
42413 futex(0x7f6860124c54, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, {1412848311, 15695000}, ffffffff <unfinished ...>
42403 <... futex resumed> )             = -1 ETIMEDOUT (Connection timed out)
42403 futex(0x7f686000a128, FUTEX_WAKE_PRIVATE, 1) = 0
42403 futex(0x7f686000a154, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, {1412848311, 75280000}, ffffffff <unfinished ...>
42418 read(154, "TTATTAAAAATATGCAAAAAGAACTTCGGGATGATACCAGGATGTCGTCGCCGAA\nACTCAAGGCTTTGTTCGAAAAGAGGTTTTATATAGGCAATCGAGTCATCCACAAAGAAAT\nAAACCCCATGTAATATAATAAAAAGGCAAAAGTGTAATCTATTGTAACTATGTGTTCTCT\nAAATGTAGCAAATGAACAAGCAACATCAAGCCTGTCCCGATTTATCCTGACCACCGCGAA\nTTATATATATATATATATA\n>c122979_g1_i1 len=409 path=[589:0-78 668:79-185 203:186-408]\nGTTGTAGTTGACAGTTATCGCGAATAGCAACGAGAGAGCACAGAGCATTTATGGGAATAC\nGAAGCTATTTATGTGACATCTTATAAAGAACCTCAGCTAAATCTAGCTGACGCGGTCGGT\nGTTGGCGAACTTCAGCTGGCGAAAGCAGAAGGCGTAAATGAAAACACGCATCTCATCTTT\nACATCTCTCCAACTGGGCGTGTTGTGATTTGATGATATTAATCACCATTTCTCGCTCCTG\nTCATTCGTCATTGCAATACATTTAGCATAAGAGAGCTCTTTAAAAATGACTGTAAAATCC\nTTGCCCATTTATTACTAGGTTGATAGCAATAAACGATGACCAGGTTAATCGCTAGTCCAA\nCTAAACCACTTAAATAAATAACTATGAAAATTAAAAATCTGTATTCGGA\n>c122980_g1_i1 len=383 path=[629:0-382]\nAGCAGATACTCTGATTTAAGCTTCGTGGATGGATTAAAGTGTGTGTGTTTTAAATGAAAC\nATGCCATATGGAGGGAAAGTGAGATCCGTTTTGTCCGTAAGACTTGATGCATTTGGTTAA\nAACCCGATCTCTTTGTTTCAAATTCACTAACAAGAACAGCTTTTAGTCTTAACCGAGCAT\nCATCGTGTGGCCTGACTTGTGAGGCCAAATGTCGCCTCTGCTTACTAGAACACTTACATA\nTAGCTTACAGTGAGGTTCCCTTAATCCCTTAGCGGCAAACAGTCTAAAAAATTGTTGGTT\nGCTAAAAATTGCTGCAAATAACATTATTAGTTGTAATGTCTGCAACTAAGAAGCGCCGAA\nCTGGAAGCATTTCTCACCATCCA\n>c122982_g1_i1 len=398 path=[1:0-397]\nCGGGTATTAGACTTAATTTGCAATCCTTTTGTTTTGTTTCACTGTAAAAGAAATCAGCCT\nTTATATCACGAACTCCACGATGCACTTGATCTCGATGTTTCGAAAAAACC

Original issue reported on code.google.com by rob234k...@gmail.com on 9 Oct 2014 at 10:08

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
Even though the file above shows one embedded worker, we normally have it set to 9; I only 
changed it to one to check whether the behaviour was the same.

Original comment by rob234k...@gmail.com on 9 Oct 2014 at 10:11

GoogleCodeExporter commented 9 years ago

Original comment by Maxim.Sc...@gmail.com on 9 Oct 2014 at 2:22

GoogleCodeExporter commented 9 years ago
Hi Rob,

Did you manage to solve this problem? We are also facing the same issue: the 
InterProScan run keeps going indefinitely for 3000 nucleotide transcripts. We ran this in 
cluster mode on six nodes.

Thanks,
Reema

Original comment by reemasin...@gmail.com on 5 Nov 2014 at 4:11

GoogleCodeExporter commented 9 years ago
Hi Reema,

Nope. On their website they say that for transcripts from a Trinity assembly you need to 
break the file up into smaller files (e.g. 3000 sequences each) and submit those. I found 
that using TransDecoder to get the protein sequences first, rather than letting 
InterProScan find and filter the ORFs itself, was very much quicker, but I think the IDs 
need correcting afterwards so that they match the original transcript IDs when importing 
into something like Blast2GO. I didn't pursue it any further for the moment, as we 
currently advise not running it on a whole Trinity assembly but only on smaller subsets 
identified as differentially expressed, if more than BLAST annotation is wanted.
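
A rough sketch of that TransDecoder-first route (command names, flags and output file 
names are from memory of current TransDecoder/InterProScan releases, so treat them as 
assumptions and check them against your installed versions):

# 1. Predict coding regions / peptides from the Trinity assembly
TransDecoder.LongOrfs -t Trinity.fasta
TransDecoder.Predict -t Trinity.fasta

# 2. Run InterProScan on the predicted proteins instead of the nucleotide assembly
#    (depending on versions you may need to strip '*' stop-codon characters from the .pep file first)
./interproscan.sh -i Trinity.fasta.transdecoder.pep -f tsv,xml

# 3. Map the TransDecoder ORF IDs back to the original transcript IDs before
#    importing into e.g. Blast2GO.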

Best wishes

Rob

Original comment by rob234k...@gmail.com on 5 Nov 2014 at 4:23

GoogleCodeExporter commented 9 years ago
Hi Rob, Hi Reema,
Sorry for not replying sooner.

One bottleneck when analysing huge amounts of nucleotide sequences with InterProScan (I5) 
seems to be the ORF prediction step using EMBOSS getorf. I would say the way this step is 
integrated in InterProScan is not the most efficient: it does not split the input file 
into chunks and spawn multiple getorf jobs, it runs getorf against the entire input 
sequence file instead. This happens on one worker only, so for this step it would make no 
difference if you changed the settings file and increased the number of embedded workers.
As we don't run nucleotide sequence analyses internally, we do not have any figures on the 
performance of this step, but as Rob already suggested, using your own ORF prediction 
implementation speeds things up; you would then have to do the ID mapping on your own.
In addition, as already mentioned, chunking the file before even submitting it to I5 
should help as well (see the sketch below).
You can also change the minimum length of the predicted ORFs in the InterProScan settings 
file. The attribute is called 'getorf.minsize'.
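
As an illustration only (the file name prefix and chunk size are arbitrary examples, not 
part of I5), one simple way to chunk a nucleotide FASTA file before submitting it:

# split Trinity.fasta into chunks of 3000 sequences each:
# Trinity.chunk_001.fasta, Trinity.chunk_002.fasta, ...
awk -v size=3000 '
  /^>/ { if (n % size == 0) { if (out) close(out); chunk++;
         out = sprintf("Trinity.chunk_%03d.fasta", chunk) }
         n++ }
  { print > out }
' Trinity.fasta

Each chunk can then be submitted to I5 as a separate run (or as separate cluster jobs) and 
the result files concatenated afterwards.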

Best,
Maxim

Original comment by Maxim.Sc...@gmail.com on 5 Nov 2014 at 4:53

GoogleCodeExporter commented 9 years ago
To give you an example of InterProScan's run time: in-house we are able to annotate a 
complete Escherichia coli proteome (~3,000 protein sequences) on our farm (in CLUSTER 
mode) within ~3 hours.
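
The exact invocation depends on your installation, but a CLUSTER mode run looks roughly 
like the following (flag names taken from the I5 documentation rather than from this 
benchmark, and the file names are only placeholders; check interproscan.sh --help in your 
release for the exact option names):

./interproscan.sh -mode cluster -clusterrunid ecoli_benchmark -i ecoli_proteome.fasta -f tsv

The qsub/bsub submission commands and grid.name that such a run uses come from the 
interproscan.properties file, as in the file quoted above.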

Original comment by Maxim.Sc...@gmail.com on 5 Nov 2014 at 5:02

GoogleCodeExporter commented 9 years ago
Maxim, could you indicate which parameter settings in the interproscan.properties file 
were used to obtain the calculation time that you mention for the E. coli proteome?

In other words, which parameters (grid.jobs.limit, worker.number.of.embedded.workers, 
master.maxconsumers, max.tier.depth ...) should be changed to speed up the process?

Original comment by stefanie...@gmail.com on 6 Nov 2014 at 9:08

GoogleCodeExporter commented 9 years ago
Will post the parameter settings soon.

Original comment by Maxim.Sc...@gmail.com on 6 Nov 2014 at 9:43

GoogleCodeExporter commented 9 years ago
I have set up a page to document my CLUSTER mode benchmark runs, including the 
configuration and information about the run environment:
https://code.google.com/p/interproscan/wiki/ClusterModeBenchmarkRun

Original comment by Maxim.Sc...@gmail.com on 14 Nov 2014 at 11:26

GoogleCodeExporter commented 9 years ago
I might set up something similar for STANDALONE mode. Some description of how to improve 
performance in STANDALONE mode can be found here:
https://code.google.com/p/interproscan/wiki/ImprovingPerformance
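
For STANDALONE mode the main settings are the embedded-worker counts already shown in the 
properties file in this issue; as an illustrative example only (the right values depend 
on your machine and are not a recommendation):

number.of.embedded.workers=8
maxnumber.of.embedded.workers=8

The per-analysis CPU switches in the same file (e.g. hmmer3.hmmsearch.cpu.switch.pfama=--cpu 8) 
control how many threads each individual binary is allowed to use.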

Original comment by Maxim.Sc...@gmail.com on 14 Nov 2014 at 11:32

GoogleCodeExporter commented 9 years ago

Original comment by Maxim.Sc...@gmail.com on 30 Jan 2015 at 2:23