Open fescudie opened 9 years ago
The CPU usage is likely proportional to the number of terminal taxa in the training set. There are about 2000 genera (terminal taxa) in the default 16S taxonomy. You can request less than 1 GB of memory. Do you know how many terminal taxa are in your training set?
My taxonomy contains 82561 terminal taxa. The memory cannot be reduced because the taxonomy is very large: when I reduced the memory, the classifier returned an out-of-memory error. The threads seem to be opened only while the classifier loads the taxonomy, not during classification. Why are these threads necessary? Is it not possible to load the taxonomy with a single thread? For now I work around the problem with taskset, which makes all the threads run on the same CPU, but this is not the best solution.
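For readers who want to reproduce the taskset workaround described above, here is a minimal sketch. It assumes a Linux system with util-linux's taskset installed; the `pinned` helper and the `CPUS` variable are hypothetical names introduced only for illustration. Pinning to a single CPU caps the CPU usage, but any extra JVM threads then share that CPU with the classifier thread, so allowing two CPUs (for example `CPUS=0,1`) may be worth trying.

```shell
#!/bin/sh
# Sketch: restrict a command and all of its threads to a fixed set of CPUs.
# "taskset -c <cpu-list> <command>" is the util-linux CPU-affinity tool.
CPUS="${CPUS:-0}"   # hypothetical default: CPU 0 only

pinned() {
    # Run the given command with its CPU affinity limited to $CPUS.
    taskset -c "$CPUS" "$@"
}

# Example, using the placeholder paths from the commands in this thread:
# pinned java -Xmx15g -jar path/to/classifier.jar classify -c 0.8 \
#     -t mytrained/rRNAClassifier.properties -o result.rdp sub.fasta
```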
We haven't used the Classifier on such a large number of terminal taxa. The largest one we have is the Fungal ITS UNITE training set, containing 20,221 species. The Classifier uses a single thread, but Java garbage collection may use more threads if memory is short. Could you try 30 or 40 GB of memory, just to see whether the number of threads being used goes down?
Qiong
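If the extra threads do come from Java garbage collection, as suggested above, one thing to try is asking the JVM for its single-threaded collector. The sketch below only prints the resulting classify command so it can be inspected before running. `-XX:+UseSerialGC` is a standard HotSpot option (`-XX:ParallelGCThreads=N` is an alternative that caps the parallel collector instead); whether it removes the CPU usage reported in this issue is an assumption that has not been tested here. `build_classify_cmd` and `JAR` are hypothetical names.

```shell
#!/bin/sh
# Sketch: build a classify command that uses HotSpot's serial (single-threaded) GC.
# ASSUMPTION: the extra CPU usage is caused by parallel GC threads.
JAR="path/to/classifier.jar"   # placeholder path, as in the commands in this thread

build_classify_cmd() {
    # $1 = heap size (e.g. 15g); remaining arguments are passed to "classify".
    heap="$1"; shift
    echo "java -Xmx${heap} -XX:+UseSerialGC -jar ${JAR} classify $*"
}

# Print the command for review; run it once the paths are filled in.
build_classify_cmd 15g -c 0.8 -t mytrained/rRNAClassifier.properties -o result.rdp sub.fasta
```

A serial collector trades shorter wall-clock GC pauses for predictable CPU usage, which is usually the right trade on a shared cluster node.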
With 100 GB I have the same problem. I will continue to use taskset.
This is interesting. I am wondering if you would like to share your training files with me for further debugging.
Qiong
You can get the training files at this URL: http://genoweb.toulouse.inra.fr/~fescudie/. But this phenomenon is also present with the example training dataset in RDPTools (120% CPU):
java -Xmx1g -jar path/to/classifier.jar train -o mytrained -s path/to/RDPTools/classifier/samplefiles/new_trainset.fasta -t path/to/RDPTools/classifier/samplefiles/new_trainset_db_taxid.txt
cp path/to/RDPTools/classifier/samplefiles/rRNAClassifier.properties mytrained
java -Xmx15g -jar path/to/classifier.jar classify -c 0.8 -t mytrained/rRNAClassifier.properties -o result.rdp sub.fasta
Consumption:
top - 10:23:54 up 56 days, 23:09, 0 users, load average: 9.19, 8.95, 10.32
Tasks: 840 total, 10 running, 830 sleeping, 0 stopped, 0 zombie
Cpu(s): 25.5%us, 0.1%sy, 0.0%ni, 74.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 264438700k total, 78953232k used, 185485468k free, 174720k buffers
Swap: 16777208k total, 36100k used, 16741108k free, 64773140k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
66617 fescudie 20 0 18.4g 590m 10m S 120.6 0.2 0:29.30 java
66655 fescudie 20 0 13684 1784 884 R 0.7 0.0 0:00.13 top
65432 fescudie 20 0 104m 1948 1408 S 0.0 0.0 0:00.30 bash
Hi,
When I use RDP classifier with my own databank (a very large 16S databank), the CPU usage of RDP is unacceptable: up to 2360% (see below). This phenomenon doesn't appear with the default databank and is less pronounced with the databank provided in the RDP 'train classifier' example. How can I reduce the CPU consumption / number of threads of RDP classifier?
Command with my databank:
Consumption:
Consumption with threads:
Command with RDP default databank:
Consumption:
Command with 'Example command to train classifier':
Consumption:
Thanks in advance.