rdpstaff / classifier

RDP extensible sequence classifier for fungal lsu, bacterial and archaeal 16s
GNU General Public License v2.0

Classifier and CPU consumption #9

Open fescudie opened 9 years ago

fescudie commented 9 years ago

Hi,

When I use the RDP Classifier with my own databank (a very large 16S databank), its CPU usage is unacceptable: up to 2360% (see below). This phenomenon does not appear with the default databank, and it is much weaker with the example databank from the RDP "train classifier" instructions. How can I reduce the CPU consumption / number of threads of the RDP Classifier?

Command with my databank:

java -Xmx15g -jar path/to/classifier.jar classify -c 0.8 -t path/to/my_bank.properties -o result.rdp sub.fasta

Consumption:

top - 09:51:00 up 56 days, 22:36,  0 users,  load average: 15.10, 23.87, 20.76
Tasks: 840 total,  11 running, 829 sleeping,   0 stopped,   0 zombie
Cpu(s): 81.2%us,  0.2%sy,  0.0%ni, 18.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  264438700k total, 84939736k used, 179498964k free,   174172k buffers
Swap: 16777208k total,    36100k used, 16741108k free, 64703676k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                        
 65765 fescudie  20   0 18.4g 6.4g  10m S 2360.5  2.6   4:59.91 java                                                                                                                                         
 65850 fescudie  20   0 13684 1776  880 R  0.7  0.0   0:00.05 top                                                                                                                                            
 65432 fescudie  20   0  104m 1948 1408 S  0.0  0.0   0:00.15 bash 

Consumption with threads:

top - 10:33:10 up 56 days, 23:18,  0 users,  load average: 14.83, 10.51, 10.28
Tasks: 1305 total,  11 running, 1294 sleeping,   0 stopped,   0 zombie
Cpu(s): 41.4%us,  2.5%sy,  0.0%ni, 56.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  264438700k total, 83889500k used, 180549200k free,   174876k buffers
Swap: 16777208k total,    36100k used, 16741108k free, 64773160k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                        
 66871 fescudie  20   0 18.4g 5.3g   9m R 70.3  2.1   0:16.90 java                                                                                                                                           
 66876 fescudie  20   0 18.4g 5.3g   9m S 29.7  2.1   0:02.20 java                                                                                                                                           
 66889 fescudie  20   0 18.4g 5.3g   9m S 29.7  2.1   0:02.31 java                                                                                                                                           
 66891 fescudie  20   0 18.4g 5.3g   9m S 29.7  2.1   0:02.22 java                                                                                                                                           
 66897 fescudie  20   0 18.4g 5.3g   9m S 29.7  2.1   0:02.27 java                                                                                                                                           
 66878 fescudie  20   0 18.4g 5.3g   9m S 29.4  2.1   0:02.05 java                                                                                                                                           
 66879 fescudie  20   0 18.4g 5.3g   9m S 29.4  2.1   0:02.01 java                                                                                                                                           
 66881 fescudie  20   0 18.4g 5.3g   9m S 29.4  2.1   0:02.07 java                                                                                                                                           
 66882 fescudie  20   0 18.4g 5.3g   9m S 29.4  2.1   0:02.13 java                                                                                                                                           
 66884 fescudie  20   0 18.4g 5.3g   9m S 29.4  2.1   0:01.99 java                                                                                                                                           
 66886 fescudie  20   0 18.4g 5.3g   9m S 29.4  2.1   0:02.19 java                                                                                                                                           
 66890 fescudie  20   0 18.4g 5.3g   9m S 29.4  2.1   0:02.12 java                                                                                                                                           
 66892 fescudie  20   0 18.4g 5.3g   9m S 29.4  2.1   0:02.16 java                                                                                                                                           
 66893 fescudie  20   0 18.4g 5.3g   9m S 29.4  2.1   0:02.29 java                                                                                                                                           
 66894 fescudie  20   0 18.4g 5.3g   9m S 29.4  2.1   0:01.68 java                                                                                                                                           
 66895 fescudie  20   0 18.4g 5.3g   9m S 29.4  2.1   0:02.04 java                                                                                                                                           
 66896 fescudie  20   0 18.4g 5.3g   9m S 29.4  2.1   0:02.27 java                                                                                                                                           
 66898 fescudie  20   0 18.4g 5.3g   9m S 29.4  2.1   0:02.11 java                                                                                                                                           
 66875 fescudie  20   0 18.4g 5.3g   9m S 29.1  2.1   0:02.22 java                                                                                                                                           
 66877 fescudie  20   0 18.4g 5.3g   9m S 29.1  2.1   0:02.26 java                                                                                                                                           
 66899 fescudie  20   0 18.4g 5.3g   9m S 29.1  2.1   0:02.26 java                                                                                                                                           
 66885 fescudie  20   0 18.4g 5.3g   9m S 28.7  2.1   0:02.13 java                                                                                                                                           
 66880 fescudie  20   0 18.4g 5.3g   9m S 28.4  2.1   0:02.19 java                                                                                                                                           
 66874 fescudie  20   0 18.4g 5.3g   9m S 28.1  2.1   0:02.01 java                                                                                                                                           
 66872 fescudie  20   0 18.4g 5.3g   9m S 26.8  2.1   0:01.99 java                                                                                                                                           
 66873 fescudie  20   0 18.4g 5.3g   9m S 26.1  2.1   0:02.00 java                                                                                                                                           
 66883 fescudie  20   0 18.4g 5.3g   9m S 24.1  2.1   0:02.03 java                                                                                                                                           
 66888 fescudie  20   0 18.4g 5.3g   9m S 22.1  2.1   0:01.62 java                                                                                                                                           
 66887 fescudie  20   0 18.4g 5.3g   9m S 21.8  2.1   0:01.92 java                                                                                                                                           
 66912 fescudie  20   0 14080 2168  884 R  1.0  0.0   0:00.11 top                                                                                                                                            
 65432 fescudie  20   0  104m 1948 1408 S  0.0  0.0   0:00.44 bash                                                                                                                                           
 66870 fescudie  20   0 18.4g 5.3g   9m S  0.0  2.1   0:00.00 java                                                                                                                                           
 66900 fescudie  20   0 18.4g 5.3g   9m S  0.0  2.1   0:00.00 java                                                                                                                                           
 66901 fescudie  20   0 18.4g 5.3g   9m S  0.0  2.1   0:00.00 java                                                                                                                                           
 66902 fescudie  20   0 18.4g 5.3g   9m S  0.0  2.1   0:00.00 java                                                                                                                                           
 66903 fescudie  20   0 18.4g 5.3g   9m S  0.0  2.1   0:00.00 java                                                                                                                                           
 66904 fescudie  20   0 18.4g 5.3g   9m S  0.0  2.1   0:00.10 java                                                                                                                                           
 66905 fescudie  20   0 18.4g 5.3g   9m S  0.0  2.1   0:00.09 java                                                                                                                                           
 66906 fescudie  20   0 18.4g 5.3g   9m S  0.0  2.1   0:00.00 java                                                                                                                                           
 66907 fescudie  20   0 18.4g 5.3g   9m S  0.0  2.1   0:00.00 java  

Command with RDP default databank:

java -Xmx15g -jar path/to/classifier.jar classify -c 0.8 -o result.rdp sub.fasta

Consumption:

top - 09:53:41 up 56 days, 22:39,  0 users,  load average: 9.96, 17.82, 18.93
Tasks: 840 total,  10 running, 830 sleeping,   0 stopped,   0 zombie
Cpu(s): 25.0%us,  0.0%sy,  0.0%ni, 75.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  264438700k total, 78978564k used, 185460136k free,   174216k buffers
Swap: 16777208k total,    36100k used, 16741108k free, 64768832k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                        
 65863 fescudie  20   0 18.4g 703m  10m S 100.1  0.3   1:18.87 java                                                                                                                                          
 65917 fescudie  20   0 13684 1784  880 R  0.3  0.0   0:00.36 top                                                                                                                                            
 65432 fescudie  20   0  104m 1948 1408 S  0.0  0.0   0:00.16 bash      

Command with 'Example command to train classifier':

java -Xmx1g -jar path/to/classifier.jar train -o mytrained -s path/to/RDPTools/classifier/samplefiles/new_trainset.fasta -t path/to/RDPTools/classifier/samplefiles/new_trainset_db_taxid.txt
cp path/to/RDPTools/classifier/samplefiles/rRNAClassifier.properties mytrained
java -Xmx15g -jar path/to/classifier.jar classify -c 0.8 -t mytrained/rRNAClassifier.properties -o result.rdp sub.fasta

Consumption:

top - 10:23:54 up 56 days, 23:09,  0 users,  load average: 9.19, 8.95, 10.32
Tasks: 840 total,  10 running, 830 sleeping,   0 stopped,   0 zombie
Cpu(s): 25.5%us,  0.1%sy,  0.0%ni, 74.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  264438700k total, 78953232k used, 185485468k free,   174720k buffers
Swap: 16777208k total,    36100k used, 16741108k free, 64773140k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                        
 66617 fescudie  20   0 18.4g 590m  10m S 120.6  0.2   0:29.30 java                                                                                                                                          
 66655 fescudie  20   0 13684 1784  884 R  0.7  0.0   0:00.13 top                                                                                                                                            
 65432 fescudie  20   0  104m 1948 1408 S  0.0  0.0   0:00.30 bash

Thanks in advance.

wangqion commented 9 years ago

The CPU usage is likely proportional to the number of terminal taxa in the training set. There are about 2000 genera (terminal taxa) in the default 16S taxonomy. You can request less than 1 GB of memory. Do you know how many terminal taxa are in your training set?

fescudie commented 9 years ago

My taxonomy contains 82561 terminal taxa. The memory cannot be reduced because the taxonomy is very large: when I reduced it, the classifier returned an out-of-memory error. The threads seem to be opened only while the classifier loads the taxonomy, not during classification. Why are these threads necessary? Is it not possible to load the taxonomy with a single thread? For now I work around the problem with taskset, which makes all the threads run on the same CPU, but this is not the best solution.
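
For readers who want to reproduce the taskset workaround mentioned in this comment, here is a minimal sketch. It assumes a Linux host with the util-linux `taskset` tool; `pinned_cmd` is a hypothetical helper written for illustration, the paths are the placeholders used earlier in this thread, and the core list is an arbitrary choice. The helper only prints the command so it can be reviewed before being run by hand.

```shell
#!/bin/sh
# Sketch: build a classifier command pinned to a fixed set of cores with taskset.
# pinned_cmd is a hypothetical helper; its argument is a taskset CPU list
# such as "0" (one core) or "0-3" (four cores).
pinned_cmd() {
    cores="$1"
    echo "taskset -c $cores java -Xmx15g -jar path/to/classifier.jar classify -c 0.8 -t path/to/my_bank.properties -o result.rdp sub.fasta"
}

# Print the command for a single core; review it, then run it by hand.
pinned_cmd 0
```

Pinning does not remove the extra threads; it only restricts the cores they may be scheduled on, so a list such as `0-3` would cap the process at four cores instead of one.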

wangqion commented 9 years ago

We haven't used the Classifier with such a large number of terminal taxa. The largest one we have is the Fungal ITS UNITE training set, containing 20,221 species. The Classifier uses a single thread, but Java garbage collection may use more threads when memory is short. Could you try 30 or 40 GB of memory, just to see whether the number of threads being used goes down?

Qiong
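
If Java garbage collection is indeed responsible for the extra threads, as suggested above, one possible experiment is to limit the collector directly instead of adding memory. The sketch below is an assumption, not something proposed or tested by anyone in this thread: `-XX:+UseSerialGC` and `-XX:ParallelGCThreads=N` are standard HotSpot JVM options, `gc_limited_cmd` is a hypothetical helper, and the paths are the placeholders used earlier. It only prints the candidate commands.

```shell
#!/bin/sh
# Sketch: the classify command from this thread with the JVM garbage collector restricted.
# gc_limited_cmd is a hypothetical helper: "serial" selects the single-threaded collector,
# any other argument is used as the thread count for -XX:ParallelGCThreads.
gc_limited_cmd() {
    if [ "$1" = "serial" ]; then
        gc_opts="-XX:+UseSerialGC"
    else
        gc_opts="-XX:ParallelGCThreads=$1"
    fi
    echo "java -Xmx15g $gc_opts -jar path/to/classifier.jar classify -c 0.8 -t path/to/my_bank.properties -o result.rdp sub.fasta"
}

# Print two variants to compare against the unrestricted run.
gc_limited_cmd serial
gc_limited_cmd 2
```

Comparing the `top` output of these variants with the original run would show whether the collector accounts for the load; if CPU usage stays high, the threads come from somewhere else.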

fescudie commented 9 years ago

With 100GB I have the same problem. I will continue to use taskset.

wangqion commented 9 years ago

This is interesting. I am wondering if you would like to share your training files with me for further debugging.

Qiong


fescudie commented 9 years ago

You can get the training files at this URL: http://genoweb.toulouse.inra.fr/~fescudie/. But this phenomenon is already present with the example training dataset in RDPTools (120% CPU):

java -Xmx1g -jar path/to/classifier.jar train -o mytrained -s path/to/RDPTools/classifier/samplefiles/new_trainset.fasta -t path/to/RDPTools/classifier/samplefiles/new_trainset_db_taxid.txt
cp path/to/RDPTools/classifier/samplefiles/rRNAClassifier.properties mytrained
java -Xmx15g -jar path/to/classifier.jar classify -c 0.8 -t mytrained/rRNAClassifier.properties -o result.rdp sub.fasta

Consumption:

top - 10:23:54 up 56 days, 23:09,  0 users,  load average: 9.19, 8.95, 10.32
Tasks: 840 total,  10 running, 830 sleeping,   0 stopped,   0 zombie
Cpu(s): 25.5%us,  0.1%sy,  0.0%ni, 74.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  264438700k total, 78953232k used, 185485468k free,   174720k buffers
Swap: 16777208k total,    36100k used, 16741108k free, 64773140k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                        
 66617 fescudie  20   0 18.4g 590m  10m S 120.6  0.2   0:29.30 java                                                                                                                                          
 66655 fescudie  20   0 13684 1784  884 R  0.7  0.0   0:00.13 top                                                                                                                                            
 65432 fescudie  20   0  104m 1948 1408 S  0.0  0.0   0:00.30 bash