Finished as it should, but generated files were too small

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. I ran blastp separately for each strain as follows:
ncbi-blast-2.2.29+/bin/blastp -evalue 1e-05 -outfmt 6 -query 0001.fasta -db genesOfAllStrainsDB -out genesOfstrain0001_vs_all_blastp.txt -num_alignments 100000 -num_threads 12
Then I combined the results of all alignments with:
cat *_vs_all_blastp.txt > all_vs_all_blastp.txt
2. I used more than 80 bacterial genomes and ran orthAgogue with the 
following command: orthAgogue -i ../align/all_vs_all_blastp.txt -u -o 50 -c 10. 
When I checked some time later the program had finished, but the generated 
files were very small; for example, all.abc was about 2 MB for about 230000 
proteins. 
3. I re-ran the same command and this time got a 590 MB all.abc file. I find 
this very strange; I haven't met such a problem before. Is there a way to know 
whether it worked properly this time? To check, I counted the unique protein 
names in the 1st column of the proteins.map file, which gave 228877, while the 
number of genes in the fasta file is 228879. So it looks OK now. I obtained the 
gene count in the fasta file with: grep '>' all_genes.fasta|wc -l and the 
number of genes in proteins.map with: cut -f1 proteins.map|sort|uniq|wc -l .
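
Step 1 above can be scripted as a loop. The sketch below is only illustrative: 
the function name run_all_blastp and the BLASTP variable are hypothetical, the 
blastp flags are the ones quoted in step 1, and the output name pattern is 
assumed to follow genesOfstrain<ID>_vs_all_blastp.txt. The final cat uses a 
narrower glob than *_vs_all_blastp.txt, because all_vs_all_blastp.txt itself 
matches that pattern when the command is re-run in the same directory.

```shell
# Path to blastp; override by setting BLASTP before calling the function.
BLASTP=${BLASTP:-ncbi-blast-2.2.29+/bin/blastp}

# Usage: run_all_blastp <blast_db> <strain1.fasta> [<strain2.fasta> ...]
run_all_blastp() {
    db=$1
    shift
    for query in "$@"; do
        # 0001.fasta -> 0001
        strain=$(basename "$query" .fasta)
        "$BLASTP" -evalue 1e-05 -outfmt 6 -query "$query" -db "$db" \
            -out "genesOfstrain${strain}_vs_all_blastp.txt" \
            -num_alignments 100000 -num_threads 12 || return 1
    done
    # Narrow glob: the combined file does not match its own input pattern.
    cat genesOfstrain*_vs_all_blastp.txt > all_vs_all_blastp.txt
}
```

The function stops at the first failing blastp call, so a partially written 
combined file is not produced silently.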

What is the expected output? What do you see instead?
I expected a fairly large file after the 1st run, but only the 2nd run gave me 
a large file. I don't know whether orthAgogue worked correctly this time.

What version of the product are you using? On what operating system?
I use version 1.0.3 of orthAgogue. The operating system is Ubuntu 14.04.1 LTS.

Please provide any additional information below.
How can I check whether orthAgogue worked properly? Can I take the number of 
genes in the proteins.map file and check that it equals the number of genes 
that were used in the blastp alignment? It looks OK for this project, but in 
another project, where I ran orthAgogue earlier, the number of genes in the 
fasta file was 460565 while the number of protein ids in the proteins.map file 
was 456393. So about 4000 genes are missing from proteins.map there. More 
genomes were used for that project, so the resulting blastp file was around 
21 GB. Thank you for all.
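
To see which genes are missing rather than only how many, the two ID lists can 
be compared directly. This is a minimal sketch, not part of orthAgogue: 
missing_ids is a hypothetical helper, and it assumes that the first 
whitespace-delimited token after '>' in each fasta header is exactly the name 
that appears in column 1 of proteins.map.

```shell
# Usage: missing_ids <all_genes.fasta> <proteins.map>
# Prints every fasta ID that has no entry in column 1 of proteins.map.
missing_ids() {
    mi_tmp=$(mktemp -d) || return 1
    # First token of each header line, without the leading '>'.
    sed -n 's/^>\([^[:space:]]*\).*/\1/p' "$1" | LC_ALL=C sort -u > "$mi_tmp/fasta.ids"
    # Column 1 of the tab-separated map file.
    cut -f1 "$2" | LC_ALL=C sort -u > "$mi_tmp/map.ids"
    # Lines only in the first list: present in the fasta, absent from the map.
    LC_ALL=C comm -23 "$mi_tmp/fasta.ids" "$mi_tmp/map.ids"
    rm -rf "$mi_tmp"
}
```

For example, missing_ids all_genes.fasta proteins.map | wc -l gives the number 
of absent genes, and the list itself can be checked against the blastp output.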

Original issue reported on code.google.com by jbay...@gmail.com on 18 Oct 2014 at 1:18

GoogleCodeExporter commented 9 years ago

Hi,

Thanks for an interesting issue! ;)
----
3. I re-run the same command again and I got now 590Mb all.abc file. I find 
this very strange. I haven't met such problem yet. Is there a way to know if it 
worked now properly. For this, I get number of unique protein names using the 
1st column of proteins file, which is 228877 and number of genes in fasta file 
is 228879. So, it looks like OK now. I obtained gene count in a fasta file as 
following: grep '>' all_genes.fasta|wc -l and number of genes in proteins.map 
was obtained as following: cut -f1 proteins.map|sort|uniq|wc -l .
----
-- If I understand you correctly, the same call to orthAgogue gives different 
results. (If this interpretation of your words is wrong, then please let me 
know.) Assuming my understanding is correct, this indicates that there are 
errors in the parallelisation in orthAgogue. To help investigate the issue 
(which I've now marked as critical), could you:
(1) try different numbers of CPUs, and 
(2) if this yields different results, then 
(a) download the source, 
(b) install orthAgogue using the "install_debug.bash" script,
(c) try different numbers of CPUs, and then send me the results found in 
report_orthAgogue/ ?

-- If you could do this, I hope to identify where the bug is located (e.g., in 
the parsing or in the computation of putative orthologs).

-------------------------
I expected somewhat large file after the 1st run, but then in the 2nd run I get 
larger file. Don't know if orthAgogue worked correctly now.
-------------------------
-- The results seem strange, though finding bugs in parallel applications is 
always difficult, i.e., I'd be thankful if you could help me solve this issue.

-----------
Please provide any additional information below.
How can I check if orthagogue worked properly? Can I use number of genes in 
proteins.map file to check if it equals to number of genes that were used in 
blastp alignment? Because it looks OK for this project in another project I run 
orthagogue earlier number of genes in fasta file was 460565 but number of 
protein ids in proteins.map file was 456393. Here I see about 4000 genes 
missing proteins.map and for this project more genomes were used, therefore 
resulting blastp file was around 21Gb. Thank you for all.
-----------
-- Thanks for your question: regarding the causes and effects of the bug, we 
will only know once we've investigated the issue. My first assumption is that 
the bug is in the parsing, in which case the result will look correct, i.e., 
investigating the result file may be of no help if the bug is in a different 
part of the orthAgogue pipeline. 
-- In order to get an idea of where the bug is located (and its effect): if 
you run the software after having installed it using "./install_debug.sh", then 
the file "report_orthAgogue/list_file_parse.log.*" will give you the number of 
relations before filtering with respect to orthologs, while 
"report_orthAgogue/taxa_list.log" gives a generalized summary of the result. My 
hope is that a comparison of the one-CPU case with the other 'cpu cases' will 
help us to identify the location of the bug.
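
One way to compare the output of two such runs is to compare the files after 
sorting, so that a difference in line order alone is not reported as a 
difference. This is only a sketch under assumptions: same_content is a 
hypothetical helper, the directory names in the usage comment are made up, and 
whether the line order of orthAgogue's output varies between runs is not 
confirmed here.

```shell
# Usage: same_content <file_a> <file_b>
# Returns 0 if both files contain the same lines (in any order),
# 1 if they differ, 2 if either file is not readable.
same_content() {
    if [ ! -r "$1" ] || [ ! -r "$2" ]; then
        return 2
    fi
    # cksum on stdin prints a CRC and a byte count, with no file name.
    sum_a=$(LC_ALL=C sort "$1" | cksum)
    sum_b=$(LC_ALL=C sort "$2" | cksum)
    [ "$sum_a" = "$sum_b" ]
}

# Example with hypothetical run directories:
# same_content run_1cpu/all.abc run_12cpu/all.abc && echo same || echo differ
```

The same helper can be pointed at the log files in report_orthAgogue/ from 
each run, provided those logs do not contain timestamps or other run-specific 
lines.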

Again many thanks for making me aware of a possible bug in the parallelisation. 
I hope this answer at least clarified some points in your issue: looking 
forward to your feedback, and again many thanks for your help in posting it! 

PS: There might be other reasons for this error, so if you could regard my 
assertions as an initial hypothesis, I'd be thankful ;)

Best,

Ole Kristian,
Developer of orthAgogue

Original comment by oeks...@gmail.com on 18 Oct 2014 at 4:34

Added labels: Priority-Critical
Removed labels: Priority-Medium

GoogleCodeExporter commented 9 years ago

Dear Ole,
Thank you for everything, and especially for your very quick answer. Apologies 
for the delay: I had a vacation and other tasks and, above all, I couldn't 
reproduce this issue.
Yes, that was the correct interpretation of what I said. I think it was a 
problem on my system: several other heavy-duty tasks were running at the same 
time, so the process could have been killed by the system, or there were not 
enough resources (memory, HDD, etc.). Anyway, I can't reproduce it, so I assume 
this issue is closed. 

I wanted to inform you that I am going to use the following criterion to decide 
whether orthAgogue worked correctly or not. The criterion is: "the number of 
proteins in the proteins.map file must be equal to or slightly (only a couple 
of genes) less than the total number of proteins that are present in all 
genomes that were used in orthology prediction". 
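
This criterion can be written as a small check. The sketch below is an 
illustration only: check_protein_counts is a hypothetical name, the default 
tolerance of 2 is my reading of "a couple of genes", and the counts are taken 
the same way as in the original report (header lines in the fasta file, unique 
values in column 1 of proteins.map).

```shell
# Usage: check_protein_counts <all_genes.fasta> <proteins.map> [tolerance]
# Succeeds when proteins.map has at most <tolerance> fewer unique IDs than
# the fasta file has sequences, and never more.
check_protein_counts() {
    tolerance=${3:-2}
    n_fasta=$(grep -c '^>' "$1")
    n_map=$(cut -f1 "$2" | sort -u | wc -l | tr -d ' ')
    missing=$((n_fasta - n_map))
    echo "fasta=$n_fasta map=$n_map missing=$missing"
    [ "$missing" -ge 0 ] && [ "$missing" -le "$tolerance" ]
}
```

With the numbers from the first report (228879 and 228877) the check passes 
with the default tolerance; with those from the larger project (460565 and 
456393) it fails.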
Thank you for this fast program.
Best regards,
Juma

Original comment by jbay...@gmail.com on 11 Nov 2014 at 12:01

ghg296 / orthagogue

Finished as it should, but generated files were too small #6