Prediction of spurious Orthologs in large datasets

GoogleCodeExporter commented 9 years ago

Dear Authors,

OrthAgogue is a really useful piece of software, especially in terms of its 
speed for generating orthology relationships in large datasets. My compliments! 
I ran OrthAgogue on a large dataset (all-vs-all BLAST of 1000 bacteria, 200 
archaea, and 50 eukaryotes), comprising ~20 million protein sequences. 
Afterwards I ran the MCL program on the OrthAgogue results to generate clusters 
of proteins. In general I am very happy with the results, except that in a few 
cases I find a particular problem:

Protein A of Organism X (a unicellular eukaryote) is clustered with bacterial 
sequences only. However, Protein A shows higher homology with proteins from 
Organisms P, Q, and R (unicellular eukaryotes), and the proteins from organisms 
P, Q, and R are present in a separate cluster with proteins from multicellular 
eukaryotic organisms. In this scenario I run the risk of falsely classifying 
Protein A as originating by lateral gene transfer from bacteria. Kindly help 
me resolve this issue.
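
A minimal sketch of how such cases could be listed systematically rather than spotted by eye. It assumes the MCL output written in `--abc` mode has one cluster per line with tab-separated labels, and that a mapping from protein label to domain is available; that mapping (`domain_of`), the function names, and the toy labels are hypothetical and not part of OrthAgogue or MCL.

```python
from collections import Counter


def parse_mcl_clusters(lines):
    """Turn MCL label output into a list of clusters (one line = one cluster)."""
    clusters = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            clusters.append(line.split("\t"))
    return clusters


def flag_lone_members(clusters, domain_of, lone="eukaryote", others="bacteria"):
    """Return (cluster_index, label) for every cluster that contains exactly
    one `lone` protein while all remaining members belong to `others`."""
    flagged = []
    for idx, members in enumerate(clusters):
        counts = Counter(domain_of.get(m, "unknown") for m in members)
        if (len(members) > 1
                and counts[lone] == 1
                and counts[others] == len(members) - 1):
            lone_label = next(m for m in members if domain_of.get(m) == lone)
            flagged.append((idx, lone_label))
    return flagged


# Toy data: the labels and the domain table are invented for illustration.
toy_lines = ["A\tb1\tb2\n", "P\tQ\tR\tm1\n"]
toy_domains = {"A": "eukaryote", "b1": "bacteria", "b2": "bacteria",
               "P": "eukaryote", "Q": "eukaryote", "R": "eukaryote",
               "m1": "eukaryote"}
print(flag_lone_members(parse_mcl_clusters(toy_lines), toy_domains))
```

Clusters flagged this way are only candidates for closer inspection; a lone eukaryotic member is not by itself evidence of either an error or a lateral gene transfer.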

The parameters I used to run OrthAgogue are --overlap 50 --use_scores

Original issue reported on code.google.com by projectb...@gmail.com on 15 May 2015 at 2:45

GoogleCodeExporter commented 9 years ago

Hi,

Thanks for both an interesting issue report and the kind words about our tool!

Regarding your question, I have requested some comments from one of my 
co-authors, Prof. M. Kuiper, on which my answer below is based. Based on those 
comments, our first impression (which may be wrong, as it is only a first 
impression ;) ) is that the subtleties you have observed could stem either from 
the tools used to construct the input or from the tools used to analyse the 
result of orthAgogue.

From your issue report, we infer that you have identified, for one particular 
gene, a co-clustering that is not as it should be. Assuming we have understood 
your report correctly, the gene does not join a larger set of similar genes 
from eukaryotes (to which it has a higher homology), but rather joins a 
prokaryote cluster.

Therefore, without knowing the actual levels of homology of these genes, it is 
difficult to say what the problem is, i.e., whether there is one at all. If the 
homologies themselves point to these cluster memberships, then one strategy 
would be to investigate the results produced by BLAST. If the homologies 
instead point to different cluster memberships, then it would be odd if 
orthAgogue could overrule that (and unclear how). Therefore, if the latter 
seems to be the case, might the problem lie in MCL?
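
One way to check which of the two situations applies is to compare, for each protein, the cluster of its best BLAST hit with its own cluster. The following is a rough sketch only: it assumes BLAST tabular output (`-outfmt 6`, bit score in the 12th column), clusters given as lists of labels, and identical label strings in both files. The function names and toy values are hypothetical.

```python
def best_hits(blast_lines):
    """Best non-self hit per query, ranked by bit score (column 12 of outfmt 6)."""
    best = {}
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or empty lines
        query, subject, bits = fields[0], fields[1], float(fields[11])
        if query == subject:
            continue
        if query not in best or bits > best[query][1]:
            best[query] = (subject, bits)
    return best


def cluster_index(clusters):
    """Map every label to the index of the cluster that contains it."""
    return {m: i for i, members in enumerate(clusters) for m in members}


def discordant_best_hits(best, member_cluster):
    """Queries whose best hit was placed in a different cluster."""
    out = []
    for query, (subject, bits) in sorted(best.items()):
        cq, cs = member_cluster.get(query), member_cluster.get(subject)
        if cq is not None and cs is not None and cq != cs:
            out.append((query, subject, bits))
    return out


def row(query, subject, bits):
    """Build a toy outfmt 6 line; all values except the bit score are filler."""
    return "\t".join([query, subject, "50.0", "100", "0", "0",
                      "1", "100", "1", "100", "1e-30", str(bits)])


toy_blast = [row("A", "P", 300), row("A", "b1", 120),
             row("b1", "A", 120), row("P", "Q", 400)]
toy_clusters = [["A", "b1"], ["P", "Q"]]
print(discordant_best_hits(best_hits(toy_blast), cluster_index(toy_clusters)))
```

A best hit in another cluster is not necessarily a mistake, since the best hit may be a paralog or may have been filtered out before clustering, but a list like this narrows down which step (BLAST, orthAgogue, or MCL) deserves a closer look.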

Given these thoughts, it would be interesting to get your feedback, both to 
resolve the issue and (hopefully) to acquire some new knowledge (of approaches 
in the field).

Best,

Ole Kristian Ekseth, 
developer of orthAgogue

Original comment by oeks...@gmail.com on 17 May 2015 at 12:23

GoogleCodeExporter commented 9 years ago

Thank you for the quick response. Here I elaborate on the parameters used in my 
analysis:

#BLASTp STEP
blastp -outfmt 6 -evalue 1e-5 -word_size 4 -threshold 18 -seg 'yes' 
-max_target_seqs 100000  -dbsize 2543962
#HERE DBSIZE REFERS TO THE NUMBER OF PROTEINS IN MY FASTA FILE

#OrthAgogue STEP ON THE BLAST OUTPUT
orthAgogue --seperator '|' --cpu 16 --overlap 50 --use_scores

#MCL STEP ON THE ORTHAGOGUE OUTPUT
mcl --abc -I 1.5

The programs were run on a Linux machine (Ubuntu 12.04 LTS) with 4 Intel Xeon 
1.2 GHz quad-core processors and 256 GB RAM. OrthAgogue took approximately 22 
minutes to finish the computation, using a maximum of 130 GB of RAM and all 
processing cores. Since the BLASTp output generated is huge (127 GB), I am not 
able to share it with you. However, I would like to illustrate my problem with 
the attached files.

I have a diatom protein (ID: 565099) and a bacterial protein (ID: 1055246) 
that are reported in the same cluster (after post-processing the OrthAgogue 
output with MCL). The proteins remain in the same cluster even with different 
MCL inflation parameters (1.2-1.5). Further, when I look into the raw all.abc 
OrthAgogue output, I find that the bacterial protein is assigned a weight 
against the diatom protein even though it has much higher-scoring hits against 
proteins from other organisms, such as metazoans (ID: 21263).
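
The comparison described above could be sketched as follows. This is only an illustration: it assumes all.abc has three whitespace-separated columns (label, label, weight), as in MCL's abc format, and that the labels match the IDs in the BLAST tabular output. The function names, labels, and numbers are invented placeholders, not values from my files.

```python
def edges_for(abc_lines, protein):
    """All graph edges touching `protein`, strongest weight first."""
    edges = []
    for line in abc_lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # not a 'label label weight' line
        a, b, weight = fields[0], fields[1], float(fields[2])
        if a == protein:
            edges.append((b, weight))
        elif b == protein:
            edges.append((a, weight))
    return sorted(edges, key=lambda e: e[1], reverse=True)


def hits_without_edge(blast_lines, abc_lines, protein):
    """BLAST hits of `protein` (outfmt 6) that have no edge in the abc graph,
    highest bit score first."""
    partners = {p for p, _ in edges_for(abc_lines, protein)}
    hits = {}
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12 or fields[0] != protein or fields[1] == protein:
            continue
        subject, bits = fields[1], float(fields[11])
        hits[subject] = max(hits.get(subject, 0.0), bits)
    missing = [(s, b) for s, b in hits.items() if s not in partners]
    return sorted(missing, key=lambda h: h[1], reverse=True)


# Placeholder data: one edge in the graph, two BLAST hits for the same query.
filler = ["50.0", "100", "0", "0", "1", "100", "1", "100", "1e-30"]
toy_abc = ["bact_1 diatom_1 0.8"]
toy_blast = ["\t".join(["bact_1", "meta_1"] + filler + ["500"]),
             "\t".join(["bact_1", "diatom_1"] + filler + ["150"])]
print(edges_for(toy_abc, "bact_1"))
print(hits_without_edge(toy_blast, toy_abc, "bact_1"))
```

A high-scoring hit that is missing from the graph presumably did not meet the criteria applied before the edge list is written (for example the `--overlap 50` threshold), which is the part I cannot verify myself.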

Please find attached the BLASTp output for the diatom and the bacterial 
proteins, as well as the corresponding subset of the all.abc output file. I 
hope it helps. Ideally, if you could provide a way for me to send my BLAST 
results, that would be perfect.

Finally, I must say that such problems are present only for a small fraction 
of my proteins; in fact, the majority of the protein clusters are well in 
accord with known taxonomy. My suspicion at the moment is that the large number 
of proteins from the bacterial phyla is creating the problem. To reduce 
complexity, I had clustered all bacterial proteomes within each phylum into a 
single dataset with CD-HIT. Hence all proteins of a bacterial phylum, say 
Firmicutes, are perceived as a single organism by orthAgogue.
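
To quantify how unbalanced the taxon labels are after this merging, the number of sequences per label can be counted from the FASTA headers. A small sketch, assuming headers of the form `>taxon|protein` with the taxon in the first field (matching the `'|'` separator above); the header layout, the function name, and the toy labels are assumptions for illustration.

```python
from collections import Counter


def taxon_counts(fasta_lines, separator="|", taxon_index=0):
    """Count sequences per taxon label found in FASTA header lines."""
    counts = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        parts = line[1:].split()
        if not parts:
            continue  # empty header
        fields = parts[0].split(separator)
        if len(fields) > taxon_index:
            counts[fields[taxon_index]] += 1
    return counts


# Toy headers: two sequences under a merged phylum label, one diatom sequence.
toy_fasta = [">firmicutes|p1", "MKV", ">firmicutes|p2", "MAL",
             ">diatomX|p3", "MST"]
for taxon, n in taxon_counts(toy_fasta).most_common():
    print(taxon, n)
```

If a merged phylum label turns out to hold far more sequences than any single eukaryotic proteome, that would at least be consistent with the suspicion above, though it would not prove it.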

Original comment by projectb...@gmail.com on 19 May 2015 at 10:20

Attachments:

files.zip

ghg296 / orthagogue

Prediction of spurious Orthologs in large datasets #7