mptrsen / Orthograph

Orthology prediction using a graph-based, reciprocal approach with profile hidden Markov models
GNU General Public License v3.0

Identical gene names in different official gene sets lead to an omission of these genes #37

Open cmayer opened 3 years ago

cmayer commented 3 years ago

Especially in vertebrate genome projects, it is common for the same gene name to be used in different species, but Orthograph does not seem to be able to handle this.

If one loads a first OGS into Orthograph, it reports the number of sequences it read successfully. For the first set, this is always the expected number. If a second set is loaded that contains gene names identical to gene names in the first set, the number of sequences Orthograph reports as entered into the database is smaller than the total number of genes present in the second set.

It seems that Orthograph does not include these genes and silently ignores them. This would be fatal for the functionality. There should at least be a prominent warning when this happens.
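Until such a warning exists, overlapping names can be detected before loading anything. A minimal sketch, assuming the OGS files are FASTA files and that the gene name is the first whitespace-delimited token of each header line; the function names are hypothetical and not part of Orthograph:

```python
from collections import defaultdict


def fasta_ids(lines):
    """Yield the ID (first whitespace-delimited token) of each FASTA header."""
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                yield tokens[0]


def shared_ids(gene_sets):
    """Map every ID found in more than one gene set to the sorted set names.

    gene_sets: dict of gene set name -> iterable of FASTA lines
    (an open file handle works as well as a list of strings).
    """
    seen = defaultdict(set)
    for name, lines in gene_sets.items():
        for seq_id in fasta_ids(lines):
            seen[seq_id].add(name)
    return {seq_id: sorted(names)
            for seq_id, names in seen.items() if len(names) > 1}
```

Passing open file handles for two OGS files as the dictionary values returns the names that occur in both; an empty result means the sets can be loaded without collisions under the assumptions above.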

The problem could be solved by adding an OGS identifier to the gene names used by Orthograph internally.

Workaround: For genome projects in which gene names might be identical across official gene sets, rename the genes in the OGS files and in the corresponding tab-delimited files, e.g. by prepending a species identifier to every gene name, so that the gene names are unique across the whole set of OGSs.
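The renaming can be scripted. The following is only a sketch under assumptions that should be checked against the actual files: the OGS files are FASTA with the gene name directly after `>`, and the tab-delimited file has the gene name in the second column and the taxon name in the third. The function names, the column defaults, and the `_` separator are hypothetical choices, not Orthograph conventions:

```python
def prefix_fasta(lines, prefix, sep="_"):
    """Prepend prefix + sep to the ID in every FASTA header line."""
    for line in lines:
        if line.startswith(">"):
            yield ">" + prefix + sep + line[1:].lstrip()
        else:
            yield line


def prefix_table(lines, prefixes, id_column=1, taxon_column=2, sep="_"):
    """Prepend the taxon-specific prefix to the gene ID of each table row.

    prefixes: dict of taxon name -> prefix. Rows whose taxon is not in
    the dict, and rows with too few columns, are passed through unchanged.
    """
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) > max(id_column, taxon_column):
            prefix = prefixes.get(fields[taxon_column])
            if prefix is not None:
                fields[id_column] = prefix + sep + fields[id_column]
        yield "\t".join(fields) + "\n"
```

The same prefix has to be used for a species in both its OGS file and its rows of the table, otherwise the names no longer match. Both functions are generators over lines, so the output can be written with `writelines` to a new file without holding the whole gene set in memory.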

This problem is likely to affect at least all vertebrate OGSs.