reactome / add-links

This is the AddLinks component of the Release system

Wrong species code for some KEGG identifiers #130

Closed SolomonShorser-OICR closed 4 years ago

SolomonShorser-OICR commented 4 years ago

In R71, a link was created for the identifier mtv:Rv2540c, but it should have been mtu:Rv2540c. It's possible that there are conflicts in the species codes for KEGG.

SolomonShorser-OICR commented 4 years ago

It looks like the programmatic list of KEGG species doesn't make an easy distinction between mtu and mtv. See: http://www.genome.jp/kegg-bin/download_htext?htext=br08601.keg&format=htext&filedir= and also https://www.genome.jp/kegg/catalog/org_list.html

In the first link, "Mycobacterium tuberculosis H37Rv" appears twice, once with the mtu prefix and once with the mtv prefix. Nothing else distinguishes the two entries, and because the species names are identical, a lookup by species name alone cannot tell which prefix to use.
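To illustrate the collision, here is a minimal sketch that groups KEGG codes by species name and reports any name claimed by more than one code. The `KEGG_ORGANISMS` list and the `find_ambiguous_names` helper are hypothetical names for illustration only; parsing of the actual KEGG file is not shown, and this is not the AddLinks code.

```python
from collections import defaultdict

# Illustrative (code, species name) pairs. The two M. tuberculosis rows mirror
# the duplicate described above; the list itself is an assumption, not real
# parsed KEGG output.
KEGG_ORGANISMS = [
    ("hsa", "Homo sapiens"),
    ("mtu", "Mycobacterium tuberculosis H37Rv"),
    ("mtv", "Mycobacterium tuberculosis H37Rv"),
]


def find_ambiguous_names(organisms):
    """Return {species name: [codes]} for names shared by more than one code."""
    codes_by_name = defaultdict(list)
    for code, name in organisms:
        codes_by_name[name].append(code)
    return {name: codes for name, codes in codes_by_name.items() if len(codes) > 1}


ambiguous = find_ambiguous_names(KEGG_ORGANISMS)
for name, codes in ambiguous.items():
    print(f"{name}: {', '.join(codes)}")
```

A check like this could run when the species list is loaded, so that a name-to-code map is never silently overwritten by a later entry with the same name.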

In the second link, it can be seen that mtu refers to data coming from RefSeq and mtv refers to data whose source is GenBank.

We might need to add special handling for M. tuberculosis that includes the prefix in the species name, so: "Mycobacterium tuberculosis H37Rv (mtu)" and "Mycobacterium tuberculosis H37Rv (mtv)". That would make look-ups by species name more complicated, so maybe look-ups should use a prefix match rather than requiring a full-string match...
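A rough sketch of that idea, assuming the species list is already available as (code, name) pairs: names shared by several codes get the code appended, and the lookup falls back to a prefix match when there is no exact match. `qualify_names` and `lookup_codes` are hypothetical helpers, not existing AddLinks functions, and the sample data is illustrative.

```python
from collections import Counter

# Illustrative sample data (an assumption, not parsed KEGG output).
KEGG_ORGANISMS = [
    ("hsa", "Homo sapiens"),
    ("mtu", "Mycobacterium tuberculosis H37Rv"),
    ("mtv", "Mycobacterium tuberculosis H37Rv"),
]


def qualify_names(organisms):
    """Map species name -> code, appending ' (code)' to names used by several codes."""
    counts = Counter(name for _, name in organisms)
    qualified = {}
    for code, name in organisms:
        key = f"{name} ({code})" if counts[name] > 1 else name
        qualified[key] = code
    return qualified


def lookup_codes(qualified, species_name):
    """Return the code for an exact match, else every code whose key starts with the name."""
    if species_name in qualified:
        return [qualified[species_name]]
    return sorted(code for key, code in qualified.items() if key.startswith(species_name))


names = qualify_names(KEGG_ORGANISMS)
print(lookup_codes(names, "Mycobacterium tuberculosis H37Rv (mtu)"))
print(lookup_codes(names, "Mycobacterium tuberculosis H37Rv"))
```

With this approach the unqualified name returns both candidate codes, so the caller still has to decide which one to use (for example, preferring the RefSeq-backed mtu). A plain prefix match can also return unrelated species that share a leading string, so it may need tightening.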