merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

Adding the arCOG14 database to our COGs infrastructure #2136

Closed ivagljiva closed 11 months ago

ivagljiva commented 11 months ago

This PR adds a new version of COG data, arCOG14, to anvi-setup-ncbi-cogs and anvi-run-ncbi-cogs. This database is the 'Archaeal Clusters of Orthologous Groups', last released in 2014 and described in the paper by Makarova, Wolf, and Koonin (2015).

The only changes that were required to add this were: 1) adding the relevant files from https://ftp.ncbi.nih.gov/pub/wolf/COGs/arCOG/ to anvi'o's list of known COG versions, and 2) adding a few clauses for version-specific processing of these files in either program. The data formatting in this database seems to be a strange hybrid between COG14 and COG20, so in many cases, the previous clauses for one of those versions worked just fine.

There was one hiccup: a portion of the ar14.arCOG.csv file is incomplete. @meren and I confirmed that all lines starting from line 388147 have far fewer fields and are missing the critical arCOG ID field, which we use to match protein sequences to their COG IDs. To bypass this issue, I wrote a very stupid hack that skips the affected lines. Luckily, the downstream code is very smart and handles the missing protein IDs. So all that happens is that users see warnings like this when they set up arCOG14 with anvi-setup-ncbi-cogs:

WARNING
===============================================
There is a problem with the /Users/iva/software/anvio/anvio/data/misc/COG/arCOG1
4/RAW_DATA_FROM_NCBI/ar14.arCOG.csv file downloaded from NBCI. Basically,
starting from line 388147, the arCOG ID number is not provided, which means that
we cannot match those protein sequences to their COG IDs. The only solution we
have at the moment is to skip the 24385 protein IDs that are affected by this
issue. Sorry.

WARNING
===============================================
There were 24385 protein IDs without an associated COG ID. This may cause issues
later, so please keep this warning in mind. Here are a few examples of the
affected protein IDs: 340344494, 91772558, 330835314, 397775427, 300712897

And later when they annotate with arCOG14 using anvi-run-ncbi-cogs, they see a warning like this:

WARNING
===============================================
Well. Your COGs were successfully added to the database, but there were some
garbage anvi'o brushed off under the rug. There were 11 genes in your database
that hit 11 protein IDs in NCBIs COGs database, but since NCBI did not release
what COGs they correspond to in the database they made available (that helps us
to resolve protein IDs to COG ids), we could not annotate those genes with
functions. Anvi'o apologizes on behalf of all computer scientists for half-done
stuff we often force biologists to deal with. If you want to do some Googling,
these were the offending protein IDs: '148642832, 288559476, 48477089,
154151544, 159040820, 474934147, 124486342, 124485671, 124485670, 73668402,
383318644'.

But otherwise everything is fine.
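
For readers who want a concrete picture of the workaround described above, here is a minimal sketch of skipping truncated rows while recording which protein IDs were lost. It is not the code in this PR: the function name `parse_cog_csv`, the four-column layout, and the column positions are all hypothetical and chosen only for illustration, and the real ar14.arCOG.csv layout may differ.

```python
import csv
import io


def parse_cog_csv(handle, expected_fields, protein_col, cog_col):
    """Map protein IDs to COG IDs, skipping rows that have too few fields.

    Hypothetical helper for illustration; the column positions are assumptions.
    Returns (protein_to_cogs, skipped_protein_ids, first_bad_line).
    """
    protein_to_cogs = {}
    skipped_protein_ids = []
    first_bad_line = None

    for line_no, row in enumerate(csv.reader(handle), start=1):
        if not row:
            continue  # csv.reader yields an empty list for a blank line

        if len(row) < expected_fields:
            # Truncated row: remember where the problem starts and, if the
            # protein ID is still present, keep it so it can be reported.
            if first_bad_line is None:
                first_bad_line = line_no
            if len(row) > protein_col:
                skipped_protein_ids.append(row[protein_col].strip())
            continue

        protein_id = row[protein_col].strip()
        cog_id = row[cog_col].strip()
        # A protein may belong to more than one COG, so collect a set.
        protein_to_cogs.setdefault(protein_id, set()).add(cog_id)

    return protein_to_cogs, skipped_protein_ids, first_bad_line


# Tiny made-up input: the third row lacks its COG ID field.
sample = io.StringIO(
    "1,GenomeA,111,arCOG00001\n"
    "2,GenomeA,222,arCOG00002\n"
    "3,GenomeB,333\n"
)
mapping, skipped, first_bad = parse_cog_csv(
    sample, expected_fields=4, protein_col=2, cog_col=3
)
print(f"Skipped {len(skipped)} protein IDs starting from line {first_bad}")
```

Collecting the skipped IDs rather than silently dropping the rows is what makes a warning like the ones above possible: the count and a few example IDs can be shown to the user.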

You can test the new code by running the following (multithreading with -T recommended):

anvi-setup-ncbi-cogs --cog-version arCOG14 -T 2
anvi-run-ncbi-cogs -c CONTIGS.db --cog-version arCOG14 -T 2

ivagljiva commented 11 months ago

I noticed we don't have a CITATION output for these programs. I think it would be nice to add one to anvi-run-ncbi-cogs, though of course the citation is version-dependent.

I will add one, using the following references for each version:

meren commented 11 months ago

Hey @ivagljiva, thank you very much for adding support for this. I think it would be nice to add a few lines of text to anvi-setup-ncbi-cogs and anvi-run-ncbi-cogs to make sure there is some information somewhere about which databases are downloaded.

By the way, anvio/data/misc/checksum.md5.txt is a very vague filename and I feel like we will forget about it in the long run. Shall we rename it to anvio/data/misc/CHECKSUMS-FOR-COG-DATA.txt and change affected code?

Thank you again!

meren commented 11 months ago

(I promise I will get to the citations issue at some point to offer a global solution for it :))

ivagljiva commented 11 months ago

Citation lines added :)

> By the way, anvio/data/misc/checksum.md5.txt is a very vague filename and I feel like we will forget about it in the long run. Shall we rename it to anvio/data/misc/CHECKSUMS-FOR-COG-DATA.txt and change affected code?

This sounds like a good idea. I will do it.

Side note: there are also a few files in there that we don't seem to use in the COGs code, namely:

From the file names, they look like they belong to the COG14 release. I will leave them alone for now, but if you think we should remove them from anvio/data/misc/checksum.md5.txt, let me know :)

ivagljiva commented 11 months ago

I tested all version options, and they work with the new checksum file name (which only applies to COG14 and arCOG14, but I tested COG20 anyway). I will merge it now :)
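
For context on what a checksum file like this is used for, below is a rough sketch of verifying downloaded files against a list of MD5 sums. It assumes an md5sum-style layout (one "<md5> <filename>" pair per line), which may not match the real file, and the helper names `md5_of_file`, `load_checksums`, and `find_mismatches` are made up for this example rather than taken from anvi'o.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_checksums(path):
    """Parse lines of the form '<md5> <filename>' (assumed layout)."""
    checksums = {}
    with open(path) as fh:
        for line in fh:
            parts = line.split()
            if len(parts) >= 2:
                checksums[parts[1]] = parts[0]
    return checksums


def find_mismatches(data_dir, checksums):
    """Return names of listed files that are missing or whose MD5 differs."""
    bad = []
    for name, expected in checksums.items():
        path = os.path.join(data_dir, name)
        if not os.path.isfile(path) or md5_of_file(path) != expected:
            bad.append(name)
    return bad


# Demo with throwaway files: verify once, then alter the data file.
with tempfile.TemporaryDirectory() as tmp:
    data_file = os.path.join(tmp, "example.csv")
    with open(data_file, "w") as fh:
        fh.write("1,GenomeA,111,arCOG00001\n")

    checksum_file = os.path.join(tmp, "CHECKSUMS.txt")
    with open(checksum_file, "w") as fh:
        fh.write(f"{md5_of_file(data_file)}  example.csv\n")

    unchanged = find_mismatches(tmp, load_checksums(checksum_file))

    with open(data_file, "a") as fh:
        fh.write("extra bytes")
    modified = find_mismatches(tmp, load_checksums(checksum_file))

print(unchanged, modified)
```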

meren commented 11 months ago

🚀 THANK YOU 🚀 :)