sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
473 stars 80 forks source link

GTDB release 9 - RS220 - databases #3183

Open jorondo1 opened 4 months ago

jorondo1 commented 4 months ago

Hi, I was wondering if the gtdb reps 220 had been sketched (or is about to be)? Otherwise, would it be possible to provide/point to the command you use to do it? I imagine it's fairly straightforward, but in case there a some special considerations to obtain the same format as the ones you publish, I prefered asking.

cheers

ctb commented 4 months ago

we're working on it! The only really trick part is naming the sketches. @bluegenes or maybe @ccbaumler could probably point you at the right code more easily than me; otherwise I'll track it down when I have time :)

ccbaumler commented 4 months ago

There have been quite a few developments with sourmash's database scripts recently.

@bluegenes and I will be meeting Wednesday to discuss where we stand currently with database creation/updating. I'll add this to my notes.

Should be able to create the updated release GTDB rather quickly all things considered. Check back in with this issue in a week's time if you don't hear anything.

ccbaumler commented 4 months ago

Update! We have a bug that we are working through. #3191

ctb commented 3 months ago

databases available here: https://github.com/sourmash-bio/sourmash/issues/3246

jorondo1 commented 2 months ago

Hi! I used the rs220 release to profile a bunch of metagenomes. In a set of samples, I found overall 8070 species, but only 6956 of them are in the lineage file (or the entire gtdb bac120_taxonomy file for that matter). The ones that are not are currently not listed as representative species in GTDB 220 (e.g. I got a hit for this one)

Could there be genomes in the database that used to be reps in rs214 but aren't anymore?

ccbaumler commented 2 months ago

Update: we also need to update the links in the issue https://github.com/sourmash-bio/sourmash/issues/3246 to the current database! Please hold!

TL;DR - Two potential bugs with prefix and suffix of the idents.

Two things may be at play here:

  1. You may be using the full database and searching the reps lineage file.
$ grep  GCF_000466605.1  gtdb-rs220.lineages.csv
GCF_000466605.1,f,d__Bacteria,p__Bacillota_A,c__Clostridia,o__Oscillospirales,f__Ruminococcaceae,g__Bittarella,s__Bittarella massiliensis
$ grep  GCF_000466605.1  gtdb-rs220.lineages.reps.csv

searching with your provided example, I can find the identifier in the full database lineage file (listed as a false representative sequence). It was not located in the representative lineage file at all.

Following your link, I navigated to the representative sequence for here

$ grep   GCF_001486165.1   gtdb-rs220.lineages.csv
GCF_001486165.1,t,d__Bacteria,p__Bacillota_A,c__Clostridia,o__Oscillospirales,f__Ruminococcaceae,g__Bittarella,s__Bittarella massiliensis
$ grep   GCF_001486165.1   gtdb-rs220.lineages.reps.csv
GCF_001486165.1,t,d__Bacteria,p__Bacillota_A,c__Clostridia,o__Oscillospirales,f__Ruminococcaceae,g__Bittarella,s__Bittarella massiliensis

I see that this sequence is in both lineage files and listed as a representative sequence.

Searching the actual databases:

$ sourmash sig grep _000466605.1 gtdb-rs220-k21.zip

== This is sourmash version 4.8.8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
                                                                                                                                                                                                                     saving matching signatures to '-'
loaded 596565 total that matched ksize & molecule type
extracted 1 signatures from 1 file(s)
[{"class":"sourmash_signature","email":"","hash_function":"0.murmur64","filename":"GCA_000466605.1_genomic.fna.gz","name":"GCA_000466605.1 Clostridium sp. ATCC 29733 strain=ATCC 29733 ASM46660v1"

Notice that the GCF prefix is in the lineage file and the GCA prefix is in the database. That seems like a bug.

$ sourmash sig grep _000466605.1 gtdb-reps-rs220-k21.zip

== This is sourmash version 4.8.8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

saving matching signatures to '-'
loaded 107212 total that matched ksize & molecule type
no matching signatures found!

the non-rep sequence is not in the rep database

$ sourmash sig grep GCF_001486165.1 gtdb-reps-rs220-k21.zip

== This is sourmash version 4.8.8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

saving matching signatures to '-'
loaded 107212 total that matched ksize & molecule type
extracted 1 signatures from 1 file(s)
[{"class":"sourmash_signature","email":"","hash_function":"0.murmur64","filename":"/dev/fd/63","name":"GCF_001486165.1 Bittarella massiliensis strain=GD6, Bittarella massiliensis"
$ sourmash sig grep GCF_001486165.1 gtdb-rs220-k21.zip                                                                                                                                                                                                                                                                                                  == This is sourmash version 4.8.8. ==                                                                                                                                                                                == Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==                                                                                                                                                                                                                                                                                                                                                                        saving matching signatures to '-'                                                                                                                                                                                    loaded 596565 total that matched ksize & molecule type                                                                                                                                                               extracted 1 signatures from 1 file(s)                                                                                                                                                                                [{"class":"sourmash_signature","email":"","hash_function":"0.murmur64","filename":"/dev/fd/63","name":"GCF_001486165.1 Bittarella massiliensis strain=GD6, Bittarella massiliensis"

The rep sequence is in both databases

  1. One step in database curation of release has change to update sequence by their most current versions in Genbank's database and not the ones that GTDB used at time of release. This may cause the lineage file to miss the files due to the float suffix at the end of the identifier. Another potential bug.

Hope this adds some clarity. Please share an example and I can dig in more to your specific problem.

ccbaumler commented 2 months ago

The lineage file and databases for GTDB should now be in sync with this commit (https://github.com/sourmash-bio/database-releases/pull/6/commits/47cc39777d6fe838796f404cd3e5f0b078585869)

I updated the scripts used to check the databases manifests for any updates as well as using the same genbank metadata to update the taxonomy file.

I am running the scripts now and will then update the links with the new databases.

ccbaumler commented 1 month ago

I have completed and (hopefully) address the bugs noticed above. The final draft of the 220 release from GTDB is located /group/ctbrowngrp4/2024-ccbaumler-gtdb/gtdb-220 for the time being on the Farm. A summary report may be viewed here.

Running a quick check to compare numbers:

$ sourmash sig check --picklist gtdb-rs220.lineages.csv:ident:ident gtdb-rs220-k21.zip

== This is sourmash version 4.8.8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

picking column 'ident' of type 'ident' from 'gtdb-rs220.lineages.csv'
loaded 596628 distinct values into picklist.
loaded 596628 signatures.
for given picklist, found 596628 matches to 596628 distinct values

The lineage file and the database align!

sourmash sig check --picklist data/gtdb-rs220.oldlineages.csv:ident:ident gtdb-rs220-k21.zip

== This is sourmash version 4.8.8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

picking column 'ident' of type 'ident' from 'data/gtdb-rs220.oldlineages.csv'
loaded 596859 distinct values into picklist.
loaded 596628 signatures.
for given picklist, found 595826 matches to 596859 distinct values

When I compare these numbers against the report:

  1. there are 231 removed genomes from GenBank within the latest GTDB release.
  2. there are 802 genomes from GenBank that require an updated version in the latest GTDB release.
  3. there are a total of 1033 genomes that have been removed or updated in the latest GTDB release.

596859 - 596628 = 231

596859 - 595826 = 1033

From the summary report

172 + 59 = 231

580 + 453 = 1033

@ctb Could you please replace the zip and csv files in /group/ctbrowngrp/sourmash-db/gtdb-rs220/ with /group/ctbrowngrp4/2024-ccbaumler-gtdb/gtdb-220 to update the links at https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/gtdb-rs220/

$ ls -1
data
gtdb-reps-rs220-k21.zip
gtdb-reps-rs220-k31.zip
gtdb-reps-rs220-k51.zip
gtdb-rs220-k21.zip
gtdb-rs220-k31.zip
gtdb-rs220-k51.zip
gtdb-rs220.lineages.csv
gtdb-rs220.lineages.reps.csv
report
workflow-cleanup

I will being work on the protein databases now... sigh