Open jorondo1 opened 4 months ago
we're working on it! The only really trick part is naming the sketches. @bluegenes or maybe @ccbaumler could probably point you at the right code more easily than me; otherwise I'll track it down when I have time :)
There have been quite a few developments with sourmash's database scripts recently.
@bluegenes and I will be meeting Wednesday to discuss where we stand currently with database creation/updating. I'll add this to my notes.
Should be able to create the updated release GTDB rather quickly all things considered. Check back in with this issue in a week's time if you don't hear anything.
Update! We have a bug that we are working through. #3191
databases available here: https://github.com/sourmash-bio/sourmash/issues/3246
Hi! I used the rs220 release to profile a bunch of metagenomes. In a set of samples, I found overall 8070 species, but only 6956 of them are in the lineage file (or the entire gtdb bac120_taxonomy file for that matter). The ones that are not are currently not listed as representative species in GTDB 220 (e.g. I got a hit for this one)
Could there be genomes in the database that used to be reps in rs214 but aren't anymore?
Update: we also need to update the links in the issue https://github.com/sourmash-bio/sourmash/issues/3246 to the current database! Please hold!
TL;DR - Two potential bugs with prefix and suffix of the idents.
Two things may be at play here:
reps
lineage file.$ grep GCF_000466605.1 gtdb-rs220.lineages.csv
GCF_000466605.1,f,d__Bacteria,p__Bacillota_A,c__Clostridia,o__Oscillospirales,f__Ruminococcaceae,g__Bittarella,s__Bittarella massiliensis
$ grep GCF_000466605.1 gtdb-rs220.lineages.reps.csv
searching with your provided example, I can find the identifier in the full database lineage file (listed as a false representative sequence). It was not located in the representative lineage file at all.
Following your link, I navigated to the representative sequence for here
$ grep GCF_001486165.1 gtdb-rs220.lineages.csv
GCF_001486165.1,t,d__Bacteria,p__Bacillota_A,c__Clostridia,o__Oscillospirales,f__Ruminococcaceae,g__Bittarella,s__Bittarella massiliensis
$ grep GCF_001486165.1 gtdb-rs220.lineages.reps.csv
GCF_001486165.1,t,d__Bacteria,p__Bacillota_A,c__Clostridia,o__Oscillospirales,f__Ruminococcaceae,g__Bittarella,s__Bittarella massiliensis
I see that this sequence is in both lineage files and listed as a representative sequence.
Searching the actual databases:
$ sourmash sig grep _000466605.1 gtdb-rs220-k21.zip
== This is sourmash version 4.8.8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
saving matching signatures to '-'
loaded 596565 total that matched ksize & molecule type
extracted 1 signatures from 1 file(s)
[{"class":"sourmash_signature","email":"","hash_function":"0.murmur64","filename":"GCA_000466605.1_genomic.fna.gz","name":"GCA_000466605.1 Clostridium sp. ATCC 29733 strain=ATCC 29733 ASM46660v1"
Notice that the
GCF
prefix is in the lineage file and theGCA
prefix is in the database. That seems like a bug.
$ sourmash sig grep _000466605.1 gtdb-reps-rs220-k21.zip
== This is sourmash version 4.8.8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
saving matching signatures to '-'
loaded 107212 total that matched ksize & molecule type
no matching signatures found!
the non-rep sequence is not in the rep database
$ sourmash sig grep GCF_001486165.1 gtdb-reps-rs220-k21.zip
== This is sourmash version 4.8.8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
saving matching signatures to '-'
loaded 107212 total that matched ksize & molecule type
extracted 1 signatures from 1 file(s)
[{"class":"sourmash_signature","email":"","hash_function":"0.murmur64","filename":"/dev/fd/63","name":"GCF_001486165.1 Bittarella massiliensis strain=GD6, Bittarella massiliensis"
$ sourmash sig grep GCF_001486165.1 gtdb-rs220-k21.zip == This is sourmash version 4.8.8. == == Please cite Brown and Irber (2016), doi:10.21105/joss.00027. == saving matching signatures to '-' loaded 596565 total that matched ksize & molecule type extracted 1 signatures from 1 file(s) [{"class":"sourmash_signature","email":"","hash_function":"0.murmur64","filename":"/dev/fd/63","name":"GCF_001486165.1 Bittarella massiliensis strain=GD6, Bittarella massiliensis"
The rep sequence is in both databases
Hope this adds some clarity. Please share an example and I can dig in more to your specific problem.
The lineage file and databases for GTDB should now be in sync with this commit (https://github.com/sourmash-bio/database-releases/pull/6/commits/47cc39777d6fe838796f404cd3e5f0b078585869)
I updated the scripts used to check the databases manifests for any updates as well as using the same genbank metadata to update the taxonomy file.
I am running the scripts now and will then update the links with the new databases.
I have completed and (hopefully) address the bugs noticed above. The final draft of the 220 release from GTDB is located /group/ctbrowngrp4/2024-ccbaumler-gtdb/gtdb-220
for the time being on the Farm. A summary report may be viewed here.
Running a quick check to compare numbers:
$ sourmash sig check --picklist gtdb-rs220.lineages.csv:ident:ident gtdb-rs220-k21.zip
== This is sourmash version 4.8.8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
picking column 'ident' of type 'ident' from 'gtdb-rs220.lineages.csv'
loaded 596628 distinct values into picklist.
loaded 596628 signatures.
for given picklist, found 596628 matches to 596628 distinct values
The lineage file and the database align!
sourmash sig check --picklist data/gtdb-rs220.oldlineages.csv:ident:ident gtdb-rs220-k21.zip
== This is sourmash version 4.8.8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
picking column 'ident' of type 'ident' from 'data/gtdb-rs220.oldlineages.csv'
loaded 596859 distinct values into picklist.
loaded 596628 signatures.
for given picklist, found 595826 matches to 596859 distinct values
When I compare these numbers against the report:
- there are 231 removed genomes from GenBank within the latest GTDB release.
- there are 802 genomes from GenBank that require an updated version in the latest GTDB release.
- there are a total of 1033 genomes that have been removed or updated in the latest GTDB release.
596859 - 596628 = 231
596859 - 595826 = 1033
From the summary report
172 + 59 = 231
580 + 453 = 1033
@ctb Could you please replace the zip
and csv
files in /group/ctbrowngrp/sourmash-db/gtdb-rs220/
with /group/ctbrowngrp4/2024-ccbaumler-gtdb/gtdb-220
to update the links at https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/gtdb-rs220/
$ ls -1
data
gtdb-reps-rs220-k21.zip
gtdb-reps-rs220-k31.zip
gtdb-reps-rs220-k51.zip
gtdb-rs220-k21.zip
gtdb-rs220-k31.zip
gtdb-rs220-k51.zip
gtdb-rs220.lineages.csv
gtdb-rs220.lineages.reps.csv
report
workflow-cleanup
I will being work on the protein databases now... sigh
Hi, I was wondering if the gtdb reps 220 had been sketched (or is about to be)? Otherwise, would it be possible to provide/point to the command you use to do it? I imagine it's fairly straightforward, but in case there a some special considerations to obtain the same format as the ones you publish, I prefered asking.
cheers