The number of GTDB complete genomes does not match the official website because some records were deleted from NCBI

shenwei356 commented 9 months ago

GTDB complete genomes

time genome_updater.sh -d "refseq,genbank" -g "archaea,bacteria" -f "genomic.fna.gz" -o "GTDB_complete" -M "gtdb" -t 12 -m -L curl -i

cd GTDB_complete/2024-01-30_19-34-40/
wc -l assembly_summary.txt 
402538 assembly_summary.txt

Oh, 402,538 < 402,709 genomes! 402,709 is from https://gtdb.ecogenomic.org/.

Let's check.

# download metadata
wget https://data.gtdb.ecogenomic.org/releases/release214/214.1/ar53_metadata_r214.tsv.gz
wget https://data.gtdb.ecogenomic.org/releases/release214/214.1/bac120_metadata_r214.tsv.gz

# concatenate metadata
(zcat ar53_metadata_r214.tsv.gz; zcat bac120_metadata_r214.tsv.gz | sed 1d) > metadata.tsv

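# list GTDB records absent from assembly_summary.txt
# (the first csvtk call strips the 3-character RS_/GB_ prefix from GTDB accessions)
cd GTDB_complete/2024-01-30_19-34-40/
csvtk replace -t -p '^...' ../../metadata.tsv | csvtk grep -t -v -P <(cut -f 1 assembly_summary.txt) > missing.tsv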

csvtk dim -t missing.tsv
file         num_cols  num_rows
missing.tsv       110       171

So, 171 genomes are missing. Here's the full list: missing.txt.
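
For readers without csvtk, the same cross-check can be approximated with plain awk. This is a minimal sketch, not the command used above: the function name `list_missing` is hypothetical, and it assumes that the first column of both files holds the accession, that the metadata file has exactly one header line, and that every GTDB accession carries a 3-character prefix such as `RS_` or `GB_`.

```shell
# Sketch: print GTDB accessions that are absent from an assembly summary.
# list_missing is a hypothetical helper, not part of genome_updater or csvtk.
list_missing() {
    # $1 = GTDB metadata TSV (with header), $2 = assembly_summary.txt (no header)
    awk -F'\t' '
        # Summary file (first argument to awk): remember each downloaded accession.
        FILENAME == ARGV[1] { have[$1] = 1; next }
        # Metadata file: skip the header line.
        FNR == 1 { next }
        # Drop the assumed 3-character prefix and report accessions not seen.
        { acc = substr($1, 4); if (!(acc in have)) print acc }
    ' "$2" "$1"
}
```

Called as `list_missing ../../metadata.tsv assembly_summary.txt | wc -l`, it should give the same count as `csvtk dim`, provided those assumptions hold.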

$ csvtk cut -t -f accession missing.tsv | head -n 5
accession
GCA_024650005.1
GCF_023371115.1
GCF_024450885.1
GCF_024654755.1

I searched for them manually on NCBI (with and without the version suffix .1) and found no records, so they appear to have been removed.

Which organisms are they?

$ csvtk freq -t -f ncbi_organism_name missing.tsv -nr | csvtk pretty -t
ncbi_organism_name                                     frequency
----------------------------------------------------   ---------
Escherichia coli                                       78
Acinetobacter baumannii                                38
Klebsiella pneumoniae                                  13
Pseudomonas aeruginosa                                 10
Staphylococcus xylosus                                 5
Acinetobacter nosocomialis                             2
Proteus mirabilis                                      2
Candidatus Bathyarchaeota archaeon                     1
Chromohalobacter sp. TMW 2.2303                        1
Enterobacter cloacae                                   1
Enterobacter hormaechei                                1
Enterobacter roggenkampii                              1
Enterobacter sp. ODB01                                 1
Fusobacterium sp. Marseille-Q7035                      1
Klebsiella oxytoca                                     1
Klebsiella quasipneumoniae                             1
Klebsiella sp. VKM B-1436                              1
Limnospira indica PCC 8005                             1
Methylobacillus methanolivorans                        1
Oscillospiraceae bacterium BX18                        1
Pseudoalteromonas rhizosphaerae                        1
Pseudomonas graminis                                   1
Pseudomonas qingdaonensis                              1
Salmonella enterica                                    1
Salmonella enterica subsp. enterica serovar Kedougou   1
Salmonella enterica subsp. enterica serovar Stanley    1
Streptomyces sp. GBA 94-10                             1
Stutzerimonas frequens                                 1
Tunicatimonas sp. TK19036                              1
Xylella fastidiosa subsp. multiplex                    1

shenwei356 commented 9 months ago

Additionally, GCF_002882255.1's URL changed.

old: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/882/255/GCF_002882255.1_FW507-14D01
new: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/882/255/GCF_002882255.1_GW456-11-11-14-TSB2/
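
Both URLs share one layout: the accession prefix, the nine accession digits split into three groups, then a directory named after the accession and the assembly name. The sketch below rebuilds a directory URL from those two pieces. The helper name `assembly_dir_url` is hypothetical, and the layout is inferred only from the two URLs quoted here. Assembly names containing spaces or other special characters may be written differently on the server, which this sketch does not handle.

```shell
# Sketch: derive an NCBI FTP directory URL from an accession and assembly name.
# assembly_dir_url is a hypothetical helper; the path layout is an assumption
# inferred from the example URLs in this thread.
assembly_dir_url() {
    acc=$1                      # e.g. GCF_002882255.1
    name=$2                     # e.g. GW456-11-11-14-TSB2
    prefix=${acc%%_*}           # GCF or GCA
    digits=${acc#*_}            # 002882255.1
    digits=${digits%%.*}        # 002882255
    # Split the nine digits into three 3-digit path components.
    p1=$(printf '%s\n' "$digits" | cut -c1-3)
    p2=$(printf '%s\n' "$digits" | cut -c4-6)
    p3=$(printf '%s\n' "$digits" | cut -c7-9)
    printf 'https://ftp.ncbi.nlm.nih.gov/genomes/all/%s/%s/%s/%s/%s_%s/\n' \
        "$prefix" "$p1" "$p2" "$p3" "$acc" "$name"
}
```

Because the directory name embeds the assembly name, a renamed assembly moves to a new URL even when the accession stays the same, as happened here.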

pirovc commented 9 months ago

Unfortunately, that is the case: some records are removed from NCBI permanently. Thanks for the investigation. I will add this information to the README and link this issue.

shenwei356 commented 8 months ago

@donovan.parks' reply

NCBI generally does not (never?) delete data, but data records can become suppressed. For example, GCA_024650005.1 has been suppressed, but you can still find information about this record and how to download the data at: https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_024650005.1/

pirovc commented 8 months ago

Thanks for the investigation. Did you ever manage to find sequences for any of those "suppressed" entries?

The metadata still exists on the FTP server and the NCBI website, but you can never get to the sequence. Even if you go to the WGS entry, it's not there (I tried this and this). There are many different reasons for suppression, and in some cases it may still be possible to retrieve the data.

As for the changed URL you mentioned above, it eventually gets updated in the main assembly_summary_refseq.txt.

Note that genome_updater already scrapes the "suppressed" or older entries from assembly_summary_refseq_historical.txt, but that file only holds metadata. If the sequence is not on the FTP server, genome_updater will skip it.
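
To see how a given accession is recorded in a historical summary, one option is to read its version_status field. The sketch below is not what genome_updater does internally. The helper name `version_status` is hypothetical, and it assumes a local copy of a historical summary file in NCBI's usual tab-separated layout, with the accession in column 1 and version_status in column 11.

```shell
# Sketch: print the version_status recorded for one accession.
# version_status is a hypothetical helper; the column positions
# (1 = accession, 11 = version_status) are assumptions about the file layout.
version_status() {
    # $1 = accession, $2 = local copy of a historical assembly summary
    awk -F'\t' -v acc="$1" '
        # Skip comment and header lines.
        /^#/ { next }
        # Report the status of the first matching record and stop reading.
        $1 == acc { print $11; found = 1; exit }
        # Say so explicitly when the accession is not in the file.
        END { if (!found) print "not_listed" }
    ' "$2"
}
```

For example, `version_status GCA_024650005.1 assembly_summary_genbank_historical.txt` would print the status stored for that record if the file lists it, and `not_listed` otherwise.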

shenwei356 commented 8 months ago

I just ignored these unretrievable records. :grinning:

genome_updater is already good enough for downloading genbank+refseq assemblies. I once tried to generate URLs from the assembly_summary file myself, but it turned out to need more effort than expected.

A few days ago, I downloaded all 2 million prokaryotic genomes, and only 3 of them failed to download.

pirovc / genome_updater

The number of GTDB complete genomes mismatched with that in official website due to some records are deleted in NCBI #94