Closed shenwei356 closed 3 months ago
Additionally, GCF_002882255.1's URL changed.
old: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/882/255/GCF_002882255.1_FW507-14D01
new: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/882/255/GCF_002882255.1_GW456-11-11-14-TSB2/
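The two URLs share the same accession-derived directory and differ only in the assembly-name part of the last path component. A minimal sketch of that observation, assuming the `GCA/GCF_#########.#` accession layout seen in the URLs above (the helper name `accession_dir` is hypothetical, not part of any tool mentioned here):

```python
import re

# Accession pattern as it appears in the URLs above: GCA/GCF, 9 digits, a version.
ACC_RE = re.compile(r"^(GC[AF])_(\d{9})\.(\d+)$")

FTP_ROOT = "https://ftp.ncbi.nlm.nih.gov/genomes/all"


def accession_dir(accession: str) -> str:
    """Return the FTP directory that depends only on the accession digits."""
    m = ACC_RE.match(accession)
    if m is None:
        raise ValueError(f"not an assembly accession: {accession!r}")
    prefix, digits, _version = m.groups()
    # Split the 9 digits into three groups of three: 002882255 -> 002/882/255
    parts = [digits[i:i + 3] for i in range(0, 9, 3)]
    return "/".join([FTP_ROOT, prefix, *parts])


old = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/882/255/GCF_002882255.1_FW507-14D01"
new = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/882/255/GCF_002882255.1_GW456-11-11-14-TSB2/"
base = accession_dir("GCF_002882255.1")
print(old.startswith(base), new.startswith(base))
```

So a renamed assembly still lives under the same parent directory; only the `<accession>_<asm_name>` folder name changes.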
Unfortunately that is the case and some records are removed from NCBI forever. Thanks for the investigation, I will add this information to the README and link this issue.
@donovan.parks' reply
NCBI generally does not (never?) delete data, but data records can become suppressed. For example, GCA_024650005.1 has been suppressed, but you can still find information about this record and how to download the data at: https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_024650005.1/
Thanks for the investigation. Did you ever manage to find sequences for any of those "suppressed" entries?
The metadata still exists on the FTP and the NCBI website, but you never get to the sequence. Even if you go to the WGS entry, it's not there (I tried this and this). There are many different reasons for suppression, and in some cases it may still be possible to retrieve the data.
When a URL changes, as you mentioned above, it eventually gets updated in the main assembly_summary_refseq.txt.
Note that genome_updater already scrapes the "suppressed" or older entries from assembly_summary_refseq_historical.txt, but that file only holds metadata. If the sequence is not on the FTP, it will skip it.
I just ignored these ungettable records. :grinning:
genome_updater is already good enough for downloading genbank+refseq assemblies. I once tried to generate URLs from the assembly_summary file, but it turned out more effort was needed.
A few days ago, I downloaded all 2 million prokaryotic genomes; only 3 genomes failed to download.
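For reference, a rough sketch of what generating URLs from an assembly_summary file can look like. It assumes the file is tab-separated, that its header line starts with `#` and names the `assembly_accession` and `ftp_path` columns, that a missing path is written as `na`, and that the FASTA file is named `<folder>_genomic.fna.gz`. The function names are hypothetical, and this is not how genome_updater does it:

```python
def genomic_fasta_url(ftp_path: str) -> str:
    """Build the genomic FASTA URL from an assembly's FTP directory."""
    base = ftp_path.rstrip("/")
    folder = base.rsplit("/", 1)[-1]  # e.g. GCF_002882255.1_GW456-11-11-14-TSB2
    return f"{base}/{folder}_genomic.fna.gz"


def urls_from_summary(lines):
    """Map accession -> FASTA URL for every row that has an ftp_path."""
    header = None
    urls = {}
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            # The column header is the comment line that names ftp_path.
            fields = line.lstrip("#").strip().split("\t")
            if "ftp_path" in fields:
                header = fields
            continue
        if header is None or not line:
            continue
        row = dict(zip(header, line.split("\t")))
        ftp_path = row.get("ftp_path", "na")
        if ftp_path != "na":  # assumed placeholder for a missing path
            urls[row["assembly_accession"]] = genomic_fasta_url(ftp_path)
    return urls
```

Usage would be along the lines of `urls_from_summary(open("assembly_summary_refseq.txt"))`. Entries whose directory was renamed or removed since the summary was generated will still fail at download time, which is part of the extra effort.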
GTDB complete genomes
Oh, 402,538 < 402,709 genomes! 402,709 is from https://gtdb.ecogenomic.org/.
Check it.
So, 171 genomes are missing. Here's the full list: missing.txt.
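The count is just the difference 402,709 - 402,538 = 171, and the list itself is a set difference between the expected and the downloaded accessions. A minimal sketch, assuming one accession per line in two plain-text lists (the function names and the file names in the usage note are hypothetical):

```python
def read_accessions(lines):
    """Collect non-empty, whitespace-stripped accessions into a set."""
    return {line.strip() for line in lines if line.strip()}


def missing_accessions(expected, downloaded):
    """Return accessions that are expected but were not downloaded, sorted."""
    return sorted(set(expected) - set(downloaded))


# Sanity check on the numbers quoted above.
print(402_709 - 402_538)
```

With real files this would be something like `missing_accessions(read_accessions(open("gtdb_ids.txt")), read_accessions(open("downloaded_ids.txt")))`.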
I manually searched for them (with and without the version suffix .1) on NCBI, and no records were found. So they were removed.
Who are they?