Closed mkdevesh closed 7 months ago
Hi @mkdevesh,
Thanks for opening this issue. We always appreciate bug reports but in this case I believe the command-line tool is returning the correct data. If you can find any examples of specific genomes (with identifiers) that are missing from your request, please share them and I would be happy to investigate further.
I cross-checked the number for E. coli genomes and it is 4807 as of now.
How did you cross-check the number of E. coli genomes at the assembly level chromosome? If you could share the source you used for this, then I can double-check and make sure that I didn't miss anything.
then I tried 'complete' assembly level which resulted in 7,763 which is quite surprising as it should be lesser than chromosome level
Surprisingly, the number of E. coli genomes at the assembly level of complete is more than the number of E. coli genomes at the assembly level of chromosome.
I look forward to hearing from you soon.
Best, Eric
Eric Cox, PhD [Contractor] (he/him/his) NCBI Datasets Sequence Enhancements, Tools and Delivery (SeqPlus) NIH/NLM/NCBI eric.cox@nih.gov
Hi, the numbers do not match the number of genomes that are actually there. I cross-checked with the datasets website: https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=562&assembly_level=2:3 Here is a screenshot of the website.
FYI, the number of genomes at chromosome level is now 4847. 40 genomes were uploaded last week.
Hi @mkdevesh,
Thanks for sharing this. In the screenshot above, the assembly-level filter is set to show genomes at the assembly levels of both chromosome and complete, which correctly shows 4,847 genomes. You can get the same count using the command-line tool by restricting to GenBank genomes as follows:
datasets summary genome taxon 562 --assembly-level chromosome,complete --assembly-source genbank --limit none
{"total_count": 4847}
A note about comparing counts from the genome table page and the command-line tool: On the genome table page we show the genome count as the number of GenBank (GCA)/RefSeq (GCF) assembly pairs, to avoid counting genomes twice. On the CLI, GCA and GCF records are counted separately (as you found in your other issue, #342).
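To make the difference between the two counting schemes concrete, here is a minimal shell sketch. It is only an illustration, not how the genome table page or the CLI computes its totals. The accession list is hypothetical, and the sketch assumes that a paired GCA and GCF record share the numeric part of the accession (versions may differ).

```shell
#!/bin/sh
# Hypothetical accessions for illustration only; real values would come
# from the CLI output.
accessions='GCA_000000001.1
GCF_000000001.1
GCA_000000002.1
GCA_000000003.2
GCF_000000003.1'

# CLI-style count: every GCA and GCF record is counted once.
count_records() {
    printf '%s\n' "$1" | grep -c '^GC[AF]_'
}

# Table-style count: a GCA/GCF pair is counted once.
# Assumption: paired records share the numeric part of the accession.
count_pairs() {
    printf '%s\n' "$1" \
        | sed -n 's/^GC[AF]_\([0-9]*\)\..*$/\1/p' \
        | sort -u \
        | wc -l \
        | tr -d ' '
}

echo "records: $(count_records "$accessions")"   # records: 5
echo "pairs:   $(count_pairs "$accessions")"     # pairs:   3
```

With this sample input, the record count is 5 while the pair count is 3, which is the kind of gap that can appear when the CLI is not restricted to one assembly source.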
Another note about the assembly level filter on the genome table page: Although in the example that you shared above, the filter is working as expected, I noticed that if you try to select "chromosome" level genomes only, then we are showing both chromosome and complete genomes. This is a bug and we are going to try to fix this in the next 2-4 weeks.
Best, Eric
Thank you so much, this helps a lot.
I am trying to download genomes at assembly level 'chromosome' for several bacterial taxa, but realized that fewer genomes were being downloaded than expected. I cross-checked the number for E. coli genomes and it is 4807 as of now. Here is the code used for downloading. I tried the 'dehydrated' option as well, with the same results, and then I tried 'complete' assembly level which resulted in 7,763 which is quite surprising as it should be lesser than chromosome level. The numbers are also quite similar (763 and 7,763), and I have no clue why.
Now I have to question all the downloads as this has become unreliable. Please solve this issue so that it downloads the correct number of genomes at that time. Thanks