ncbi / datasets

NCBI Datasets is a new resource that lets you easily gather data from across NCBI databases.
https://www.ncbi.nlm.nih.gov/datasets

Limited number of genome downloads for some taxon #340

Closed mkdevesh closed 7 months ago

mkdevesh commented 7 months ago

I am trying to download genomes at assembly level 'chromosome' for several bacterial taxa, but I realized that fewer genomes were being downloaded than expected. I cross-checked the number for E. coli genomes and it is 4807 as of now. Below are the commands I used. I tried the 'dehydrated' option as well, with the same result. I then tried the 'complete' assembly level, which returned 7,763 genomes. That is quite surprising, as it should be lower than the chromosome-level count. The two numbers are also oddly similar (763 and 7,763), and I have no clue why.

E:\R\blast_test>datasets download genome taxon 562 --assembly-level chromosome --dehydrated --filename Coli2_dataset.zip
Collecting 763 genome records [================================================] 100% 763/763
Downloading: Coli2_dataset.zip    331kB valid zip structure -- files not checked
Validating package [================================================] 100% 4/4

E:\R\blast_test>datasets download genome taxon 562 --assembly-level chromosome --dehydrated --include genome --filename Coli2_dataset.zip
Collecting 763 genome records [================================================] 100% 763/763
Downloading: Coli2_dataset.zip    331kB valid zip structure -- files not checked
Validating package [================================================] 100% 4/4

E:\R\blast_test>datasets download genome taxon 562 --assembly-level complete --dehydrated --include genome --filename Coli2_dataset.zip
Collecting 7,763 genome records [================================================] 100% 7763/7763
Downloading: Coli2_dataset.zip    3.18MB valid zip structure -- files not checked
Validating package [================================================] 100% 4/4

E:\R\blast_test>datasets --version
datasets version: 16.10.1

Now I have to question all my downloads, as this has become unreliable. Please fix this issue so that the tool downloads the number of genomes actually available at the time of the request. Thanks.

ericcox1 commented 7 months ago

Hi @mkdevesh,

Thanks for opening this issue. We always appreciate bug reports but in this case I believe the command-line tool is returning the correct data. If you can find any examples of specific genomes (with identifiers) that are missing from your request, please share them and I would be happy to investigate further.

> I cross-checked the number for E. coli genomes and it is 4807 as of now.

How did you cross-check the number of E. coli genomes at the assembly level chromosome? If you could share the source you used for this, then I can double-check and make sure that I didn't miss anything.

> then I tried 'complete' assembly level which resulted in 7,763 which is quite surprising as it should be lesser than chromosome level

Surprising as it may seem, there are more E. coli genomes at the 'complete' assembly level than at the 'chromosome' assembly level.

I look forward to hearing from you soon.

Best, Eric

Eric Cox, PhD [Contractor] (he/him/his) NCBI Datasets Sequence Enhancements, Tools and Delivery (SeqPlus) NIH/NLM/NCBI eric.cox@nih.gov

mkdevesh commented 7 months ago

Hi, the numbers do not match the number of genomes that are actually available. I cross-checked against the Datasets website: https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=562&assembly_level=2:3 Here is a screenshot of the website: [Screenshot 2024-04-03 092203]

FYI, the number of genomes at chromosome level is now 4847; 40 genomes were uploaded last week.

ericcox1 commented 7 months ago

Hi @mkdevesh,

Thanks for sharing this. In the screenshot above, the assembly-level filter is set to show genomes at the assembly levels of both chromosome and complete, which correctly shows 4,847 genomes. You can get the same count using the command-line tool by restricting to GenBank genomes as follows:

datasets summary genome taxon 562 --assembly-level chromosome,complete --assembly-source genbank --limit none
{"total_count": 4847}
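For scripted comparisons, the count can be read from that JSON output. The following is a minimal Python sketch, not part of the datasets tool: the helper names `build_summary_command`, `parse_total_count`, and `genome_count` are made up for illustration, and it assumes the `datasets` executable is on the PATH and prints a JSON object with a `total_count` field, as in the command output in this comment.

```python
import json
import subprocess


def build_summary_command(taxon, levels, source="genbank"):
    """Assemble the `datasets summary genome taxon` call used in this thread."""
    return [
        "datasets", "summary", "genome", "taxon", str(taxon),
        "--assembly-level", ",".join(levels),
        "--assembly-source", source,
        "--limit", "none",  # as in the thread, where only the count is printed
    ]


def parse_total_count(output):
    """Extract total_count from the JSON text printed by the command."""
    return int(json.loads(output)["total_count"])


def genome_count(taxon, levels, source="genbank"):
    """Run the CLI (it must be installed) and return the reported count."""
    result = subprocess.run(
        build_summary_command(taxon, levels, source),
        capture_output=True, text=True, check=True,
    )
    return parse_total_count(result.stdout)
```

Calling `genome_count(562, ["chromosome", "complete"])` would correspond to the command in this comment; the number it returns changes as new assemblies are deposited.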

A note about comparing counts from the genome table page and the command-line tool: On the genome table page we show the genome count as the number of GenBank (GCA)/RefSeq (GCF) assembly pairs, to avoid counting genomes twice. On the CLI, GCA and GCF records are counted separately (as you found in your other issue, #342).
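The difference between the two counting schemes can be illustrated with a small Python sketch. It only illustrates the idea and is not how the genome table page is implemented: the record layout (`accession`, `paired`) and the accessions themselves are invented for the example.

```python
def count_records(records):
    """CLI-style count: every GCA and GCF record counts once."""
    return len(records)


def count_genomes(records):
    """Web-style count: a GCA/GCF pair counts as a single genome."""
    keys = set()
    for rec in records:
        acc, paired = rec["accession"], rec.get("paired")
        # Use the GenBank (GCA) accession as the key for a pair when it is known.
        if acc.startswith("GCF_") and paired:
            keys.add(paired)
        else:
            keys.add(acc)
    return len(keys)


# Invented records: one GCA/GCF pair plus one GenBank-only assembly.
records = [
    {"accession": "GCA_000000001.1", "paired": "GCF_000000001.1"},
    {"accession": "GCF_000000001.1", "paired": "GCA_000000001.1"},
    {"accession": "GCA_000000002.1", "paired": None},
]

print(count_records(records))  # 3
print(count_genomes(records))  # 2
```

With these three made-up records, the per-record count is 3 while the per-genome count is 2, which is the kind of gap seen between the CLI and the genome table page.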

Another note about the assembly level filter on the genome table page: the filter works as expected in the example you shared above, but I noticed that if you select "chromosome" level genomes only, the page shows both chromosome and complete genomes. This is a bug, and we will try to fix it in the next 2-4 weeks.

Best, Eric

mkdevesh commented 7 months ago

Thank you so much, this helps a lot.