ncbi / datasets

NCBI Datasets is a new resource that lets you easily gather data from across NCBI databases.
https://www.ncbi.nlm.nih.gov/datasets

"Total" sequence length doesn't include organelles #403

Open muffato opened 2 months ago

muffato commented 2 months ago

Describe the bug

Hello NCBI !

The assembly GCA_964199945.1 is reported as having a "Total Sequence Length" of 1,327,610,284 bp, but the FASTA file actually contains 1,328,070,353 bp. The difference (460,069 bp) is exactly the combined length of the mitochondrion (MT) and the plastid.

The documentation at https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/dataformat/tsv/dataformat_tsv_genome/ lists the field as:

assmstats-total-sequence-len: Assembly Stats Total Sequence Length

To Reproduce

$ datasets summary genome accession GCA_964199945.1 --as-json-lines | dataformat tsv genome --fields assmstats-total-sequence-len --elide-header
1327610284
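
For the FASTA side of the comparison, here is a minimal Python sketch that sums the sequence lengths in a FASTA file and compares the result with the reported value. It is only an illustration, not the method used to obtain the numbers above. The function names (`fasta_lengths`, `report`) and the file path in the usage note are hypothetical. `REPORTED_TOTAL` is the value printed by the command above.

```python
import gzip
from typing import Dict, Iterable

# Value returned by assmstats-total-sequence-len for GCA_964199945.1 (from this issue).
REPORTED_TOTAL = 1_327_610_284


def fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return {sequence_id: length} for every record in a stream of FASTA lines."""
    lengths: Dict[str, int] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: use the first whitespace-delimited token as the ID.
            parts = line[1:].split()
            current = parts[0] if parts else f"unnamed_{len(lengths) + 1}"
            lengths[current] = 0
        elif current is not None:
            # Sequence line: count every character (bases, Ns, etc.).
            lengths[current] += len(line)
    return lengths


def report(path: str) -> int:
    """Print the FASTA total for `path` and its difference from REPORTED_TOTAL."""
    # Assumption: files ending in .gz are gzip-compressed, anything else is plain text.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        lengths = fasta_lengths(handle)
    total = sum(lengths.values())
    print(f"{len(lengths)} sequences, {total} bp in total")
    print(f"difference from reported total: {total - REPORTED_TOTAL} bp")
    return total
```

Calling `report("GCA_964199945.1.fna")` (a placeholder path for wherever the downloaded FASTA file lives) prints the sequence count, the summed length, and how far that sum is from the reported total. The per-sequence dictionary returned by `fasta_lengths` can also be inspected to see which records account for any difference.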

Expected behavior

I would expect the "total" sequence length to include everything, organelles included. Otherwise, I would call it the length of the "nuclear" genome only.

Best regards, Matthieu

olearyna commented 1 month ago

Hi muffato

Thank you for highlighting this issue. I agree that it could be clearer, and we’ll work on improving it.

Nuala