Open muffato opened 2 months ago
datasets --version
Describe the bug
Hello NCBI !
The assembly GCA_964199945.1 is reported as having a "Total Sequence Length" of 1,327,610,284 bp, but the FASTA file actually contains 1,328,070,353 bp. The difference (460,069 bp) is exactly the MT and the plastid.
In https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/dataformat/tsv/dataformat_tsv_genome/:
To Reproduce
$ datasets summary genome accession GCA_964199945.1 --as-json-lines | dataformat tsv genome --fields assmstats-total-sequence-len --elide-header
1327610284
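The FASTA side of the comparison can be cross-checked with a short script. This is only a minimal sketch and not part of the `datasets` tooling: the function name `fasta_lengths` and the inline example records are made up for illustration, and it assumes an uncompressed, plain-text FASTA stream. The two totals at the end are the figures quoted in this report.

```python
import io
from typing import Dict, Optional, TextIO


def fasta_lengths(handle: TextIO) -> Dict[str, int]:
    """Count bases per record in a plain-text FASTA stream (hypothetical helper)."""
    lengths: Dict[str, int] = {}
    name: Optional[str] = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the record name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


# Made-up records, only to show the per-record breakdown.
example = io.StringIO(">chr1 nuclear\nACGTACGT\nACGT\n>MT mitochondrion\nACGTA\n")
lengths = fasta_lengths(example)
fasta_total = sum(lengths.values())

# Figures quoted in this report.
reported_total = 1_327_610_284  # assmstats-total-sequence-len
fasta_bp = 1_328_070_353        # bases counted in the FASTA file
difference = fasta_bp - reported_total
print(difference)
```

Run against the real FASTA file (opened with `open(path)` instead of the in-memory example), the per-record lengths make it easy to see which sequences account for the 460,069 bp gap.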
Expected behavior
I would expect the "total" sequence length to include everything. Otherwise, I would call it the length of the "nuclear" genome only.
Best regards, Matthieu
Hi muffato
Thank you for highlighting this issue. I agree that it could be clearer, and we’ll work on improving it.
Nuala