ncbi / datasets

NCBI Datasets is a new resource that lets you easily gather data from across NCBI databases.
https://www.ncbi.nlm.nih.gov/datasets

`dataformat` errors on non-latest version: `Dataformat doesn't recognize this input: Error: proto: (line 10:7): unknown field "uncompressedMd5Hex"` #353

Closed corneliusroemer closed 4 months ago

corneliusroemer commented 4 months ago

Describe the bug

Dataformat suddenly throws a cryptic error:

dataformat doesn't recognize this input
For best results
1. Make sure to use --as-json-lines with the datasets command
2. Make sure that you're using the latest version of the datasets command line tool
Use --force to remove this warning.
Download the latest version of the datasets command line tool: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install
Error: proto: (line 10:7): unknown field "uncompressedMd5Hex"

To Reproduce

datasets download virus genome taxon 186538 --no-progressbar --filename results/ncbi_dataset.zip
dataformat tsv virus-genome --package results/ncbi_dataset.zip --fields accession,bioprojects,biosample-acc,completeness,gene-count,geo-location,geo-region,host-common-name,host-infraspecific-breed,host-infraspecific-cultivar,host-infraspecific-ecotype,host-infraspecific-isolate,host-infraspecific-sex,host-infraspecific-strain,host-name,host-pangolin,host-tax-id,is-annotated,is-complete,is-lab-host,is-vaccine-strain,isolate-collection-date,isolate-lineage,isolate-lineage-source,lab-host,length,matpeptide-count,mol-type,nucleotide-completeness,protein-count,purpose-of-sampling,release-date,sourcedb,sra-accs,submitter-affiliation,submitter-country,submitter-names,update-date,virus-common-name,virus-infraspecific-breed,virus-infraspecific-cultivar,virus-infraspecific-ecotype,virus-infraspecific-isolate,virus-infraspecific-sex,virus-infraspecific-strain,virus-name,virus-pangolin,virus-tax-id > results/metadata_post_extract.tsv

Expected behavior

No error. Things worked fine just a few hours ago.

ericcox1 commented 4 months ago

Hi @corneliusroemer,

Thanks for opening this issue. It looks like we introduced this bug today and we will start working on a fix soon.

You have two options to get this working:

  1. Update to the latest version of the client, v16.13.0
  2. Use --force with dataformat to ignore the error

Best, Eric

Eric Cox, PhD [Contractor] (he/him/his) NCBI Datasets NIH/NLM/NCBI eric.cox@nih.gov

corneliusroemer commented 4 months ago

Thanks @ericcox1 for your quick reply! I'll try upgrading to 16.13.0 and see if that fixes it.

Are you sure --force fixes it? The output says "Use --force to remove this warning.", but below that there is an "Error", which is usually considered worse than a warning: Error: proto: (line 10:7): unknown field "uncompressedMd5Hex"

Are you running tests in CI to catch such errors before releasing? If you have a staging environment, you could verify that old clients (of the same major version) still work against the new server code. I'm assuming that is what happened here: something changed on the server and it broke old clients.

corneliusroemer commented 4 months ago

This issue appeared across multiple independent pipelines (Nextstrain and non-Nextstrain) that I'm monitoring, so it's likely a high-profile bug that's breaking things for a lot of users.

I can confirm that upgrading to 16.13.0 works. I haven't tested adding --force.
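
For pipelines that pin a client version, a guard along the following lines could flag an outdated client before dataformat runs. This is only a sketch under stated assumptions: the helper name version_at_least is hypothetical, the script assumes bash, and it assumes that `datasets --version` prints a line containing the dotted version number. The 16.13.0 threshold is the version mentioned in this thread.

```shell
#!/usr/bin/env bash

# Hypothetical helper: return 0 if dotted version $1 is >= $2,
# comparing numeric components left to right (missing components count as 0).
version_at_least() {
    local IFS=.
    local -a have=($1) want=($2)
    local i h w
    for ((i = 0; i < ${#have[@]} || i < ${#want[@]}; i++)); do
        h=${have[i]:-0}
        w=${want[i]:-0}
        # 10# forces base 10 so components like "08" are not read as octal
        if ((10#$h > 10#$w)); then return 0; fi
        if ((10#$h < 10#$w)); then return 1; fi
    done
    return 0
}

# Only run the check when the datasets client is on PATH.
if command -v datasets >/dev/null 2>&1; then
    # Assumption: the version output contains something like "16.13.0".
    installed=$(datasets --version 2>/dev/null | grep -Eo '[0-9]+(\.[0-9]+)+' | head -n 1)
    if ! version_at_least "${installed:-0}" "16.13.0"; then
        echo "datasets client ${installed:-unknown} is older than 16.13.0; consider upgrading" >&2
    fi
fi
```

The check only warns rather than aborting, since older clients may work fine against the server most of the time.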

ericcox1 commented 4 months ago

Hi @corneliusroemer,

We rolled back the change that caused this bug, so previous versions of the client should work normally again.

Thanks again for your report.

Best, Eric