ncbi / datasets

NCBI Datasets is a new resource that lets you easily gather data from across NCBI databases.
https://www.ncbi.nlm.nih.gov/datasets
Other
355 stars 39 forks source link

"dataformat does not recognize this input" error since August 17 (not before) on version 16.26.2 #404

Closed corneliusroemer closed 1 week ago

corneliusroemer commented 1 week ago

Describe the bug

Version 16.26.2 was working well until around 9pm UTC on August 17 (around 12 hours ago at time of writing).

Since then, when running dataformat on dataset download results in an error: dataformat doesn't recognize this input ... Error: unknown field "usaState"

To Reproduce

Steps to reproduce the behavior:

  1. Install version 16.26.2 - you can use this docker container with it installed:
    docker run -it ghcr.io/loculus-project/ingest@sha256:2507b529cdf384f207f93c2bab4c0f9a49bf39577b15eed26ad60e8f88ebc756 /bin/bash
  2. Download e.g. West Nile Virus datapackage:
    datasets download virus genome taxon 3048448 --no-progressbar --filename ncbi_dataset.zip
  3. Run dataformat with default configs on this:
    dataformat tsv virus-genome --package ncbi_dataset.zip >metadata_post_extract.tsv
  4. Observer results

Logs:

$ dataformat tsv virus-genome --package ncbi_dataset.zip >metadata_post_extract.tsv
dataformat doesn't recognize this input

For best results
1. Make sure to use --as-json-lines with the datasets command
2. Make sure that you're using the latest version of the datasets command line tool

Use --force to remove this warning.
Download the latest version of the datasets command line tool: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install
Error:  unknown field "usaState"
Usage
  dataformat tsv virus-genome [flags]
...SNIP...

Workaround

Upgrade to the latest version (16.29.0) seems to fix this. A look at its release notes shows an entry:

A new "U.S. State" field has been added to the virus data reports, enabling users to filter virus genome summary and download requests using the two-letter state code (e.g., NY). For additional details, please see the Virus Data Reports Documentation.

It seems plausible that this addition causes the backwards incompatible breakage for dataformats when that field is present in the content received from the server.

ericcox1 commented 1 week ago

Hi @corneliusroemer,

Thanks for your report.

It seems plausible that this addition causes the backwards incompatible breakage for dataformats when that field is present in the content received from the server.

That is correct.

We recommend updating both dataformat and datasets together to avoid this error.

Best, Eric

corneliusroemer commented 1 week ago

I think there's a misunderstanding - this error is not due mismatch between datasets and dataformat. Both are at the same version. The error appears when not updating to 16.29.0 at all.

It seems server side changes applied at the same time as the release of 16.29.0 broke dataformat for everyone who is not using the latest version.

It caused build failure in all Nextstrain ingest jobs, Pathoplexus etc.

It's a real breaking change for everyone who doesn't update to the latest version the second it's out.

Maybe I didn't make this clear enough in my issue report.

The way this should work, I think, is that the client (datasets) tells the server which version it has. And if the version is <16.29.0 then the client does not get the new field, so that it doesn't error on the unexpected usaState field.

If you expect to add more fields in the future that would cause similar breakage for past versions it might make sense to adjust the dataformat command to not error when it gets unexpected fields.

A manual prevention that users can apply to prevent this type of error in the future is to not use dataformat as it appears prone to this breakage - IIUC dataformat only turns a jsonl into a TSV in our case so it shouldn't be hard to replace it with a manual script that doesn't suffer from the type of bug encountered here.

ericcox1 commented 1 week ago

Hi @corneliusroemer,

Thanks for the clarification. You are correct, the new field is a breaking change when downloading a package using datasets and then using dataformat to generate a table.

While we consider how we can avoid such breaking change when we add new fields in the future, here are a couple of options for working around this problem:

  1. Update to the latest version of both datasets and dataformat (as you already figured out)
  2. Use the --force flag with dataformat to ignore the error. You could build this into your pipeline to ignore the error in the future. Please note that with the addition of new fields the order of columns in the table could be affected.
  3. Use datasets summary virus genome taxon 3048448 --as-json-lines | dataformat tsv virus-genome. This avoids triggering the error because datasets summary does not pick up the new field.

Thanks again for your feedback.

Best, Eric