nextstrain / rsv

Workflow for RSV analyses on Nextstrain.org
https://nextstrain.org/rsv

ingest failed on rule extend_metadata #65

Closed joverlee521 closed 3 months ago

joverlee521 commented 3 months ago

Yesterday's automated ingest workflow failed:

[batch] [2024-06-11T16:10:05+00:00] Traceback (most recent call last):
[batch] [2024-06-11T16:10:05+00:00]   File "/nextstrain/build/bin/extend-metadata.py", line 54, in <module>
[batch] [2024-06-11T16:10:05+00:00]     clades = pd.read_csv(args.nextclade, index_col=NEXTCLADE_JOIN_COLUMN_NAME,
[batch] [2024-06-11T16:10:05+00:00]   File "/usr/local/lib/python3.10/site-packages/pandas/util/_decorators.py", line 211, in wrapper
[batch] [2024-06-11T16:10:05+00:00]     return func(*args, **kwargs)
[batch] [2024-06-11T16:10:05+00:00]   File "/usr/local/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
[batch] [2024-06-11T16:10:05+00:00]     return func(*args, **kwargs)
[batch] [2024-06-11T16:10:05+00:00]   File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
[batch] [2024-06-11T16:10:05+00:00]     return _read(filepath_or_buffer, kwds)
[batch] [2024-06-11T16:10:05+00:00]   File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 611, in _read
[batch] [2024-06-11T16:10:05+00:00]     return parser.read(nrows)
[batch] [2024-06-11T16:10:05+00:00]   File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1778, in read
[batch] [2024-06-11T16:10:05+00:00]     ) = self._engine.read(  # type: ignore[attr-defined]
[batch] [2024-06-11T16:10:05+00:00]   File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 321, in read
[batch] [2024-06-11T16:10:05+00:00]     index, column_names = self._make_index(date_data, alldata, names)
[batch] [2024-06-11T16:10:05+00:00]   File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/base_parser.py", line 379, in _make_index
[batch] [2024-06-11T16:10:05+00:00]     simple_index = self._get_simple_index(alldata, columns)
[batch] [2024-06-11T16:10:05+00:00]   File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/base_parser.py", line 411, in _get_simple_index
[batch] [2024-06-11T16:10:05+00:00]     i = ix(idx)
[batch] [2024-06-11T16:10:05+00:00]   File "/usr/local/lib/python3.10/site-packages/pandas/io/parsers/base_parser.py", line 406, in ix
[batch] [2024-06-11T16:10:05+00:00]     raise ValueError(f"Index {col} invalid")
[batch] [2024-06-11T16:10:05+00:00] ValueError: Index seqName invalid

joverlee521 commented 3 months ago

Ah, this error is due to changes in Nextclade v3.7.0:

Previously, Nextclade treated output CSV/TSV columns index and seqName as mandatory and they were always present in the output files. In this release they are made configurable. One can:

in CLI: add or omit index and seqName values when using --output-columns-selection argument

The ingest workflow uses the --output-columns-selection option but does not include the seqName column:

https://github.com/nextstrain/rsv/blob/c3b634b02f6afc84b04491bdd51e2c4fba10cc49/ingest/workflow/snakemake_rules/sort.smk#L63-L71
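
The straightforward fix is presumably to add seqName to the --output-columns-selection value in that rule. Separately, a header check ahead of the pandas call in extend-metadata.py could turn the opaque "Index seqName invalid" into a message that names the cause. The following is only a minimal sketch, not the script's actual code: the helper names missing_columns and check_nextclade_header are hypothetical, and the value "seqName" for NEXTCLADE_JOIN_COLUMN_NAME is inferred from the traceback.

```python
import csv
import io

# Assumption: inferred from "ValueError: Index seqName invalid" in the traceback.
NEXTCLADE_JOIN_COLUMN_NAME = "seqName"


def missing_columns(handle, required, delimiter="\t"):
    """Return the names in `required` that are absent from the header row.

    `handle` is any open text file object; an empty file counts as
    having no columns at all.
    """
    header = next(csv.reader(handle, delimiter=delimiter), [])
    return [name for name in required if name not in header]


def check_nextclade_header(handle):
    """Raise a descriptive error if the join column is not in the header."""
    missing = missing_columns(handle, [NEXTCLADE_JOIN_COLUMN_NAME])
    if missing:
        raise ValueError(
            "Nextclade output is missing column(s) "
            + ", ".join(missing)
            + "; since Nextclade v3.7.0 they must be listed explicitly "
            + "in --output-columns-selection"
        )


# Hypothetical header rows, for illustration only.
without_join_column = io.StringIO("clade\tcoverage\nA\t0.98\n")
with_join_column = io.StringIO("seqName\tclade\nsample1\tA\n")

# The first header lacks the join column, the second does not.
missing_before = missing_columns(without_join_column, [NEXTCLADE_JOIN_COLUMN_NAME])
missing_after = missing_columns(with_join_column, [NEXTCLADE_JOIN_COLUMN_NAME])
```

In the script, check_nextclade_header would be called on the opened Nextclade TSV just before the pd.read_csv call, so a future change in the output columns fails with a message that points at the Nextclade option rather than at pandas internals.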

joverlee521 commented 3 months ago

Just FYI @ivan-aksamentov, the change in the --output-columns-selection behavior was a breaking change for this specific workflow.

joverlee521 commented 3 months ago

Following up: I do not see the --output-columns-selection option used in any other Nextstrain pathogen repo (GH search query), and I have not seen this error in any other automated workflow.

ivan-aksamentov commented 3 months ago

Whoops