elray1 closed this 3 weeks ago
Update: we now abort the pipeline if an API call fails or when no root sequence file is returned with the tree info.
Documenting a recent error that should also be handled. This happened when NCBI returned a successful status code without the sequence information, which subsequently caused an error in the nextclade CLI:
assign_clades --sequence-released-since-date 2024-06-17 --reference-tree-date 2024-06-21
Directory where the clade assignment file will be saved (do not use ~) [/Users/user/code/variant-nowcast-hub/data-pipeline/src/covid_variant_pipeline/data]:
2024-06-21T21:14:53.343689Z [info ] Starting pipeline filename=assign_clades.py lineno=174 reference_tree_date=datetime.datetime(2024, 6, 21, 0, 0) run_time=20240621T171451
2024-06-21T21:14:53.347472Z [info ] NCBI API call starting filename=sequence.py lineno=40 released_since_date=2024-06-17T00:00:00.000Z
2024-06-21T21:15:22.388839Z [info ] NCBI API call completed elapsed=29.04090629192069 filename=sequence.py lineno=57
2024-06-21T21:15:22.412547Z [info ] NCBI SARS-COV-2 genome package downloaded and unzipped filename=assign_clades.py lineno=48 package_location=PosixPath('/Users/user/code/variant-nowcast-hub/data-pipeline/src/covid_variant_pipeline/data/20240621T171451/ncbi.zip')
2024-06-21T21:15:22.609964Z [info ] extracted sequence metadata filename=assign_clades.py lineno=70 metadata_file=PosixPath('/Users/user/code/variant-nowcast-hub/data-pipeline/src/covid_variant_pipeline/data/20240621T171451/2024-06-17-metadata.tsv')
2024-06-21T21:15:22.971290Z [info ] Reference data retrieved filename=reference.py lineno=22 tree_updated=2024-06-13
2024-06-21T21:15:23.361086Z [info ] Reference data saved filename=assign_clades.py lineno=85 root_sequence_path=/Users/user/code/variant-nowcast-hub/data-pipeline/src/covid_variant_pipeline/data/20240621T171451/2024-06-21_root_sequence.fasta tree_path=/Users/user/code/variant-nowcast-hub/data-pipeline/src/covid_variant_pipeline/data/20240621T171451/2024-06-21_tree.json
2024-06-21T21:15:23.364184Z [info ] Assigning sequences to clades using reference tree filename=assign_clades.py lineno=95
The application panicked (crashed).
Message: called `Result::unwrap()` on an `Err` value:
0: When opening file '"/Users/user/code/variant-nowcast-hub/data-pipeline/src/covid_variant_pipeline/data/20240621T171451/ncbi_dataset/data/genomic.fna"'
1: No such file or directory (os error 2)
Location:
/workdir/packages/nextclade/src/io/file.rs:29
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
Location: packages/nextclade-cli/src/cli/nextclade_loop.rs:77
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
Traceback (most recent call last):
File "/Users/user/code/variant-nowcast-hub/data-pipeline/.venv/bin/assign_clades", line 8, in <module>
sys.exit(main())
^^^^^^
File "/Users/user/code/variant-nowcast-hub/data-pipeline/.venv/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/code/variant-nowcast-hub/data-pipeline/.venv/lib/python3.11/site-packages/rich_click/rich_command.py", line 152, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/Users/user/code/variant-nowcast-hub/data-pipeline/.venv/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/code/variant-nowcast-hub/data-pipeline/.venv/lib/python3.11/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/code/variant-nowcast-hub/data-pipeline/src/covid_variant_pipeline/assign_clades.py", line 182, in main
merged_data = merge_metadata(config)
^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/code/variant-nowcast-hub/data-pipeline/src/covid_variant_pipeline/assign_clades.py", line 122, in merge_metadata
df_assignments = pl.read_csv(config.assignment_no_metadata_file, separator=";")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/code/variant-nowcast-hub/data-pipeline/.venv/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 135, in wrapper
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/code/variant-nowcast-hub/data-pipeline/.venv/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 135, in wrapper
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/code/variant-nowcast-hub/data-pipeline/.venv/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 135, in wrapper
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/code/variant-nowcast-hub/data-pipeline/.venv/lib/python3.11/site-packages/polars/io/csv/functions.py", line 419, in read_csv
df = _read_csv_impl(
^^^^^^^^^^^^^^^
File "/Users/user/code/variant-nowcast-hub/data-pipeline/.venv/lib/python3.11/site-packages/polars/io/csv/functions.py", line 565, in _read_csv_impl
pydf = PyDataFrame.read_csv(
^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: No such file or directory (os error 2): ...covid_variant_pipeline/data/20240621T171451/2024-06-17_clade_assignments_no_metadata.csv
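One way to catch this case before nextclade runs would be to check that the sequence file actually exists once the package is unzipped. A minimal sketch, assuming the file is expected at ncbi_dataset/data/genomic.fna under the run directory (as in the error above). `PipelineInputError` and `require_nonempty_file` are hypothetical names, not part of the current pipeline:

```python
from pathlib import Path


class PipelineInputError(RuntimeError):
    """Hypothetical error type for a missing or empty pipeline input."""


def require_nonempty_file(path: Path, description: str) -> Path:
    """Return `path` if it is an existing, non-empty file; otherwise raise."""
    if not path.is_file():
        raise PipelineInputError(f"{description} not found: {path}")
    if path.stat().st_size == 0:
        raise PipelineInputError(f"{description} is empty: {path}")
    return path
```

Calling something like `require_nonempty_file(run_dir / "ncbi_dataset" / "data" / "genomic.fna", "NCBI sequence file")` right after the unzip step (with `run_dir` standing in for the timestamped data directory) would stop the run with a clear message instead of letting nextclade panic on the missing file.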
We're no longer planning to use the NCBI API.
e.g., I just got the following messages:
followed by a bunch of other stuff from other steps running before an error actually occurred here:
Since a correct workflow relies on all steps running, maybe we should fail on errors early on?
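A rough sketch of what failing early could look like, assuming each external step is launched as a subprocess and signals failure with a non-zero exit code. `run_step` and `run_pipeline` are hypothetical helpers for illustration, not the pipeline's actual functions:

```python
import subprocess


def run_step(cmd: list[str]) -> None:
    """Run one external step; raise CalledProcessError on a non-zero exit code."""
    # check=True turns a failed command into an exception instead of a silent return
    subprocess.run(cmd, check=True)


def run_pipeline(steps: list[list[str]]) -> None:
    """Run steps in order; the first failing step stops everything after it."""
    for cmd in steps:
        run_step(cmd)
```

With this shape, the exception from the first failing step propagates up, so later steps never run against missing outputs.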