reichlab / variant-nowcast-hub

A repository to store COVID-19 variant nowcasts collected as a modeling hub.
MIT License

catch and fail early on errors in downloading data #10

Open elray1 opened 1 month ago

elray1 commented 1 month ago

For example, I just got the following messages:

2024-05-14T17:00:35.209125Z [info     ] Starting pipeline              as_of_date=2024-05-14 filename=assign_clades.py lineno=193 run_time=20240514T130035
New version of client (16.16.0) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/mac/datasets.
Downloading: /Users/elray/research/epi/covid/variant-modeling/variant-nowcast-hub/data-pipeline/src/covid_variant_pipeline/data/sequence/20240514T130035/ncbi.zip    685kB done
Validating package []
Error: Internal error (invalid zip archive). Please try again

This was followed by output from several later steps, which kept running until an error finally occurred here:

2024-05-14T17:02:52.280930Z [info     ] Assigning sequences to clades using reference tree /Users/elray/research/epi/covid/variant-modeling/variant-nowcast-hub/data-pipeline/src/covid_variant_pipeline/data/reference/2024-05-14_tree.json filename=assign_clades.py lineno=115
The application panicked (crashed).
Message:  called `Result::unwrap()` on an `Err` value: 
   0: When opening file '"/Users/elray/research/epi/covid/variant-modeling/variant-nowcast-hub/data-pipeline/src/covid_variant_pipeline/data/sequence/20240514T130035/ncbi_dataset/data/genomic.fna"'
   1: No such file or directory (os error 2)

Location:
   /workdir/packages/nextclade/src/io/file.rs:29

Since a correct workflow relies on every step succeeding, maybe we should fail as soon as a step reports an error?
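One way this could look, as a minimal sketch rather than the pipeline's actual code: validate the downloaded package right after the download step and raise before any later step runs. The names `PipelineError` and `validate_sequence_package` are hypothetical, and the expected member path is an assumption taken from the nextclade error message above.

```python
import zipfile
from pathlib import Path

# Assumption: the archive member nextclade later needs, per the error message above.
EXPECTED_MEMBER = "ncbi_dataset/data/genomic.fna"


class PipelineError(RuntimeError):
    """Hypothetical exception used to abort the pipeline at the failing step."""


def validate_sequence_package(package_path: Path, expected_member: str = EXPECTED_MEMBER) -> None:
    """Raise PipelineError unless package_path is a readable zip containing expected_member."""
    if not package_path.is_file():
        raise PipelineError(f"download did not produce a file: {package_path}")
    if not zipfile.is_zipfile(package_path):
        raise PipelineError(f"not a valid zip archive: {package_path}")
    with zipfile.ZipFile(package_path) as archive:
        # testzip() returns the name of the first corrupt member, or None.
        corrupt = archive.testzip()
        if corrupt is not None:
            raise PipelineError(f"corrupt member {corrupt!r} in {package_path}")
        if expected_member not in archive.namelist():
            raise PipelineError(f"{expected_member} missing from {package_path}")
```

Calling something like this immediately after the download would surface the "invalid zip archive" problem at the step that caused it, instead of two minutes later inside nextclade.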

bsweger commented 1 week ago

Update: we now abort the pipeline if an API call fails or when no root sequence file is returned with the tree info.

Documenting a recent error that should also be handled. This happened when NCBI returned a successful status code without the sequence information, which subsequently caused an error in the nextclade CLI:

assign_clades --sequence-released-since-date 2024-06-17 --reference-tree-date 2024-06-21
Directory where the clade assignment file will be saved (do not use ~) [/Users/user/code/variant-nowcast-hub/data-pipeline/src/covid_variant_pipeline/data]:
2024-06-21T21:14:53.343689Z [info     ] Starting pipeline              filename=assign_clades.py lineno=174 reference_tree_date=datetime.datetime(2024, 6, 21, 0, 0) run_time=20240621T171451
2024-06-21T21:14:53.347472Z [info     ] NCBI API call starting         filename=sequence.py lineno=40 released_since_date=2024-06-17T00:00:00.000Z
2024-06-21T21:15:22.388839Z [info     ] NCBI API call completed        elapsed=29.04090629192069 filename=sequence.py lineno=57
2024-06-21T21:15:22.412547Z [info     ] NCBI SARS-COV-2 genome package downloaded and unzipped filename=assign_clades.py lineno=48 package_location=PosixPath('/Users/user/code/variant-nowcast-hub/data-pipeline/src/covid_variant_pipeline/data/20240621T171451/ncbi.zip')
2024-06-21T21:15:22.609964Z [info     ] extracted sequence metadata    filename=assign_clades.py lineno=70 metadata_file=PosixPath('/Users/user/code/variant-nowcast-hub/data-pipeline/src/covid_variant_pipeline/data/20240621T171451/2024-06-17-metadata.tsv')
2024-06-21T21:15:22.971290Z [info     ] Reference data retrieved       filename=reference.py lineno=22 tree_updated=2024-06-13
2024-06-21T21:15:23.361086Z [info     ] Reference data saved           filename=assign_clades.py lineno=85 root_sequence_path=/Users/user/code/variant-nowcast-hub/data-pipeline/src/covid_variant_pipeline/data/20240621T171451/2024-06-21_root_sequence.fasta tree_path=/Users/user/code/variant-nowcast-hub/data-pipeline/src/covid_variant_pipeline/data/20240621T171451/2024-06-21_tree.json
2024-06-21T21:15:23.364184Z [info     ] Assigning sequences to clades using reference tree filename=assign_clades.py lineno=95
The application panicked (crashed).
Message:  called `Result::unwrap()` on an `Err` value:
   0: When opening file '"/Users/user/code/variant-nowcast-hub/data-pipeline/src/covid_variant_pipeline/data/20240621T171451/ncbi_dataset/data/genomic.fna"'
   1: No such file or directory (os error 2)

Location:
   /workdir/packages/nextclade/src/io/file.rs:29

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
Location: packages/nextclade-cli/src/cli/nextclade_loop.rs:77

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
Traceback (most recent call last):
  File "/Users/user/code/variant-nowcast-hub/data-pipeline/.venv/bin/assign_clades", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/user/code/variant-nowcast-hub/data-pipeline/.venv/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/code/variant-nowcast-hub/data-pipeline/.venv/lib/python3.11/site-packages/rich_click/rich_command.py", line 152, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/user/code/variant-nowcast-hub/data-pipeline/.venv/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/code/variant-nowcast-hub/data-pipeline/.venv/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/code/variant-nowcast-hub/data-pipeline/src/covid_variant_pipeline/assign_clades.py", line 182, in main
    merged_data = merge_metadata(config)
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/code/variant-nowcast-hub/data-pipeline/src/covid_variant_pipeline/assign_clades.py", line 122, in merge_metadata
    df_assignments = pl.read_csv(config.assignment_no_metadata_file, separator=";")
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/code/variant-nowcast-hub/data-pipeline/.venv/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 135, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/code/variant-nowcast-hub/data-pipeline/.venv/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 135, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/code/variant-nowcast-hub/data-pipeline/.venv/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 135, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/code/variant-nowcast-hub/data-pipeline/.venv/lib/python3.11/site-packages/polars/io/csv/functions.py", line 419, in read_csv
    df = _read_csv_impl(
         ^^^^^^^^^^^^^^^
  File "/Users/user/code/variant-nowcast-hub/data-pipeline/.venv/lib/python3.11/site-packages/polars/io/csv/functions.py", line 565, in _read_csv_impl
    pydf = PyDataFrame.read_csv(
           ^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: No such file or directory (os error 2): ...covid_variant_pipeline/data/20240621T171451/2024-06-17_clade_assignments_no_metadata.csv
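The traceback above suggests the pipeline carried on to `merge_metadata` after the nextclade CLI had already crashed. A possible guard, sketched here under assumptions: check that each input file exists and is non-empty before calling an external tool, let a non-zero exit status raise, and check the expected outputs afterwards. The helper names `require_nonempty_file` and `run_checked` are hypothetical, and how the pipeline actually invokes nextclade is not shown in this thread.

```python
import subprocess
from pathlib import Path


def require_nonempty_file(path: Path, produced_by: str) -> Path:
    """Return path if it is an existing, non-empty file; otherwise raise FileNotFoundError."""
    if not path.is_file() or path.stat().st_size == 0:
        raise FileNotFoundError(f"{produced_by} did not produce a usable file: {path}")
    return path


def run_checked(cmd: list[str], inputs: list[Path], outputs: list[Path]) -> None:
    """Run an external command, failing fast on missing inputs, errors, or missing outputs."""
    for path in inputs:
        require_nonempty_file(path, "an earlier step")
    # check=True raises subprocess.CalledProcessError on a non-zero exit status.
    subprocess.run(cmd, check=True)
    for path in outputs:
        require_nonempty_file(path, cmd[0])
```

With a wrapper like this, a missing `genomic.fna` would stop the run before nextclade starts, and a crashed CLI call would stop it before the CSV read.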