nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0
268 stars 128 forks source link

parse: crash on invalid dates "0", "99999" instead of handling gracefully as intended #1669

Open corneliusroemer opened 2 weeks ago

corneliusroemer commented 2 weeks ago

Description

When parsing FASTA headers containing "0" or "99999" as a date value, augur parse fails with an unhelpful error message instead of handling the invalid date more gracefully.

Steps to Reproduce

  1. Create a test FASTA file:

    echo -e ">someid|0\nACTG" > test.fa
  2. Run augur parse:

    augur parse --sequences test.fa --output-sequences out.fa --output-metadata out.tsv --fields id date --fix-dates dayfirst

Current Behavior

The command fails with error:

ERROR: Invalid date '0-XX-XX': Date does not match format `%Y-%m-%d`.

Expected Behavior

--fix-dates is documented as an attempt to parse non-standard dates that "should be spot-checked". The documented behavior of fix_dates is to return the input string if fixing fails and emit a warning. So it should never crash.

Stacktrace

The stack trace is actually quite interesting, showing how convoluted the current implementation is:

  1. Tries pandas newer method: parse_datetime_string_with_reso() ❌ (AttributeError)
  2. Falls back to pandas older method: parse_time_string() ❌ (DateParseError)
  3. Tries custom parsing with get_numerical_date_from_value() ❌ (InvalidDate)
  4. Finally raises AugurError

https://github.com/nextstrain/augur/blob/873185188afaaae7461a56ac6691388630c6fd35/augur/parse.py#L28-L65

``` Traceback (most recent call last): File "/Users/corneliusromer/code/augur_master/augur/parse.py", line 41, in fix_dates results = parsing.parse_datetime_string_with_reso(d, dayfirst=dayfirst) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ AttributeError: module 'pandas._libs.tslibs.parsing' has no attribute 'parse_datetime_string_with_reso' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "pandas/_libs/tslibs/parsing.pyx", line 440, in pandas._libs.tslibs.parsing.parse_datetime_string_with_reso File "pandas/_libs/tslibs/parsing.pyx", line 667, in pandas._libs.tslibs.parsing.dateutil_parse ValueError: day is out of range for month During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/Users/corneliusromer/code/augur_master/augur/parse.py", line 44, in fix_dates results = parsing.parse_time_string(d, dayfirst=dayfirst) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "pandas/_libs/tslibs/parsing.pyx", line 367, in pandas._libs.tslibs.parsing.parse_time_string File "pandas/_libs/tslibs/parsing.pyx", line 445, in pandas._libs.tslibs.parsing.parse_datetime_string_with_reso pandas._libs.tslibs.parsing.DateParseError: day is out of range for month During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/Users/corneliusromer/code/augur_master/augur/dates/init.py", line 120, in get_numerical_date_from_value ambig_date = AmbiguousDate(value, fmt=fmt).range(min_max_year=min_max_year) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/corneliusromer/code/augur_master/augur/dates/ambiguous_date.py", line 48, in init self.assert_only_less_significant_uncertainty() File "/Users/corneliusromer/code/augur_master/augur/dates/ambiguous_date.py", line 138, in assert_only_less_significant_uncertainty if "X" in self.uncertain_date_components["Y"]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/corneliusromer/code/augur_master/augur/dates/ambiguous_date.py", line 97, in uncertain_date_components raise InvalidDate(self.uncertain_date, augur.dates.errors.InvalidDate: Invalid date '0-XX-XX': Date does not match format %Y-%m-%d. The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/Users/corneliusromer/code/augur_master/blackbox.py", line 94, in run_probes() File "/Users/corneliusromer/code/augur_master/blackbox.py", line 80, in run_probes result, warnings = probe_date_parser(test_input, dayfirst) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/corneliusromer/code/augur_master/blackbox.py", line 20, in probe_date_parser result = fix_dates(test_input, dayfirst) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/corneliusromer/code/augur_master/augur/parse.py", line 60, in fix_dates parsed_date = get_numerical_date_from_value(d, "%Y-%m-%d") ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/corneliusromer/code/augur_master/augur/dates/init.py", line 122, in get_numerical_date_from_value raise AugurError(str(error)) from error augur.errors.AugurError: Invalid date '0-XX-XX': Date does not match format %Y-%m-%d. ```

Environment

current master of Augur (26.0.0)

Additional Notes

The documentation states that date parsing is a "brittle process that should be spot-checked", suggesting the tool should be more resilient to invalid date formats rather than failing completely.

I discovered this while trying to migrate fix_dates to be compatible with pandas v2. Given the complexity of the current implementation, I thought I'll start treating it as a blackbox, discover current behavior and then reimplement it with the help of unit tests in a future proof way.

It turns out that there are unhandled edge cases in the current implementation.