When parsing FASTA headers containing "0" or "99999" as a date value, augur parse fails with an unhelpful error message instead of handling the invalid date more gracefully.
Steps to Reproduce
Create a test FASTA file:
echo -e ">someid|0\nACTG" > test.fa
Run augur parse:
augur parse --sequences test.fa --output-sequences out.fa --output-metadata out.tsv --fields id date --fix-dates dayfirst
Current Behavior
The command fails with error:
ERROR: Invalid date '0-XX-XX': Date does not match format `%Y-%m-%d`.
Expected Behavior
--fix-dates is documented as an attempt to parse non-standard dates that "should be spot-checked". The documented behavior of fix_dates is to return the input string if fixing fails and emit a warning. So it should never crash.
Stacktrace
The stack trace is actually quite interesting, showing how convoluted the current implementation is:
```
Traceback (most recent call last):
File "/Users/corneliusromer/code/augur_master/augur/parse.py", line 41, in fix_dates
results = parsing.parse_datetime_string_with_reso(d, dayfirst=dayfirst)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'pandas._libs.tslibs.parsing' has no attribute 'parse_datetime_string_with_reso'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "pandas/_libs/tslibs/parsing.pyx", line 440, in pandas._libs.tslibs.parsing.parse_datetime_string_with_reso
File "pandas/_libs/tslibs/parsing.pyx", line 667, in pandas._libs.tslibs.parsing.dateutil_parse
ValueError: day is out of range for month
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/corneliusromer/code/augur_master/augur/parse.py", line 44, in fix_dates
results = parsing.parse_time_string(d, dayfirst=dayfirst)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pandas/_libs/tslibs/parsing.pyx", line 367, in pandas._libs.tslibs.parsing.parse_time_string
File "pandas/_libs/tslibs/parsing.pyx", line 445, in pandas._libs.tslibs.parsing.parse_datetime_string_with_reso
pandas._libs.tslibs.parsing.DateParseError: day is out of range for month
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/corneliusromer/code/augur_master/augur/dates/init.py", line 120, in get_numerical_date_from_value
ambig_date = AmbiguousDate(value, fmt=fmt).range(min_max_year=min_max_year)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/corneliusromer/code/augur_master/augur/dates/ambiguous_date.py", line 48, in init
self.assert_only_less_significant_uncertainty()
File "/Users/corneliusromer/code/augur_master/augur/dates/ambiguous_date.py", line 138, in assert_only_less_significant_uncertainty
if "X" in self.uncertain_date_components["Y"]:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/corneliusromer/code/augur_master/augur/dates/ambiguous_date.py", line 97, in uncertain_date_components
raise InvalidDate(self.uncertain_date,
augur.dates.errors.InvalidDate: Invalid date '0-XX-XX': Date does not match format %Y-%m-%d.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/corneliusromer/code/augur_master/blackbox.py", line 94, in
run_probes()
File "/Users/corneliusromer/code/augur_master/blackbox.py", line 80, in run_probes
result, warnings = probe_date_parser(test_input, dayfirst)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/corneliusromer/code/augur_master/blackbox.py", line 20, in probe_date_parser
result = fix_dates(test_input, dayfirst)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/corneliusromer/code/augur_master/augur/parse.py", line 60, in fix_dates
parsed_date = get_numerical_date_from_value(d, "%Y-%m-%d")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/corneliusromer/code/augur_master/augur/dates/init.py", line 122, in get_numerical_date_from_value
raise AugurError(str(error)) from error
augur.errors.AugurError: Invalid date '0-XX-XX': Date does not match format %Y-%m-%d.
```
Environment
current master of Augur (26.0.0)
Additional Notes
The documentation states that date parsing is a "brittle process that should be spot-checked", suggesting the tool should be more resilient to invalid date formats rather than failing completely.
I discovered this while trying to migrate fix_dates to be compatible with pandas v2. Given the complexity of the current implementation, I thought I'll start treating it as a blackbox, discover current behavior and then reimplement it with the help of unit tests in a future proof way.
It turns out that there are unhandled edge cases in the current implementation.
Description
When parsing FASTA headers containing "0" or "99999" as a date value, augur parse fails with an unhelpful error message instead of handling the invalid date more gracefully.
Steps to Reproduce
Create a test FASTA file:
Run augur parse:
Current Behavior
The command fails with error:
Expected Behavior
--fix-dates
is documented as an attempt to parse non-standard dates that "should be spot-checked". The documented behavior offix_dates
is to return the input string if fixing fails and emit a warning. So it should never crash.Stacktrace
The stack trace is actually quite interesting, showing how convoluted the current implementation is:
https://github.com/nextstrain/augur/blob/873185188afaaae7461a56ac6691388630c6fd35/augur/parse.py#L28-L65
Environment
current master of Augur (26.0.0)
Additional Notes
The documentation states that date parsing is a "brittle process that should be spot-checked", suggesting the tool should be more resilient to invalid date formats rather than failing completely.
I discovered this while trying to migrate
fix_dates
to be compatible with pandas v2. Given the complexity of the current implementation, I thought I'll start treating it as a blackbox, discover current behavior and then reimplement it with the help of unit tests in a future proof way.It turns out that there are unhandled edge cases in the current implementation.