nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0
268 stars 127 forks source link

filter: Subsampling crashes on numeric dates #844

Open victorlin opened 2 years ago

victorlin commented 2 years ago

Current Behavior

Having any floating-point numeric date crashes subsampling if grouping by year or year month.

Expected behavior

Not sure, but not a crash.

How to reproduce

echo 'strain\tcountry\tdate
SEQ1\tA\t2018.23
SEQ2\tB\t2018-03-25' > metadata.tsv

augur filter \
  --metadata metadata.tsv \
  --group-by country year \
  --subsample-max-sequences 10 \
  --output-strains out.txt

Output:

Traceback (most recent call last):
...

TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'

The above exception was the direct cause of the following exception:
...

TypeError: cannot safely cast non-equivalent float64 to int64

Specifically, this is coming from augur.filter.get_groups_for_subsampling.

Click to expand Python example ```py import augur import pandas as pd columns = ['strain', 'date', 'country'] data = [ ("SEQ_1","2018.23","A"), ("SEQ_2","2020-03-25","A"), ] metadata = pd.DataFrame.from_records(data, columns=columns).set_index('strain') strains = metadata.index.tolist() groups = ['country', 'year', 'month'] augur.filter.get_groups_for_subsampling(strains, metadata, group_by=groups) ``` Output: ``` ... 125 try: --> 126 return values.astype(dtype, casting="safe", copy=copy) 127 except TypeError as err: TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe' ```

Your environment: if running Nextstrain locally

victorlin commented 2 years ago

The culprit code is from my rewrite of get_groups_for_subsampling in #794:

https://github.com/nextstrain/augur/blob/af23a7bcb7542d5ae61cc76841661dddecd43d39/augur/filter.py#L918-L922

This function is assuming that everything is in a potentially ambiguous/incomplete YYYY-MM-DD format.

However, it looks like this was already the case before the PR:

https://github.com/nextstrain/augur/blob/d553f5ea01bbf99637f6faa30f26ee9cdeee96c6/augur/filter.py#L904-L917

So I'm not sure that this was ever a supported use-case, likely unintentionally.

victorlin commented 2 years ago

As for expected behavior, one possibility is that Augur is smart enough to extract year/month/day from numeric dates, or at least year/month which are supported --group-by values.

This could leverage treetime.utils.datetime_from_numeric which converts a numeric date into a datetime.datetime:

import treetime
treetime.utils.datetime_from_numeric(2018.23)
# datetime.datetime(2018, 3, 25, 0, 0)

However, this won't work for non-positive dates:

treetime.utils.datetime_from_numeric(-2018.23)
# ...
# ValueError: year -2018 is out of range

Which is part of a larger discussion of how to support various date formats in Augur: #882