nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0
268 stars 127 forks source link

Allow precise date ranges #1304

Open corneliusroemer opened 11 months ago

corneliusroemer commented 11 months ago

Context

Currently, date ranges can only be defined by ambiguous/incomplete dates in the format YYYY-MM-DD with some characters swapped in with the wildcard character X¹. This only allows date ranges to be bounded by start/end of a year or month and not a more precise range within or across year/month boundaries.

¹ this could be better documented: https://github.com/nextstrain/augur/issues/882

Scope

Anywhere that currently accepts ambiguous/incomplete dates (i.e. uses AmbiguousDate):

Potential solutions

  1. ⛔️ Allow scalar boundary values under new columns e.g. --min-date-column / --max-date-column
  2. ✅ Allow a new range format for values under the existing date column¹
    1. ⛔️ Allow the TreeTime format of [min:max] where min and max are numeric dates
    2. ✅ Allow the ISO format of min/max where min and max are ISO dates
    3. ⛔️ Merge (1) and (2) allowing a format of [min/max] where min and max are any scalar date accepted by Augur (ISO or numeric)

¹ there is a feature request to make the date column name customizable: https://github.com/nextstrain/augur/issues/1443


original issue description

Context

To ensure patient privacy, Denmark bins SARS-CoV-2 collection dates to the Monday of the week the actual collection date lies in.

Currently, we seem to be unable to specify such ambiguity within augur refine even though treetime supports arbitrarily constrained date ranges.

The only workaround right now I can think of is to make the date ambiguous, but that throws away information while also not working in situations where the week crosses a month boundary.

The issue has become particularly noticeable when making BA.2.86 trees where Denmark has provided ~30% of global sequences so far. To remove bias, I could add 3/4 days to the dates, but it would be nice if refine just accepted ranges as such.

Description

Make refine support arbitrary min/max input dates.

Possible solution

The simplest way to implement this would be to accept the min/max date format that's natively supported by treetime: [min:max] as in [2022.3452:2022.3649] (opening bracket, min date as year float, colon, max date as year float, closing bracket)

A neater solution might be to allow min and max dates to be specified through two columns: --min-date and --max-date.

This change could potentially be made across all date handling: it would be more general than our current x'ing strategy, e.g. 2021-10-XX for ambiguous date of month. That'd be a lot of work though.

victorlin commented 1 month ago

This was discussed recently on Slack with some different ideas. I'll plan to work on this soon so I've updated the issue description with latest ideas.

Commenting on some things from the original issue description:

The simplest way to implement this would be to accept the min/max date format that's natively supported by treetime: [min:max] as in [2022.3452:2022.3649]

On the Slack discussion the existing ISO standard format <start>/<end> was brought up as a 2nd option. But I do think the brackets are nice. I've proposed a 3rd option. While it doesn't follow any standards, it seemed like the best of both worlds: the brackets convey an inclusive range notation, and : could be problematic down the road if supporting ISO dates with a time part. Both added to the new issue description.

A neater solution might be to allow min and max dates to be specified through two columns: --min-date and --max-date.

I'll shoot this down as this would be confusing if/when we're to bring this to augur filter which already has --min-date/--max-date. I think it makes more sense to augment the existing handling of ambiguous dates in the date column with added support for precise date ranges.

This change could potentially be made across all date handling: it would be more general than our current x'ing strategy, e.g. 2021-10-XX for ambiguous date of month. That'd be a lot of work though.

Making this change across all date handling is the best/easiest way to do it since date handling is centralized under augur/dates. Doing it only in one subcommand would require extra work and cause more confusion to users.

trvrb commented 1 month ago

Cool! I agree to ditch --min-date and --max-date columns. I'm happy with various options for ranges:

Though I think I most prefer either Pythonic colon/bracket eg [2024-01-01:2024-06-01], or ISO slash eg 2024-01-01/2024-06-01 rather than mishmash. I don't see an issue with : and including time part of date, just because molecular clock dating, sample collection, etc... should never demand this level of precision.

jameshadfield commented 1 month ago

I vote to stay consistent with the ISO format we're already following, I don't see any compelling reasons to deviate from it. It also allows more concise forms like "2008-02-15/03-14", where "/03-14" implies "/2008-03-14".

victorlin commented 1 month ago

I considered blocking this on #936 but #1305 shows that this functionality can be implemented within get_numerical_date_from_value() which should be sufficient.

tsibley commented 1 month ago

Consistent with ISO gets my vote.

In past work for Seattle Flu Study, we used range types in various places. That used Postgres range syntax, which includes support for inclusive ([ and ]) vs. exclusive (( and )) bounds and omitted bounds (e.g. [2008-02-15, )). Would those features be useful/desired here?