nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0
Add documentation page on supported date formats #882

Open victorlin opened 2 years ago

victorlin commented 2 years ago

#740 improves the help text for filter's --min-date/--max-date. Similar documentation should also be provided for metadata dates, which are handled slightly differently (e.g. no support for relative dates). Moving some things over from an old wiki page as a starter:

Overview

Augur supports a variety of date formats:

Generally, this comes down to flavors of numerical or (potentially incomplete) ISO dates.

Implementation

Internally, Augur stores dates in numerical format for the following reasons:

  1. Pre-historic dates (BC) are not supported by some implementations of ISO dates, for example Python’s own datetime.
  2. During initial implementation, dates needed to be numerical for some uses (e.g. timetree) and it was easier to just convert to numerical and treat them this way across the board.
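
To illustrate both points, here is a minimal sketch of converting a calendar date to a decimal year. The helper name `to_numeric_date` and the fraction-of-year convention are assumptions made for illustration and are not necessarily what Augur uses internally.

```python
import datetime


def to_numeric_date(date: datetime.date) -> float:
    """Convert a calendar date to a decimal year (hypothetical helper).

    The fractional part is the share of the year elapsed at the start of
    the given day, so leap years are divided by 366 days.
    """
    start = datetime.date(date.year, 1, 1)
    end = datetime.date(date.year + 1, 1, 1)
    return date.year + (date - start).days / (end - start).days


print(to_numeric_date(datetime.date(2020, 1, 1)))  # 2020.0
print(to_numeric_date(datetime.date(2020, 7, 2)))  # 2020.5 (day 183 of 366)

# Python's datetime cannot represent years before 1, whereas a plain
# float such as -500.0 has no such limit.
try:
    datetime.date(-500, 1, 1)
except ValueError as error:
    print("datetime cannot represent this year:", error)
```

Once dates are floats, comparisons and arithmetic (such as those a time tree needs) work the same way for any year.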

Related discussions

huddlej commented 2 years ago

This is a great idea! We should also consider linking out to examples of ISO 8601 dates, since users may not be familiar with this term (or may not realize that they already know the associated formats). Linking to the ISO 8601 calendar dates and durations sections on Wikipedia would be fine for this.

victorlin commented 2 years ago

Relatedly, I've done a bunch of date parsing work in #854 (see dates.py), but this has yet to be merged.

j23414 commented 2 years ago

Do we want to also support and document the following formats?

Fauna's format_date function seems to process it, but I assume it will be superseded by augur's version. I could also see dropping any parenthetical strings ("\s\(\S.*\)") as a pre-processing step that is outside the scope of this work.
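
As a rough sketch of that pre-processing idea: the function name `strip_parenthetical` and the sample strings are hypothetical, and the pattern assumes that literal parentheses are intended, so they are escaped here.

```python
import re

# Whitespace followed by a trailing "(...)" note at the end of the value.
_TRAILING_PAREN = re.compile(r"\s+\(\S.*\)\s*$")


def strip_parenthetical(value: str) -> str:
    """Drop a trailing parenthetical note from a date string (hypothetical helper)."""
    return _TRAILING_PAREN.sub("", value)


# Hypothetical inputs, for illustration only.
print(strip_parenthetical("2018 (month and day unknown)"))  # 2018
print(strip_parenthetical("2018-03 (day unknown)"))         # 2018-03
print(strip_parenthetical("2018-03-15"))                    # 2018-03-15
```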

victorlin commented 2 years ago

@j23414 yeah, that would be 2018 and 2018-03 in the issue description examples.

The current support for those isn't in dates.py; it is a hidden feature of augur filter's subsampling logic, which only applies to the metadata date column during subsampling:

https://github.com/nextstrain/augur/blob/8014186d35c13d9f131cfa4828c6b7f81932909f/augur/filter.py#L935

As a path forward, we should aim to support the same date formats across different use cases via functions in dates.py.

huddlej commented 2 years ago

One way to think about incomplete dates like YYYY and YYYY-MM would be to standardize/sanitize these to YYYY-XX-XX and YYYY-MM-XX, respectively, early in a workflow. This could be part of the proposed work in #860.
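
A minimal sketch of that sanitizing step, assuming a hypothetical helper named `pad_incomplete_date` (this is not an existing Augur function, and it does not validate month or day ranges):

```python
import re

# Matches YYYY, YYYY-MM, or YYYY-MM-DD made up of digits only.
_INCOMPLETE = re.compile(r"(\d{4})(?:-(\d{2}))?(?:-(\d{2}))?")


def pad_incomplete_date(value: str) -> str:
    """Pad YYYY and YYYY-MM to YYYY-XX-XX and YYYY-MM-XX (hypothetical helper).

    Values that do not look like a numeric ISO-style date, including ones
    already masked with XX, are returned unchanged.
    """
    match = _INCOMPLETE.fullmatch(value.strip())
    if match is None:
        return value
    year, month, day = match.groups()
    return f"{year}-{month or 'XX'}-{day or 'XX'}"


print(pad_incomplete_date("2018"))        # 2018-XX-XX
print(pad_incomplete_date("2018-03"))     # 2018-03-XX
print(pad_incomplete_date("2018-03-15"))  # 2018-03-15
print(pad_incomplete_date("2018-XX-XX"))  # 2018-XX-XX
```

Running something like this early in a workflow would let downstream steps assume a single masked format instead of handling YYYY and YYYY-MM separately.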