nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0
268 stars 129 forks source link

Update help docs for augur curate format-dates #1651

Open anna-parker opened 6 days ago

anna-parker commented 6 days ago

Current Behavior

Hi! I am running augur curate format-dates, but augur is unable to format any of the dates.

WARNING: Unable to format dates for the following (record, field, date string):
(11, 'date', '1987')
(12, 'date', '1999-05')
...

Expected behavior

I would expect augur to format the dates as per documentation: 1987 should become 1987-XX-XX, '1999-05' should become 1999-05-XX.

How to reproduce

Steps to reproduce the current behavior:

  1. I tried to attach a downsampled tsv file with only the two fields I displayed above - this didn't work - so I added the small example to my branch: https://github.com/anna-parker/marburg-virus-tree/blob/curate_dates/example.tsv, you can download this and then run
  2. run
    augur curate format-dates --metadata example.tsv --date-fields "date" --output-metadata example_output.tsv --failure-reporting warn
  3. See that in example_output.tsv both dates have be formatted as XXXX-XX-XX.

Your environment: if running Nextstrain locally

joverlee521 commented 6 days ago

Hi @anna-parker, sorry if the documentation is not clear here!

You will need to provide the --expected-date-formats option for custom date formats that do not match the current defaults ['%Y-%m-%d', '%Y-%m-XX', '%Y-XX-XX', 'XXXX-XX-XX'].

If you update the command to include --expected-date-formats '%Y' '%Y-%m', it should parse the dates as expected. Using your example data, I was able to run the following without any warnings:

augur curate format-dates \
    --metadata example.tsv \
    --date-fields "date" \
    --expected-date-formats '%Y' '%Y-%m' \
    --output-metadata example_output.tsv \
    --failure-reporting warn

Please let us know if this doesn't work for you and we can investigate further.

corneliusroemer commented 6 days ago

Thanks @joverlee521 for explaining where the issue lies. I hope you don't mind if I reopen this issue as while not a bug, it's a documentation issue that hasn't yet been resolved afaict:

The first line of augur curate docs makes users believe that 2023 -> 2023-XX-XX works by default:

Brave Browser 2024-10-16 19 13 43

https://docs.nextstrain.org/projects/augur/en/stable/usage/cli/curate/format-dates.html

Looking further down the docs, it's strange that --expected-date-formats is listed under the "REQUIRED" header if it has defaults set and hence isn't actually required. It's only required in the sense that it's a no-op if the argument isn't passed.

Brave Browser 2024-10-16 19 13 46

So given how it's documented, one should either remove the defaults so that it's actually required, or move it to optional, or change the defaults.

corneliusroemer commented 6 days ago

There's one more oddity in the docs: Default being stated twice, once as true and once as false

Brave Browser 2024-10-16 19 25 40

It might be nice to mention a standard usage as well, something like this which was generated by ChatGPT after reading the docs:

augur curate format-dates --metadata metadata.tsv --output curated_metadata.tsv --date-fields date_column --expected-date-formats "%Y" "%Y-%m" "%Y-%m-%d"
joverlee521 commented 5 days ago

The first line of augur curate docs makes users believe that 2023 -> 2023-XX-XX works by default:

Ah, I see that now. Is it clearer if updated to

Screenshot 2024-10-16 at 10 55 42 AM

So given how it's documented, one should either remove the defaults so that it's actually required, or move it to optional, or change the defaults.

I missed this when I added the default values in https://github.com/nextstrain/augur/pull/1501. I'll move --expected-date-formats down to the optional section.

Default being stated twice, once as true and once as false

The --no-mask-failure oddity is noted in https://github.com/nextstrain/augur/issues/1585 and is an Augur-wide issue that will be resolved separately.

corneliusroemer commented 5 days ago

Thanks @joverlee521! Those edits are great! What I don't yet understand is what format-dates does if run with default values. Does it error if any of the dates are not one of the expected default formats?

joverlee521 commented 5 days ago

What I don't yet understand is what format-dates does if run with default values. Does it error if any of the dates are not one of the expected default formats?

If only run with default values it will just be no-op and will error if the dates do not match the default formats.

Hmm, maybe it would be better to just mark the --expected-date-formats option as explicitly required even though it has default values.

anna-parker commented 5 days ago

@joverlee521 thanks for looking into this! I think the screenshot you provided already makes this much clearer

Screenshot 2024-10-16 at 10 55 42 AM