nextstrain / augur

Pipeline components for real-time phylodynamic analysis
https://docs.nextstrain.org/projects/augur/
GNU Affero General Public License v3.0
268 stars 128 forks source link

fix & test augur export for v2 (unified) JSONs #297

Closed jameshadfield closed 5 years ago

jameshadfield commented 5 years ago

Currently augur export works well for v1 JSONs (i.e. meta + tree files) but is generally broken for v2 (henceforth referred to as "unified") JSON production. This issue asks to get this functionality working for unified JSONs as currently consumed by auspice on the v2 branch: https://github.com/nextstrain/auspice/tree/v2. This functionality is urgently needed in order to better test auspice v2. Note that some unified JSON functionality in augur export is present, however it doesn't generate JSONs which validate against the schema (see below).

If auspice from the v2 branch is installed (e.g. npm install auspice@alpha -- see https://github.com/nextstrain/auspice/pull/720 for more deatils & background) then auspice convert is available which converts meta+tree JSONs to the unified JSON. This may be helpful to compare augur vs auspice approaches.

schema & config modifications

While there are multiple issues proposing slight revisions to the schema for the unified JSON, this issue asks it to be implemented as per the JSON schema defined here: https://github.com/nextstrain/augur/pull/277/files#diff-c021e1723b2195f3dada2595f3d70ebb. After it is working for this it'll be simple to add slight modifications as needed before the we release a new version.

As far as this issue is concerned, it's simplest to use the same (or similar) auspice config JSON. There are proposals to change this, however this discussion is best held somewhere else, with subsequent modifications added in after this issue has been closed.

User interface for augur export

@emmahodcroft and I talked a lot about this a few weeks ago and decided that it's desirable to have command line arguments available to replace a config file for simple use cases The command line arguments are currently implemented in augur master. For complicated cases it's necessary to use a JSON config file (see above), however the ability to use a config file is not yet implemented in augur.

emmahodcroft commented 5 years ago

@jameshadfield With regard to the error in augur export V2 legend order:

The new unified JSON schema changes how colouring is specified. In V1 JSONs color_map is a list of lists: [ ["<1y", "#d7191c"], ["1-5y", "#fdae61"] .... ] In V2 JSONs, scale is a dict: { "1-5y": "#fdae61", "18-64y": "#abd9e9", ... }

So this is where the ordering problem comes from. As we've found before, though dicts technically do not have an order, it looks like through augur export V2 the order is actually maintained - until the JSON is written out.

This is because in utils.py write_json, json.dump has the flag sort_keys set to True. (And it seems to do this alphabetically.) (This doesn't affect V1 legends because it's a list, I guess.)

Setting this to False fixes the legend ordering problem, but messes up the overall JSON order, so tree is above all the colourings, authors, etc (which personally I find much less useful - currently it's easy to check on your metadata stuff without scrolling past a huge tree). However, we could probably fix this in export by just adding things to unified in the order we want, rather than as we process them.

So one solution would be to add an argument to utils.py write_json so that sort_keys can be set False and use this in augur export V2. But, we should think about what else this will impact.

What think you?

tsibley commented 5 years ago

@emmahodcroft It's generally unreliable to rely on the order of keys in a dictionary when crossing languages, regardless of what order they appear in when serialized to JSON. Different languages make different guarantees about preserving key ordering. I think the simplest thing would be to keep the scale as a list of lists, which is a standard and reasonable approach to cross-language ordered key-value pairs. This also allows arbitrary user-defined ordering rather than forcing an ASCII-betical sort. @jameshadfield What motivated the change from a list of lists to a dict? Could we keep it as a list of lists?

jameshadfield commented 5 years ago

I don't see a problem with returning to the v1 schema here (list of lists). Can't remember anything important about why I proposed turning it into a dict.

kairstenfay commented 5 years ago

FYI @joverlee521 and I just finished going through the augur export process for the unified JSONs. Here is what my process looked like. @jameshadfield asked me to note any 'WTF' moments, which are tagged with WTF below:

Kairsten's augur export process

With help from Jover

  1. Started with the zika-colombia repo. Ran nextstrain build .

  2. When running augur export and seeing the required parameters in the help message, I wanted to run

    augur export -t auspice/zika-colombia_tree.json --metadata auspice/zika-colombia_meta.json
  3. When that didn't work, looked at the Snakefile.

  4. Tried

    augur export \
    --tree results/tree.nwk \
    --metadata results/metadata.tsv \
    --node-data results/branch_lengths.json \
                results/traits.json \
                results/nt_muts.json \
                results/aa_muts.json

    and received the message ERROR: config file None not found. as well as a TypeError: expected str, bytes or os.PathLike object, not NoneType

  5. Switched to zika-tutorial. Ran nextstrain build . there, too.

  6. Copied and ran the following code from the tutorial:

    augur export \
    --tree results/tree.nwk \
    --metadata data/metadata.tsv \
    --node-data results/branch_lengths.json \
                results/traits.json \
                results/nt_muts.json \
                results/aa_muts.json

    Got the same errors as above. WTF

  7. Ran the code from step 4 again in zika-colombia but with the flag --new-schema (thanks to Jover's investigation).

    Got the following errors: ERROR: file with states does not exist TypeError: expected str, bytes or os.PathLike object, not NoneType

  8. Ran the code from step 6 again in zika-tutorial but with the flag --new-schema and got this warning plus error:

    WARNING: Contradictory author information: {'authors': 'Ho et al', 'title': 'Outbreak of Zika virus infection in Singapore: an epidemiological, entomological, virological, and clinical analysis', 'journal': 'Lancet Infect Dis (2017) In press', 'url': 'https://www.ncbi.nlm.nih.gov/nuccore/KY241688'} vs {'authors': 'Ho et al', 'title': 'Outbreak of Zika virus infection in Singapore: an epidemiological, entomological, virological, and clinical analysis', 'journal': 'Lancet Infect Dis (2017) In press', 'url': 'https://www.ncbi.nlm.nih.gov/nuccore/KY241726'}
    ...

    TypeError: expected str, bytes or os.PathLike object, not NoneType

    WTF

  9. Back to zika-colombia: Jover says I need an output file. Ran the following code:

    augur export --tree results/tree.nwk --metadata data/metadata.tsv --node-data results/branch_lengths.json results/traits.json results/nt_muts.json results/aa_muts.json --new-schema --output-main out.json

    and received the error: ERROR: file with states does not exist

    but out.json was still generated.

    WTF

    Jover says it looks good but I'm missing metadata.

  10. Ran augur export -h.

    Tried to match the augur options with what was listed in the config/auspice_config.json.

    Note: maintainers and maintainer-urls is split in augur but not in the auspice config. WTF

    geography-traits? WTF Matched with geo in auspice config.

    extra-traits: skipped WTF

    panels: gave nothing b/c of default

    tree-name: skipped WTF

    Ended up running the code in step 8 (in zika-tutorial) plus:

    --title 'Real-time tracking of Zika virus \ evolution'
    --maintainers "Me" \
    --maintainer-urls 'google.com' \
    --geography-traits 'country' 'region'
  11. Moved onto the augur validate process. First ran augur validate out.json. The JSON file is not a positional argument WTF.

    Then, ran augur validate -h. WTF is the nextflu schema?

    Ran: augur validate --json out.json --new-schema

    Received:

    Validating out.json using the main schema (version new)... SUCCESS
    Validating that out.json is internally consistent...
    SUCCESS
jameshadfield commented 5 years ago

Now working in the v6 branch -- with bug fixes etc to be expected