ENH: Provide helpful error message when metadata file doesn't contain "strain" column

corneliusroemer commented 2 years ago

A lot of users seem to get the following type of error:

Job 3: Exporting data files for for auspice

    augur export v2 \
        --tree results/global/tree.nwk \
        --metadata data/metadata.tsv \
        --node-data results/global/branch_lengths.json results/global/nt_muts.json results/global/aa_muts.json results/global/subclades.json results/global/clades.json results/global/recency.json results/global/traits.json \
        --auspice-config my_profiles/covid/my_auspice_config.json \
        --include-root-sequence \
        --colors results/global/colors.tsv \
        --lat-longs defaults/lat_longs.tsv \
        --title 'Genomic epidemiology of novel coronavirus - Global subsampling' \
        --description my_profiles/covid/my_description.md \
        --output results/global/ncov_with_accessions.json 2>&1 | tee logs/export_global.txt

    Validating schema of 'results/global/aa_muts.json'...
    Traceback (most recent call last):
      File "/home/charbel/miniconda3/envs/nextstrain/bin/augur", line 10, in <module>
        sys.exit(main())
      File "/home/charbel/miniconda3/envs/nextstrain/lib/python3.8/site-packages/augur/__main__.py", line 10, in main
        return augur.run( argv[1:] )
      File "/home/charbel/miniconda3/envs/nextstrain/lib/python3.8/site-packages/augur/__init__.py", line 75, in run
        return args.__command__.run(args)
      File "/home/charbel/miniconda3/envs/nextstrain/lib/python3.8/site-packages/augur/export.py", line 22, in run
        return run_v2(args)
      File "/home/charbel/miniconda3/envs/nextstrain/lib/python3.8/site-packages/augur/export_v2.py", line 903, in run_v2
        node_data, node_attrs, node_data_names, metadata_names = parse_node_data_and_metadata(T, args.node_data, args.metadata)
      File "/home/charbel/miniconda3/envs/nextstrain/lib/python3.8/site-packages/augur/export_v2.py", line 863, in parse_node_data_and_metadata
        if node["strain"] in node_attrs: # i.e. this node name is in the tree
    KeyError: 'strain'

https://discussion.nextstrain.org/t/error-in-job-3-exporting-data-files-for-for-auspice/493/4

This comes up frequently on our forum and in emails sent to hello@nextstrain.org.

I think it would help users a lot if we raised a more informative error, so that they immediately know how to fix the problem.

Also, we don't seem to have documented the requirement that the metadata must contain a column called "strain" holding the strain names.

Both should be addressed.
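
To make the first point concrete, here is a minimal sketch of the kind of check I have in mind. It is not existing augur code: the names check_strain_column, read_header and MissingStrainColumnError are hypothetical, and the wording of the message is only a suggestion. The idea is to inspect the header of the metadata TSV before export touches node["strain"], and to say which columns were found and what is expected.

```python
import csv
import io


class MissingStrainColumnError(Exception):
    """Hypothetical exception raised when the metadata lacks the id column."""


def read_header(handle):
    """Return the column names from the first line of a tab-separated file."""
    reader = csv.reader(handle, delimiter="\t")
    return next(reader, [])


def check_strain_column(fieldnames, metadata_path, id_column="strain"):
    """Raise a descriptive error if `id_column` is not among `fieldnames`."""
    if id_column not in fieldnames:
        found = ", ".join(repr(name) for name in fieldnames) or "none"
        raise MissingStrainColumnError(
            f"The metadata file {metadata_path!r} has no {id_column!r} column "
            f"(columns found: {found}). Add a {id_column!r} column whose "
            f"values match the tip names in the tree."
        )


# Example: a metadata file that uses "name" instead of "strain".
header = read_header(io.StringIO("name\tdate\nA/1\t2020-01-01\n"))
try:
    check_strain_column(header, "data/metadata.tsv")
except MissingStrainColumnError as error:
    message = str(error)
```

With a check like this, the user would see the missing column named explicitly instead of a bare KeyError: 'strain'.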

corneliusroemer commented 2 years ago

Interestingly, when reading in a metadata file we accept either "name" or "strain" as the id column, but export then only accepts "strain". That's inconsistent.

Should we remove support for "name", or make export accept "name" so that it is in line with metadata_file.py? See: https://github.com/nextstrain/augur/blob/4b71e7d2f35c680c08488f691672bb60e24f5258/augur/util_support/metadata_file.py#L6-L12

huddlej commented 2 years ago

We do support searching for multiple arbitrary strain ids when reading in metadata with the read_metadata function in the io module. This function returns a data frame indexed by the first requested id column that exists in the input. As a result, the calling code can consume the data frame without needing to know what the name of the id column is.

An alternate solution to #906 is to use io.read_metadata in the export module instead of the current call to utils.read_metadata. We could cast the data frame to a dict to avoid changing other code in the module, or we could update the logic in parse_node_data_and_metadata to use the data frame. We should really deprecate the utils.read_metadata function anyway, since io.read_metadata was written to eventually replace it.
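
As a rough illustration of that behavior (not the actual io.read_metadata, which returns a pandas data frame), here is a standard-library sketch. The function name read_metadata_by_id and its id_columns parameter are hypothetical. It picks the first requested id column that exists in the input and returns a dict keyed by that id, so the calling code never needs to know whether the column was called "strain" or "name".

```python
import csv
import io


def read_metadata_by_id(handle, id_columns=("strain", "name")):
    """Index metadata rows by the first requested id column found in the header.

    Returns a dict mapping each id to a dict of the remaining columns.
    If an id occurs more than once, the last row wins.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    fieldnames = reader.fieldnames or []

    # Use the first requested id column that is present in the input.
    id_column = next((col for col in id_columns if col in fieldnames), None)
    if id_column is None:
        raise ValueError(
            f"None of the id columns {list(id_columns)} were found in the "
            f"metadata (columns found: {list(fieldnames)})."
        )

    records = {}
    for row in reader:
        record_id = row.pop(id_column)
        records[record_id] = row
    return records


# Example: the input uses "name", yet the caller only works with ids.
metadata = read_metadata_by_id(
    io.StringIO("name\tdate\tcountry\nA/1\t2020-01-01\tLebanon\n")
)
```

A dict of this shape would let parse_node_data_and_metadata look up records by tip name directly, which is the "cast to a dict" option; switching the module to use the data frame itself would be the larger change.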

ENH: Provide helpful error message when metadata file doesn't contain "strain" column #905