Odd pandas error coming from augur export

sidneymbell commented 2 years ago

Current Behavior

augur export v2 \
  --tree tree.nwk \
  --metadata ../data/hcv-meta.tsv \
  --node-data ./branch_lengths.json ./traits.json \
  --lat-longs ../config/lat_longs.tsv \
  --auspice-config ../config/auspice_config.json \
  --output ./hcv.json

ERROR: DataFrame index must be unique for orient='index'.

Initial googling suggests that this might be related to pd.DataFrame.to_json, but I actually don't see any instances of this in the augur codebase. Just to be safe, I validated that my metadata file does have a unique index (per df.index.unique == True).

All other files are outputs of other augur subcommands, roughly following the zika tutorial, adapted for HCV. One note on the data: for whatever reason, many HCV samples in NCBI Virus don't have collection dates, so I opted to skip date metadata filtering, timetree inference, and clock rate IQR filtering.

Ultimate goal is really just to get a tree json with hcv genotype labels to use for lineage calling in nextclade.

Happy to keep debugging, but I'm hoping that someone's encountered this before?

Your environment: if running Nextstrain locally

mac
augur v18.0.0
installation / env notes: I ended up needing to nuke my nextstrain conda env, only to later discover that even on a completely clean, fresh conda installation, mamba install fails due to an inability to solve the environment (even overnight). So I'm currently operating in a virtualenv with everything pip or brew installed.

Thanks, y'all!

joverlee521 commented 2 years ago

Hi @sidneymbell,

I recently saw a similar error in a completely different pipeline that came from pd.DataFrame.to_dict. Within augur export v2, this error would be coming from reading the metadata.

https://github.com/nextstrain/augur/blob/e1e8bb155c25a2901bcd41423b0f67d14386b6b3/augur/export_v2.py#L999
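
For anyone wanting to see the failure in isolation, here is a minimal sketch of how pandas can produce this message. It does not use augur at all; the toy table and its values are made up, and the only assumption is that the metadata ends up indexed by an ID column that contains a repeated value.

```python
import pandas as pd

# Toy metadata (made up for illustration) with the strain "HCV_1" listed twice.
metadata = pd.DataFrame(
    {"strain": ["HCV_1", "HCV_2", "HCV_1"], "genotype": ["1a", "1b", "3a"]}
).set_index("strain")

# Index.is_unique is a property; it is False when any label repeats.
index_is_unique = metadata.index.is_unique
print("index unique?", index_is_unique)

message = None
try:
    # Converting to a dict keyed by index label requires unique labels.
    metadata.to_dict(orient="index")
except ValueError as err:
    message = str(err)
    print(f"ERROR: {message}")
```

Note that the uniqueness check only says something about the ID column if that column is actually the index. A freshly read table with the default integer index will report a unique index even when the ID column has repeats.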

read_metadata uses either strain or name as the index column depending on which one is available within your metadata file. Could you verify that there are no duplicate values in those columns?
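
One way to run that check is sketched below with only the standard library. The helper name `find_duplicate_ids` is hypothetical (it is not part of augur), the sample rows are invented, and trying `strain` before `name` is an assumption about the lookup order.

```python
import csv
import io
from collections import Counter


def find_duplicate_ids(tsv_file, id_columns=("strain", "name")):
    """Return (id_column, {value: count}) for ID values that occur more than once."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    header = reader.fieldnames or []
    # Use the first candidate column that is present in the header.
    id_column = next((c for c in id_columns if c in header), None)
    if id_column is None:
        raise KeyError(f"none of {id_columns} found in header {header}")
    counts = Counter(row[id_column] for row in reader)
    return id_column, {value: n for value, n in counts.items() if n > 1}


# Toy metadata standing in for a real file such as ../data/hcv-meta.tsv.
example = io.StringIO("strain\tgenotype\nHCV_1\t1a\nHCV_2\t1b\nHCV_1\t3a\n")
id_column, duplicates = find_duplicate_ids(example)
print(id_column, duplicates)
```

For a real file, pass an open handle instead, for example `with open(path, newline="") as fh: find_duplicate_ids(fh)`. If the metadata uses a custom ID column, pass it explicitly, as in `id_columns=("Virus name(s)",)`.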

sidneymbell commented 2 years ago

That'll do it 🤦‍♀️ . Thanks, @joverlee521 ! (I love when it's user error -- always the easiest to fix!)

victorlin commented 2 years ago

Hmm, I'm going to re-open this since it's an unhandled error and a better error message would be nice.
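
As a rough illustration of what a friendlier message could look like, here is a sketch of a pre-check that names the offending column and values. Everything in it is hypothetical: `DuplicateIdError`, `check_unique_ids`, and the message wording are invented for this example and are not augur's implementation.

```python
from collections import Counter


class DuplicateIdError(ValueError):
    """Hypothetical error type for this sketch; not part of augur."""


def check_unique_ids(ids, id_column, max_shown=5):
    """Raise a descriptive error if any identifier appears more than once."""
    duplicates = sorted(value for value, n in Counter(ids).items() if n > 1)
    if not duplicates:
        return
    shown = ", ".join(repr(value) for value in duplicates[:max_shown])
    extra = len(duplicates) - max_shown
    more = f" (and {extra} more)" if extra > 0 else ""
    raise DuplicateIdError(
        f"Metadata column {id_column!r} has {len(duplicates)} duplicated "
        f"value(s): {shown}{more}. Each row needs a unique identifier."
    )


error_text = ""
try:
    check_unique_ids(["HCV_1", "HCV_2", "HCV_1"], id_column="strain")
except DuplicateIdError as err:
    error_text = str(err)
    print(f"ERROR: {error_text}")
```

Running a check like this before the metadata is converted to a dictionary would let the tool report which column and which values are at fault, instead of surfacing the raw pandas message.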

victorlin commented 2 years ago

Re: installation notes

> I'm currently operating in a virtualenv with everything pip or brew installed.

This may work now, but it'll be hard to keep versions of various packages up to date. A couple questions:

What did you install with brew?
How did you install Auspice? It's not available via pip or brew.

> even on a completely clean, fresh conda installation, mamba install fails due to an inability to solve the environment (even overnight)

This doesn't sound right! I think I might've seen this a while ago but forgot how it got resolved. Could you paste the outputs of conda --version and mamba --version?

sidneymbell commented 2 years ago

@victorlin -- Yeah for sure. Conda version is 4.12.0; it's brand new after a full anaconda-clean and reinstallation of anaconda from dmg.

No mamba installation -- failed after many attempts, including letting it try to solve the environment and inspect dependency conflicts overnight. (No clue why it would even have any on a completely fresh conda installation).

Ended up doing

brew tap brewsci/bio
brew install mafft iqtree raxml fasttree

(per docs, which are beautiful btw)

I've got auspice installed from source from a while ago, although I actually just use auspice.us 99% of the time.

victorlin commented 2 years ago

I see. So Anaconda comes with a lot of bloat, which is why we use Miniconda in the install docs. I don't use Anaconda personally, but maybe it is trying to search in too many channels. You could try --override-channels since conda-forge alone is sufficient for Mamba installation:

conda install -n base -c conda-forge --override-channels mamba --yes

Also, 4.12.0 is not the latest version of Conda – look at the sidebar of the release notes page:

corneliusroemer commented 1 year ago

Just hit the same error. It would be great if the message contained better debugging tips for the end user:

rule export:
    input: results/tree.nwk, results/vmr.tsv
    output: auspice/taxonomy.json
    jobid: 0
    reason: Missing output files: auspice/taxonomy.json
    resources: tmpdir=/var/folders/qf/4kkcfypx0gbfb0t9336522_r0000gn/T

        augur export v2         --tree results/tree.nwk         --output auspice/taxonomy.json         --color-by-metadata "Genome composition" "Host source"         --metadata results/vmr.tsv         --metadata-id-column "Virus name(s)"

ERROR: DataFrame index must be unique for orient='index'.
[Mon Oct  9 06:29:56 2023]
Error in rule export:
    jobid: 0
    input: results/tree.nwk, results/vmr.tsv
    output: auspice/taxonomy.json
    shell:

        augur export v2         --tree results/tree.nwk         --output auspice/taxonomy.json         --color-by-metadata "Genome composition" "Host source"         --metadata results/vmr.tsv         --metadata-id-column "Virus name(s)"

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2023-10-09T062955.159249.snakemake.log

nextstrain / augur

Odd pandas error coming from augur export #1059
