theosanderson / taxonium

A tool for exploring very large trees in the browser
http://taxonium.org
GNU General Public License v3.0

Problem creating JSONL file with newick_to_taxonium #576

Open NicolaDM opened 7 months ago

NicolaDM commented 7 months ago

Hi, I have a 3M-tip tree that seems too large for the browser version of Taxonium, so I am trying to visualize it locally. I want to visualize mutations on the tree, and for this I have either a Nexus tree or Newick + TSV metadata. This worked fine for 1M-tip trees in the browser version of Taxonium. However, I now have to convert this to JSONL to run the Taxonium desktop app. When I run

newick_to_taxonium -i 3M_tree.tree -m 3M_metaData.tsv -o 3Mtree.jsonl -c mutationsInf,errors,Ns

I get the following error message:

File "/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py", line 1302, in _validate_usecols_names
    raise ValueError(
ValueError: Usecols do not match columns, columns expected but not found: ['strain']

Do certain columns have to be present in the metadata file to make this work? My data is from MAPLE, not UShER, so the headings of the metadata file are different. Is this an issue?

theosanderson commented 7 months ago

Hi @NicolaDM, if you are supplying metadata, we need to know how to match the metadata rows to the node names in the tree. This requires either that the node names are in a column called strain, or that you supply a --key_column myAlternativeColumnName parameter naming another column to use.
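
A quick way to see which column names a metadata file actually exposes is to read only its header row. The sketch below is a hypothetical helper using Python's standard library, not part of newick_to_taxonium; the function names are made up for illustration, and the example header is the one from the metadata file discussed in this thread.

```python
import csv
import io


def metadata_columns(tsv_text):
    """Return the column names from the header row of a TSV document."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return next(reader)


def has_key_column(tsv_text, key_column="strain"):
    """Check whether the column used to match tree node names is present."""
    return key_column in metadata_columns(tsv_text)


# Header taken from this thread; the single data row is only there for shape.
example = (
    "node\tcollapsedTo\tmutationsInf\tNs\terrors\n"
    "SRR11578335\t\t\t5297-5586,22878-23144\t\n"
)

print(has_key_column(example))          # False: there is no 'strain' column
print(has_key_column(example, "node"))  # True: 'node' can serve as the key column
```

If the default name is missing, the column that holds the node names is the one to pass via --key_column.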

NicolaDM commented 7 months ago

Thank you very much Theo, that works! Now I get this error though:

ValueError: Error: The key column 'node' contains non-unique values in the metadata file.

This happens even though the node names are unique in my file. Is it because names are not allowed to contain each other (e.g. a node name must not be a prefix of another node name)?

theosanderson commented 7 months ago

I really suspect your metadata file does have genuinely duplicated entries in the key column - there isn't any complex logic in the code on prefixes or anything. Feel free to email the file if helpful.
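
To check for genuinely duplicated entries independently of Taxonium, something like the following sketch could be used. It is a hypothetical standalone helper (the name duplicated_keys and the toy data are assumptions for illustration), and it counts the values in the key column with the standard library only.

```python
import csv
import io
from collections import Counter


def duplicated_keys(tsv_text, key_column):
    """Return a dict of key values that occur more than once, with their counts."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = Counter(row[key_column] for row in reader)
    return {key: n for key, n in counts.items() if n > 1}


# Toy metadata, invented for this example: node 'A' appears twice.
example = "node\tNs\nA\t1-2\nB\t\nA\t3-4\n"

print(duplicated_keys(example, "node"))  # {'A': 2}
```

For a real file, the text would come from reading the TSV from disk rather than from an inline string. An empty result means the key column really is unique and the problem lies elsewhere.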

theosanderson commented 7 months ago

Hi Nicola,

How are you making your TSV?

Here I have replaced the tabs with pipes for clarity:

node|collapsedTo|mutationsInf|Ns|errors
SRR11578335|||5297-5586,22878-23144||

You'll see that the first line has 4 pipes (tabs) but the second has 5 pipes (tabs). These should be equal in a normal TSV.
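
A mismatch like this can be detected by comparing the number of fields in each row with the number of header fields. This is a minimal standard-library sketch; find_ragged_rows is a hypothetical name, and the example reproduces the two lines above with real tabs.

```python
import csv
import io


def find_ragged_rows(tsv_text):
    """Return (line_number, field_count) for rows whose width differs from the header."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    expected = len(header)
    return [
        (line_no, len(row))
        for line_no, row in enumerate(reader, start=2)
        if row and len(row) != expected  # skip completely blank lines
    ]


# 4 tabs in the header (5 fields), 5 tabs in the data row (6 fields).
example = (
    "node\tcollapsedTo\tmutationsInf\tNs\terrors\n"
    "SRR11578335\t\t\t5297-5586,22878-23144\t\t\n"
)

print(find_ragged_rows(example))  # [(2, 6)]
```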

As a result, pandas assumes that SRR11578335 here is not the node column but an index column. I can fix this by setting index_col=False, which will probably cause less confusion in general, but you might also want to look at the TSV generation script.
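
The effect can be reproduced in isolation with pandas. The sketch below is not Taxonium's own code; it only illustrates how read_csv treats a file whose data rows have one more field than the header, and how index_col=False changes that. The second data row (SAMPLE_B) is made up for the example.

```python
import io

import pandas as pd

# Header has 5 column names; each data row has 6 fields because of a trailing tab.
# The second row ("SAMPLE_B") is invented for illustration.
malformed_tsv = (
    "node\tcollapsedTo\tmutationsInf\tNs\terrors\n"
    "SRR11578335\t\t\t5297-5586,22878-23144\t\t\n"
    "SAMPLE_B\t\t\t100-200\t\t\n"
)

# Default behaviour: the extra field makes pandas use the first field as the index,
# so every column name shifts one field to the right.
default_df = pd.read_csv(io.StringIO(malformed_tsv), sep="\t")

# index_col=False tells pandas not to use the first column as the index.
fixed_df = pd.read_csv(io.StringIO(malformed_tsv), sep="\t", index_col=False)

print(list(default_df.index))            # the sample names ended up in the index
print(default_df["node"].isna().all())   # 'node' now holds only empty values
print(fixed_df["node"].tolist())         # the sample names are back in 'node'
```

With the default settings every value in the node column is empty, which is why the key column looks non-unique even though the sample names themselves are distinct. Depending on the pandas version, the index_col=False call may emit a warning that the header length does not match the data.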

NicolaDM commented 7 months ago

I see - indeed I carelessly added an extra tab at the end of each non-title row. Thanks, I'll fix it!

NicolaDM commented 7 months ago

Indeed I confirm this fixed the problem and now it works - thanks again!

theosanderson commented 7 months ago

Great!

theosanderson commented 7 months ago

Reopening to consider a change to avoid the same confusion in future.