theosanderson / taxonium

A tool for exploring very large trees in the browser
http://taxonium.org
GNU General Public License v3.0

Integer strains break metadata in newick_to_taxonium #600

Closed joel-winterton closed 1 month ago

joel-winterton commented 1 month ago

If each strain has an integer ID, so that the Newick tree has the form: ((1:0.4,(2:0.1,3:0.1):0.2):0.2,(4:0.1,5:0.1):0.3);

then using metadata in the following CSV format:

strain, location
1, 0
2, 1
3, 0
4, 1
5, 0

will cause newick_to_taxonium (and, when converted, usher_to_taxonium) to fail to assign metadata to nodes.

The cause is a type mismatch. When pandas loads the metadata CSV, it infers an integer dtype for the strain column because every value parses as an integer, whereas loading a Newick tree always yields the strain names as strings. An integer never compares equal to a string, so looking up a strain from the metadata finds no match in the tree.
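The mismatch can be reproduced with pandas alone. This is a minimal sketch, not the taxoniumtools code: the CSV is the one from this report, and `tip_names` is a hand-written stand-in (an assumption) for the leaf names a Newick parser would return.

```python
import io

import pandas as pd

# Metadata from the report; skipinitialspace drops the space after each comma.
csv_text = "strain, location\n1, 0\n2, 1\n3, 0\n4, 1\n5, 0\n"
metadata = pd.read_csv(
    io.StringIO(csv_text), index_col="strain", skipinitialspace=True
)

# Stand-in for the leaf names read from the Newick tree, which are strings.
tip_names = ["1", "2", "3", "4", "5"]

# pandas inferred an integer index, so no string tip name matches any row.
index_is_integer = pd.api.types.is_integer_dtype(metadata.index)
matched = set(metadata.index) & set(tip_names)

print(index_is_integer, matched)
```

The intersection is empty because the integer `1` and the string `"1"` are different keys.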

A temporary workaround is to rename one strain to a non-numeric name, which forces the column to be read as strings. However, this is a bit hacky for my case, since integer strain names are heavily embedded in my pipeline.

I've not managed to set up a local version of taxonium on my machine, but it should be fixable by changing how must_have_cols is parsed in read_metadata in taxoniumtools/src/taxoniumtools/utils.py, perhaps by adding metadata.index = metadata.index.astype(str) after line 25.
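The effect of the suggested cast can be sketched outside taxoniumtools. The helper name `load_metadata_with_string_index` is hypothetical and is not part of the library; it only assumes the metadata is indexed by the strain column, as the suggested line implies.

```python
import io

import pandas as pd


def load_metadata_with_string_index(handle, key_column="strain"):
    """Hypothetical helper: read metadata and force string index labels."""
    metadata = pd.read_csv(handle, index_col=key_column, skipinitialspace=True)
    # The cast proposed in this issue: integer IDs become "1", "2", ...
    metadata.index = metadata.index.astype(str)
    return metadata


csv_text = "strain, location\n1, 0\n2, 1\n3, 0\n4, 1\n5, 0\n"
metadata = load_metadata_with_string_index(io.StringIO(csv_text))

# Stand-in for the string leaf names read from the Newick tree.
tip_names = ["1", "2", "3", "4", "5"]
matched = [name for name in tip_names if name in metadata.index]

print(matched)
```

With the index cast to strings, every tip name finds its metadata row, and the other columns keep their inferred types.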

theosanderson commented 1 month ago

Thank you for reporting, investigating and suggesting the fix. I'll try to fix this soon.