matsengrp / gctree

GCtree: phylogenetic inference of genotype-collapsed trees
https://matsengrp.github.io/gctree
GNU General Public License v3.0

Uniqueness of IDs not necessary #60

Closed fabio-t closed 1 year ago

fabio-t commented 2 years ago

With IDs used as abundances, this was a proper bug: two nodes with the same abundance would cause a crash. But it applies to normal IDs too: since uniqueness is determined by the actual sequence (the dictionary key), sequence names do not need to be unique. This is useful if, for example, I want the tree labels to show the amino acid CDR3, which will obviously be the same for many nodes.
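To illustrate why the names need not be unique, here is a minimal sketch of collapsing records into a dictionary keyed by sequence. It is not the gctree code: `collapse_by_sequence` and the record layout are hypothetical, and counting one unit of abundance per record is an assumption made only for this example.

```python
from collections import defaultdict


def collapse_by_sequence(records):
    """Group (name, sequence) pairs by sequence (hypothetical helper).

    The sequence is the dictionary key, so it alone decides uniqueness;
    the same name may appear on any number of distinct sequences.
    """
    genotypes = defaultdict(lambda: {"names": [], "abundance": 0})
    for name, seq in records:
        entry = genotypes[seq]
        entry["names"].append(name)
        # Assumption for this sketch: each record counts once.
        entry["abundance"] += 1
    return dict(genotypes)


# Three records sharing one label (e.g. a CDR3), two distinct sequences.
records = [("CARDY", "ACGT"), ("CARDY", "ACGA"), ("CARDY", "ACGT")]
genotypes = collapse_by_sequence(records)
```

Here the repeated label `CARDY` causes no collision, because nothing is ever looked up by name.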

One additional improvement would be to parse a header like ">seqname abundance=5", to have the best of both worlds: meaningful sequence IDs and their abundances.
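A rough sketch of what such header parsing could look like. The function name `parse_header`, the regular expression, and the default abundance of 1 when the tag is absent are all assumptions for illustration, not existing gctree behaviour.

```python
import re

# Matches an "abundance=<integer>" tag anywhere in the header (assumed format).
ABUNDANCE_RE = re.compile(r"\babundance=(\d+)\b")


def parse_header(header, default_abundance=1):
    """Split a FASTA header into (name, abundance) (hypothetical helper).

    The name is the first whitespace-delimited token after ">".
    If no abundance tag is present, default_abundance is returned.
    """
    text = header.lstrip(">").strip()
    name = text.split()[0] if text else ""
    match = ABUNDANCE_RE.search(text)
    abundance = int(match.group(1)) if match else default_abundance
    return name, abundance


tagged = parse_header(">seqname abundance=5")
untagged = parse_header(">seqname")
```

Headers without the tag still parse, so existing input files would keep working under this scheme.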

NB: this also includes a small fix in phylip_parse, which would otherwise always crash because it requires the output file to be opened in binary mode.
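For context, a sketch of the kind of failure involved, assuming the output is serialized with `pickle` (an assumption; the exact writer used in phylip_parse is not shown here). Under Python 3, `pickle.dump` writes bytes, so a file opened in text mode (`"w"`) raises a `TypeError`, while binary mode (`"wb"`) works. The helper names are hypothetical.

```python
import os
import pickle
import tempfile


def write_trees(trees, path):
    """Serialize trees to path; binary mode is required by pickle."""
    with open(path, "wb") as handle:
        pickle.dump(trees, handle)


def read_trees(path):
    """Load trees back; the file must also be read in binary mode."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


trees = [{"name": "CARDY", "abundance": 5}]  # placeholder data for the sketch
path = os.path.join(tempfile.mkdtemp(), "trees.p")
write_trees(trees, path)
loaded = read_trees(path)

# Text mode reproduces the crash: pickle hands bytes to a str-only file.
try:
    with open(path, "w") as handle:
        pickle.dump(trees, handle)
    text_mode_failed = False
except TypeError:
    text_mode_failed = True
```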

fabio-t commented 2 years ago

Hi @WSDeWitt, for my data this would not be optimal, while the code with the above fix is working fine for me. It's true that I lose the mapping in some way, but the final plots look fine.

I guess a good solution would be to check the header for the presence of abundance=X, just as is done downstream in the pipeline (I think gctree infer generates those files, but I'm not sure at the moment). If it's there, it becomes the abundance of that particular node.