Closed fabio-t closed 1 year ago
Hi @WSDeWitt, for my data this would not be optimal, while the code with the above fix is working fine for me. It's true that I lose the mapping in some way, but the final plots look fine.
I guess a great solution would be to check the header for the presence of `abundance=X`, just like it's done downstream in the pipeline (I think gctree infer generates those files, but I'm not sure at the moment). If it's there, then this becomes the abundance of that particular node.
In the id-as-abundance case, this was a proper bug: if you had two nodes with the same abundance, it would crash. But it also extends to normal IDs: since it's the actual sequence that determines uniqueness (it is the dictionary key), the sequence name can be ambiguous. Allowing ambiguous names is useful if, for example, I want the labels in the tree to show the amino acid CDR3, which will obviously be the same for many nodes.
One additional improvement would be to parse a header like ">seqname abundance=5", to have the best of both worlds: meaningful sequence IDs and their abundances.
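To make the header idea concrete, here is a rough sketch of how such a header could be split into a name and an abundance. The function name `parse_header`, the regular expression, and the fallback abundance of 1 are all my assumptions for illustration, not what the pipeline currently does:

```python
import re

# Hypothetical pattern: "abundance=<integer>" anywhere in the header line.
_ABUNDANCE_RE = re.compile(r"\babundance=(\d+)\b")


def parse_header(header, default_abundance=1):
    """Split a FASTA-style header into (name, abundance).

    The name is the first whitespace-delimited token. If no
    "abundance=N" field is present, default_abundance is returned
    (the default of 1 is an assumption, not the pipeline's behaviour).
    """
    header = header.lstrip(">").strip()
    match = _ABUNDANCE_RE.search(header)
    abundance = int(match.group(1)) if match else default_abundance
    name = header.split()[0] if header else ""
    return name, abundance


# Names may repeat across sequences (e.g. a shared CDR3 label),
# because uniqueness comes from the sequence itself, not the name.
name, abundance = parse_header(">seqname abundance=5")
```

With this, the name stays free to be anything meaningful, and the abundance only overrides the default when the field is actually there.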
NB: this also includes a small fix in `phylip_parse`, whereby the script would always crash, because it requires the output file to be open in binary mode.