matsengrp / historydag

https://matsengrp.github.io/historydag
GNU General Public License v3.0
0 stars 1 forks source link

does historydag implicitly assume that a tree's sequences are named `sequence`? #43

Closed matsen closed 1 year ago

matsen commented 1 year ago

When I run the play.py script in /fh/fast/matsen_e/shared/hdag-linearham/2022-11-29-vrc256-linearham-run, I get the error:

AttributeError: 'Label' object has no attribute 'sequence'

This script uses

dag = hdag.history_dag_from_etes(ete_trees, ["ancestral"])

which I thought would mark the "ancestral" node as being the one that holds the sequence.

I think the problem is here:

https://github.com/matsengrp/historydag/blob/main/historydag/utils.py#L162

in which we wrap the hamming distance with a sequence accessor.

It seems like we could either generalize this or just require the sequences to be marked with "sequence".

willdumm commented 1 year ago

I don't have permissions to look at your script, but yes, it's expected that if labels contain a nucleotide sequence, it's stored in a 'sequence' label attribute.

If input ete_trees contain sequences stored in node attributes named ancestral, then you'd be expected to do something like

dag = hdag.history_dag_from_etes(ete_trees, [], label_functions = {'sequence': (lambda n: n.ancestral)})

to explicitly map the ancestral node attribute in the input trees to the sequence label attribute.

matsen commented 1 year ago

I guess in skimming the quickstart I would have thought that

dag = hdag.history_dag_from_etes(ete_trees, ['ancestral'])

would have done the trick.

Can we drop the second argument and just use

dag = hdag.history_dag_from_etes(ete_trees)

?

willdumm commented 1 year ago

I do want to welcome suggestions about how we can make this better, but I'm not sure I understand yet...

dag = hdag.history_dag_from_etes(ete_trees, ['ancestral'])

would create a hDAG with sequences contained in a label attribute with name ancestral, while (as you pointed out) all the default weight counting functions assume sequences are stored in a sequence attribute.

Node labels are namedtuples, and fields can contain arbitrary types of data. For instance, we could have node labels which are (sequence, abundance) tuples...

The second argument to history_dag_from_etes is a quick way to specify which node attributes should be preserved as label attributes, keeping the same name.

Your proposal would work, if we assumed that we always want the hDAG node labels to have a sequence attribute, but then what do we do when we want to merge trees whose nodes don't have sequence attributes, to make an unlabeled hDAG?

matsen commented 1 year ago

OK, I made some suggestions in the PR that would have helped my understanding.