Closed matsen closed 1 year ago
I don't have permissions to look at your script, but yes, it's expected that if labels contain a nucleotide sequence, it's stored in a 'sequence' label attribute.
If input ete_trees
contain sequences stored in node attributes named ancestral
, then you'd be expected to do something like
dag = hdag.history_dag_from_etes(ete_trees, [], label_functions = {'sequence': (lambda n: n.ancestral)})
to explicitly map the ancestral
node attribute in the input trees to the sequence
label attribute.
I guess in skimming the quickstart I would have thought that
dag = hdag.history_dag_from_etes(ete_trees, ['ancestral'])
would have done the trick.
Can we drop the second argument and just use
dag = hdag.history_dag_from_etes(ete_trees)
?
I do want to welcome suggestions about how we can make this better, but I'm not sure I understand yet...
dag = hdag.history_dag_from_etes(ete_trees, ['ancestral'])
would create a hDAG with sequences contained in a label attribute with name ancestral
, while (as you pointed out) all the default weight counting functions assume sequences are stored in a sequence
attribute.
Node labels are namedtuples, and fields can contain arbitrary types of data. For instance, we could have node labels which are (sequence, abundance)
tuples...
The second argument to history_dag_from_etes
is a quick way to specify which node attributes should be preserved as label attributes, keeping the same name.
Your proposal would work, if we assumed that we always want the hDAG node labels to have a sequence
attribute, but then what do we do when we want to merge trees whose nodes don't have sequence attributes, to make an unlabeled hDAG?
OK, I made some suggestions in the PR that would have helped my understanding.
When I run the
play.py
script in/fh/fast/matsen_e/shared/hdag-linearham/2022-11-29-vrc256-linearham-run
, I get the error:This script uses
which I thought would mark the "ancestral" node as being the one that holds the sequence.
I think the problem is here:
https://github.com/matsengrp/historydag/blob/main/historydag/utils.py#L162
in which we wrap the hamming distance with a
sequence
accessor.It seems like we could either generalize this or just require the sequences to be marked with "sequence".