Open benjeffery opened 2 years ago
Also, support for unary nodes would be good (but I suspect this will happen automatically)
Also, support for unary nodes would be good (but I suspect this will happen automatically)
Yes, thanks! Added to the list. @hyanwong @szhan do you have anything to add?
I made this point somewhere else, but the ability to parse into a table collection would mean that we could cope with zero-length (or even negative-length) branches.
Also, I'm not sure if we want to worry about trees beginning with [&U]
and [&R]
. We only "do" rooted trees in tskit, AFAIK.
I made this point somewhere else, but the ability to parse into a table collection would mean that we could cope with zero-length (or even negative-length) branches.
What would one then do with the table collection? There's currently no way to get to a Tree
without satisfying tskit's node time constraints, right?
What would one then do with the table collection?
I would run through the node times and adjust them to make a valid TS, while adding stuff to the metadata to say what I had done.
What would one then do with the table collection?
I would run through the node times and adjust them to make a valid TS, while adding stuff to the metadata to say what I had done.
Great - was checking we didn't need to add a way to get to the quintuple array representation without the integrity constraints. Now you say it I remember you mentioning the fixing-and-recording.
It should be easy to find the "bad" nodes by doing
bad_edges = np.where(tables.nodes.time[tables.edges.parent] <= tables.nodes.time[tables.edges.child])[0]
right? No need to traverse the tree. Fixing them might be more difficult, however, although zero length branches can probably be fixed by adding an epsilon or using last after
, as we do in split_polytomies
Yes good idea - I think that fixing non-conformant tree sequences sounds like a job for a separate function/set of functions though and not part of this work.
Yes, 100%
Incidentally, what is the current status of tskit's Newick parsing capabilities, @benjeffery? Do we have (relatively) fast parsing in a released version yet?
Still a proof of concept I'm afraid! It's on the medium term Todo list...
A quick additional note here: I wonder if we should allow nodes without branch lengths in the Newick file, and then set the time_units
to uncalibrated. We could have a parameter from_newick(missing_length=X)
which specifies the value to use if the branch length is missing. Perhaps if None
we error out on missing branch lengths. missing_length=0
could also be allowed, to create an invalid tree sequence but a valid set of tables (see https://github.com/tskit-dev/tsconvert/issues/36#issuecomment-1086181119).
This would also be useful for the "introduction to tskit for phylogeneticists" tutorial (https://github.com/tskit-dev/tutorials/issues/163)
Our current newick parsing relies on the
newick
python library, which for trees we've tested is ~120x slower than parsing with a common (C extension)R
library. My initial rough experiments with a pure-python one-pass pre-allocated state machine in https://github.com/tskit-dev/tsconvert/issues/36 are very promising, timing at 1-2x the R library. A C implementation could follow if we see the benefit and the Python implementation is amenable to conversion.Requirements:
individuals
?)Support for unicode is not required.