vlshchur / argentum


Tree sequence output format? #4

Open hyanwong opened 5 years ago

hyanwong commented 5 years ago

I'm just wondering if you might think about outputting the tcPBWT in tree sequence format (also for output from https://github.com/nvalimak/argentum)? There might be problems with cyclical structures in the ARG, but I can imagine ways around that. It would be interesting to see if tree sequences give roughly the same compression ratio as the PBWT.

We might be able to help you with this if you would be interested. It would also allow fast random access to your trees.

vlshchur commented 5 years ago

This version https://github.com/nvalimak/argentum can output the ARG by identifying shared nodes of coalescent trees (with --enumerate). The details of the output are specified in the document guide_time_estimate.pdf (it is in that repo too). As far as I understand, it should be pretty straightforward to convert it to a tree sequence, shouldn't it?
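To make the conversion idea concrete, here is a rough sketch of how per-tree parent links with shared node IDs could be merged into tree-sequence-style edges, where each edge is a row (left, right, parent, child) covering a genomic interval. The input layout used here (a list of `(left, right, parents)` triples) and the function name `trees_to_edges` are assumptions for illustration only; they are not the format described in guide_time_estimate.pdf, which would need its own parser.

```python
from typing import Dict, List, Tuple

# Hypothetical input: one (left, right, parents) triple per local tree, where
# parents maps child node ID -> parent node ID and IDs are shared across trees.
LocalTree = Tuple[float, float, Dict[int, int]]
Edge = Tuple[float, float, int, int]  # (left, right, parent, child)


def trees_to_edges(trees: List[LocalTree]) -> List[Edge]:
    """Merge identical parent-child links in adjacent trees into single edges."""
    edges: List[Edge] = []
    open_edges: Dict[Tuple[int, int], float] = {}  # (parent, child) -> left end
    prev_right = None
    for left, right, parents in sorted(trees, key=lambda t: t[0]):
        current = {(p, c) for c, p in parents.items()}
        # If the trees are not contiguous, close every open edge at the gap.
        if prev_right is not None and left != prev_right:
            for (p, c), start in open_edges.items():
                edges.append((start, prev_right, p, c))
            open_edges = {}
        # Close edges whose parent-child link is absent from this tree.
        for key in list(open_edges):
            if key not in current:
                p, c = key
                edges.append((open_edges.pop(key), left, p, c))
        # Open edges for links that are new in this tree.
        for key in current:
            open_edges.setdefault(key, left)
        prev_right = right
    # Close whatever is still open at the end of the last tree.
    for (p, c), start in open_edges.items():
        edges.append((start, prev_right, p, c))
    return sorted(edges)
```

A link that persists across neighbouring trees becomes one long edge, which is where any compression relative to storing every tree separately would come from. Node times and the handling of cycles are not covered by this sketch.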

Regarding the implementation in this repository, I am not sure I will have time to do it in the near future, though it would be great to have it at some point. Thank you for your suggestion! I will at least try to implement output as prune-and-regraft operations over a planar tree (as described in the manuscript) in a week or two.

hyanwong commented 5 years ago

Thanks. I'll check out the other version and see if it is easy to convert the shared-nodes format. Am I right in thinking that there might be cycles, though?

vlshchur commented 5 years ago

Yes, cycles are possible. We've never implemented a version without cycles (it would have N log(N) complexity where N is the number of sequences).
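Cycles matter for the conversion because a tree sequence requires every parent node to be strictly older than its child, so the combined parent-child links from all trees have to form a directed acyclic graph before node times can be assigned. As a minimal sketch (not part of argentum), one way to check this is Kahn's topological sort over the union of links; the function name `has_cycle` and the `(parent, child)` pair input are assumptions made for illustration.

```python
from collections import defaultdict, deque
from typing import Iterable, Tuple


def has_cycle(links: Iterable[Tuple[int, int]]) -> bool:
    """Return True if the directed graph of (parent, child) links has a cycle."""
    children = defaultdict(set)
    indegree = defaultdict(int)
    nodes = set()
    for parent, child in links:
        nodes.update((parent, child))
        # Count each distinct link once, even if it occurs in many trees.
        if child not in children[parent]:
            children[parent].add(child)
            indegree[child] += 1
    # Start from nodes that are never a child (the roots).
    queue = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    # Nodes left unvisited are on, or below, a cycle.
    return visited != len(nodes)
```

For example, `has_cycle([(2, 3), (3, 2)])` reports a cycle because node 2 is above node 3 in one tree and below it in another. This check runs in time linear in the number of nodes and links; it only detects cycles and does not resolve them.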