niemasd / TreeSwift

TreeSwift: Fast tree module for Python 3
https://niema.net/TreeSwift
GNU General Public License v3.0
75 stars 14 forks source link

parsing nexus files with ancestral state reconstruction #23

Closed pekarj closed 2 years ago

pekarj commented 2 years ago

Hey Niema,

This is a bit of a followup on the last few updates for TreeSwift. I did ancestral state reconstruction using BEAST, creating a nexus file with trees that include a sequence at every node. It doesn't seem like TreeSwift is able to read in the resulting file though. Any chance TreeSwift can be updated to read such a file in (and also include the ASR as a node attribute)?

ancestralStateReconstruction.nexus.zip

niemasd commented 2 years ago

Hey! I tried opening it, and I think it seems to load properly? The sequences of the internal nodes seem to be a part of the edge_params attribute. For example, the following code snippet:

from treeswift import read_tree_nexus
nex = read_tree_nexus("test/ancestralStateReconstruction.nexus")
print(nex['STATE_1600'].root.edge_params)

prints the following:

&oct21update_167vir_wgal_2_reg01.subsampled="ATTTAGTTATTACTTATACTGCGTGCGTGCACGAAGCATGCAGGCGAGTGTTAGCCACACTGATTTTAAAGTTCGTTTAGAGAACAGATCAACAAGAGATCGAGAGTTGATATTAGGTTTTTACAAATACCTTCCCAGGTAATAAAACCAACCAACTCTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCGCTTGGCTGTATGCCTAGTGCACCCACCCAGTATAATTAATAATAAACTTTACTGTCGTTGACAGGAAACGAGTAACTCGTCTATCTTCTGCAGACTGCTTACGGTTTCGTCCGTGTCGCAGTCGATCATCAGCATATCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTGCCTGGTATCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTTCTTCAGGTTCGAGACGTGCTAGTACGTGGCTTTGGAGACTCTGTGGAAGAGGCCCTATCGGAGGCACGTCAACATCTTCAAGATGGCACTTGTGGTTTAGTAGAGCTTGAAAAAGGTGTATTGCCCCAACTTGAACAGCCCTATGTGTTCATTAAACGTTCTGATGCTCTAAACGCACATCACGGCCATGTAATGGTTGAATTGGTAGCAGAACTAGATGGCATTCAGTACGGTCGTAGTGGTGAGACTCTTGGAGTTCTTGTACCCCATGTGGGTGAAAACCCTATTGCGTACCGCAAAGTTCTTCTTCGTAAGAATGGTAATAAGGGAGCTGGTGGCCATAGCTATGGCACCGATCTAAAGTCTTATGACTTAGGTGACGAGCTTGGCACTGATCCCATTGAAGATTTTCAAGAAAACTGGAACACTAAACATGGCAGTGGTGCTGGTCGTGAGCTCATCCGTGAGCTTAATGGAGGCGTAGACACTCGCTATGTCGATAACAACTTCTGTGGCCCTGATGGCTACCCTCTTGAGTGCATTAAAGACCTTCTAGCTCGTGCTGGTAAGGCTTT",region="Yunnan"

So given a node, I can extract the sequence as follows:

node.seq = node.edge_params.split('=')[1].split(',')[0].strip('"')

So for a given tree, to label all nodes accordingly:

for node in nex['STATE_1600'].traverse_preorder():
    node.seq = node.edge_params.split('=')[1].split(',')[0].strip('"')

Can you let me know if that works for you?

pekarj commented 2 years ago

Ah, I was mistaken on the reading in part -- I misinterpreted the output. However, I only seem to get the sequence for the root node. If I do what you wrote above for STATE_1600, and then print each seq:

for node in nex['STATE_1600'].traverse_preorder():
    print(node.seq)

I get the following:

ATTTAGTTATTACTTATACTGCGTGCGTGCACGAAGCATGCAGGCGAGTGTTAGCCACACTGATTTTAAAGTTCGTTTAGAGAACAGATCAACAAGAGATCGAGAGTTGATATTAGGTTTTTACAAATACCTTCCCAGGTAATAAAACCAACCAACTCTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCGCTTGGCTGTATGCCTAGTGCACCCACCCAGTATAATTAATAATAAACTTTACTGTCGTTGACAGGAAACGAGTAACTCGTCTATCTTCTGCAGACTGCTTACGGTTTCGTCCGTGTCGCAGTCGATCATCAGCATATCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTGCCTGGTATCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTTCTTCAGGTTCGAGACGTGCTAGTACGTGGCTTTGGAGACTCTGTGGAAGAGGCCCTATCGGAGGCACGTCAACATCTTCAAGATGGCACTTGTGGTTTAGTAGAGCTTGAAAAAGGTGTATTGCCCCAACTTGAACAGCCCTATGTGTTCATTAAACGTTCTGATGCTCTAAACGCACATCACGGCCATGTAATGGTTGAATTGGTAGCAGAACTAGATGGCATTCAGTACGGTCGTAGTGGTGAGACTCTTGGAGTTCTTGTACCCCATGTGGGTGAAAACCCTATTGCGTACCGCAAAGTTCTTCTTCGTAAGAATGGTAATAAGGGAGCTGGTGGCCATAGCTATGGCACCGATCTAAAGTCTTATGACTTAGGTGACGAGCTTGGCACTGATCCCATTGAAGATTTTCAAGAAAACTGGAACACTAAACATGGCAGTGGTGCTGGTCGTGAGCTCATCCGTGAGCTTAATGGAGGCGTAGACACTCGCTATGTCGATAACAACTTCTGTGGCCCTGATGGCTACCCTCTTGAGTGCATTAAAGACCTTCTAGCTCGTGCTGGTAAGGCTTT
0.002531541819940809
0.002531541819940809
0.002531541819940809
0.002531541819940809
0.002531541819940809
0.002531541819940809
0.002531541819940809
0.002531541819940809
0.002531541819940809
0.002531541819940809
0.002531541819940809
0.002531541819940809
0.002531541819940809
0.002531541819940809
0.002531541819940809
0.002531541819940809
0.002531541819940809
0.002531541819940809
0.002531541819940809
0.002531541819940809
0.002531541819940809
0.002531541819940809
0.002531541819940809
0.002531541819940809
0.002531541819940809
0.002531541819940809
0.002531541819940809
0.002531541819940809
0.002531541819940809
0.002531541819940809
0.002531541819940809
0.002531541819940809
0.002531541819940809
0.002531541819940809
0.002531541819940809
0.002531541819940809
0.002531541819940809
0.002531541819940809
pekarj commented 2 years ago

Also, if I then try to save a given tree to a newick file, I get the following:

[&R] (EPI_ISL_852604:1.6357507323841585[&default.rate=0.0014477,region.rate=1.0],(((((('MG772933.1':1.1728750371368921[&default.rate=0.0014477,region.rate=1.0],(('MZ937000.1':0.668197515923221[&default.rate=0.0014477,region.rate=1.0],('OK017803.1':0.11244129651179363[&default.rate=0.0014477,region.rate=1.0],'OK017805.1':0.18915362527893748[&default.rate=0.0014477,region.rate=1.0]):1.132491004949265[&default.rate=0.0014477,region.rate=1.0]):1.3790022185450421[&default.rate=0.0014477,region.rate=1.0],'MT121216.1':0.7773591775403883[&default.rate=0.0014477,region.rate=1.0]):2.5489405171307915[&default.rate=0.0014477,region.rate=1.0]):0.8084992611541111[&default.rate=0.0014477,region.rate=1.0],'MT040334.1':1.8964427914417872[&default.rate=0.0014477,region.rate=1.0]):0.021448171535575433[&default.rate=0.0014477,region.rate=1.0],((('MW251308.1':0.752610331468559[&default.rate=0.0014477,region.rate=1.0],'MZ081381.1':0.6843043205397228[&default.rate=0.0014477,region.rate=1.0]):0.8620467540602199[&default.rate=0.0014477,region.rate=1.0],'MN908947.3':1.1337378550475021[&default.rate=0.0014477,region.rate=1.0]):0.30419404079068846[&default.rate=0.0014477,region.rate=1.0],('OK017806.1':1.565754161119145[&default.rate=0.0014477,region.rate=1.0],'MZ937003.1':0.8410367421505784[&default.rate=0.0014477,region.rate=1.0]):1.1351914333492206[&default.rate=0.0014477,region.rate=1.0]):3.4635207109748003[&default.rate=0.0014477,region.rate=1.0]):0.9690864609635517[&default.rate=0.0014477,region.rate=1.0],'MG772934.1':1.3828678348998782[&default.rate=0.0014477,region.rate=1.0]):1.7353311055214977[&default.rate=0.0014477,region.rate=1.0],('MZ937002.1':3.718732522061596[&default.rate=0.0014477,region.rate=1.0],(('MZ937001.1':0.852162077262772[&default.rate=0.0014477,region.rate=1.0],EPI_ISL_406801:0.3494298368256459[&default.rate=0.0014477,region.rate=1.0]):2.058149883507049[&default.rate=0.0014477,region.rate=1.0],(EPI_ISL_412977:1.0899278494417612[&default.rate=0.0014477,region.rate=1.0],'MW703458.1':0.191297712455349[&default.rate=0.0014477,region.rate=1.0]):0.7861749639367939[&default.rate=0.0014477,region.rate=1.0]):0.8084205612917748[&default.rate=0.0014477,region.rate=1.0]):4.417237209586674[&default.rate=0.0014477,region.rate=1.0]):2.471512232031648[&default.rate=0.0014477,region.rate=1.0],('KF294457.1':0.18554927299601687[&default.rate=0.0014477,region.rate=1.0],'MN996532.2':1.7444533825851103[&default.rate=0.0014477,region.rate=1.0]):1.908271488498043[&default.rate=0.0014477,region.rate=1.0]):0.6131628476025277[&default.rate=0.0014477,region.rate=1.0])[&oct21update_167vir_wgal_2_reg01.subsampled="AATTAGTTATTAATTATACTGCGTGATTGCACTAAGCATGCAGCCGAGTGACAGCCACACAGATTTTAAAGTTCGTTTAGAGAACAGATCTACAAGAGATCGAGAGTTGATATTAGGTTTTTACAAATACCTTCCCAGGTAACAAAACCAACCAACTCTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTTGGCTGTATGCTTAGTGCACTCACGCAGTATAATTAATAATTAACTTTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTCTTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGCTTTATCAGAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCAACTTGAACAGCCCTATGTGTTCATTAAACGTTCTGATGCTCGAACTGCACCTCATGGCCATGTTATGGTTGAACTGGTAGCAGAACTCGATGGCATTCAGTACGGTCGTAGTGGTGAGACACTTGGTGTTCTTGTTCCTCATGTGGGCGAAACACCTGTGGCTTACCGCAAAGTTCTTCTTCGTAAGAACGGTAATAAAGGAGCTGGTGGCCATAGTTACGGCGCCGATCTAAAGTCTTTTGACTTAGGTGACGAGCTTGGCACTGATCCTTATGAAGATTTTCAAGAAAATTGGAACACTAAACATGGCAGTGGTGTTACCCGTGAACTCATGCGTGAGCTTAACGGAGGAGCATACACTCGCTATGTCGATAACAACTTCTGTGGCCCTGATGGCTATCCTCTTGAGTGCATTAAAGATCTTCTAGCCCGTGCTGGTAAGGCTTC",region="Yunnan"];%

It looks like only the root sequence was correctly parsed and then saved?

niemasd commented 2 years ago

Ah, woops, I only looked at the root and assumed all other (internal) nodes would have the same structure. I'll take a look later to see if I can figure out what's going on!

pekarj commented 2 years ago

Sounds good! This might be helpful (with a regex at the bottom of the linked page):

https://beast.community/nexus_metacomments

niemasd commented 2 years ago

Finally had a bit of time to look at this, and I see what the problem is! The problem is when a node has both edge params and node params as per the BEAST NEXUS metacomments (thanks for the link!). Basically, the node params get overwritten by the edge params. Should be a simple fix

niemasd commented 2 years ago

Okay, added to TreeSwift v1.1.27! Here's an example script:

from treeswift import read_tree_nexus
for node in read_tree_nexus('ancestralStateReconstruction.nexus')['STATE_1600'].traverse_preorder():
    print('LABEL: %s' % node.label)
    if hasattr(node, 'node_params'):
        print('NODE PARAMS: %s' % node.node_params)
    if hasattr(node, 'edge_params'):
        print('EDGE PARAMS: %s' % node.edge_params)
    print()

I also updated the Newick output functionality, which should hopefully properly save the tree with the node and edge params as well. Can you test things out and see if they look okay on your end? In the tree you provided, sequences are now in the node_params attribute of Node objects

niemasd commented 2 years ago

Woops, actually, there's a bug when outputting the Tree object as a Newick string. I'll fix right now

niemasd commented 2 years ago

Okay, I just pushed TreeSwift v1.1.28 that fixes the Newick output functionality as well! Test things out on your end, and once you give me the green light that it seems to be working for you, let me know and I'll close this GitHub Issue

niemasd commented 2 years ago

Haven't heard anything horrendous (yet), so I'll assume it works and close this ticket. If something is still not working, feel free to reopen or create a new ticket 😄