jeetsukumaran / DendroPy

A Python library for phylogenetic scripting, simulation, data processing and manipulation.
https://pypi.org/project/DendroPy/.
BSD 3-Clause "New" or "Revised" License
210 stars 61 forks source link

[User Error] Bug: Error using TreeList and contained coalescent tree #61

Closed yasuiniko closed 8 years ago

yasuiniko commented 8 years ago

[Edit: added code blocks] Bug Description Hi @jeetsukumaran,

I'm trying to make a bunch of contained coalescent trees, and when I use a TreeList to hold my species trees I get a weird error (traceback below). If I use a regular list there's no problem. I don't think my use case is that strange, since it seems like we're meant to use TreeLists as an efficient list of trees when their taxon namespaces are all the same.

Sorry for butchering all your function names in the demo code.

Best, Niko

Traceback

Traceback (most recent call last):
  File "debug.py", line 49, in <module>
    gene_trees = list(map(make_ctrees, sp_trees))
  File "debug.py", line 47, in <lambda>
    range(n_gene_trees)))
  File "debug.py", line 46, in <lambda>
    make_ctrees = lambda tree: list(map(lambda x: cc_tree(tree, si_map),
  File "/Users/nikoyasui/anaconda3/lib/python3.5/site-packages/dendropy/model/coalescent.py", line 524, in contained_coalescent_tree
    uncoal = coalesce_nodes(nodes=pop_node_genes[edge.head_node],
KeyError: <Node object at 0x1038e7198: 'None' (<Taxon 0x1038e7518 'C'>)>

Demo Code

import dendropy as dp
import string

taxa_map = dp.TaxonNamespaceMapping.create_contained_taxon_mapping
bd_tree = dp.simulate.birth_death_tree
cc_tree = dp.simulate.treesim.contained_coalescent_tree

c = 1
n_sp_trees = 2
n_gene_trees = 1000
n_sp = 5
n_ind = 2 # can be int or list

Ne = 10000
sp_depth = Ne*c

species = dp.TaxonNamespace(string.ascii_uppercase[:n_sp])

if "Creating a list is OK":
    sp_trees = list(map(lambda x: bd_tree(birth_rate=1.0,
                                          death_rate=0,
                                          taxon_namespace=species), 
                        range(n_sp_trees)))

if "Creating a TreeList breaks everything":
    sp_trees = dp.TreeList(map(lambda x: bd_tree(birth_rate=1.0,
                                                 death_rate=0,
                                                 taxon_namespace=species), 
                           range(n_sp_trees)))

elif "Creating a TreeList from a list also breaks everything":
    sp_trees = dp.TreeList(list(map(lambda x: bd_tree(birth_rate=1.0,
                                                      death_rate=0,
                                                      taxon_namespace=species), 
                                    range(n_sp_trees))))

for tree in sp_trees:
    for edge in tree.postorder_edge_iter():
        setattr(edge, 'pop_size', Ne)

si_map = taxa_map(containing_taxon_namespace=species,
                  num_contained=n_ind)

make_ctrees = lambda tree: list(map(lambda x: cc_tree(tree, si_map),
                                    range(n_gene_trees)))

gene_trees = list(map(make_ctrees, sp_trees))

System Info OS Version: OSX 10.11.5 (15F34) DendroPy version : DendroPy 4.1.0 DendroPy location : /Users/nikoyasui/anaconda3/lib/python3.5/site-packages/dendropy Python version : 3.5.1 |Anaconda custom (x86_64)| (default, Dec 7 2015, 11:24:55) [GCC 4.2.1 (Apple Inc. build 5577)] Python executable : /Users/nikoyasui/anaconda3/bin/python3 Python site packages : ['/Users/nikoyasui/anaconda3/lib/python3.5/site-packages']

jeetsukumaran commented 8 years ago

The purpose of a TreeList is not so much efficiency, but (primarily) to manage a collection of trees that reference the same TaxonNamespace. So, by design, any trees added to a TreeList will have their taxa remapped into the TaxonNamespace of the TreeList.

The problem with your code is that while you indeed do carefully manage the taxon namespace of your generated trees, you throw all of that away when you do not ensure that your TreeList instance references that same TaxonNamespace.

So, change your code from:

     sp_trees = dp.TreeList(map(lambda x: bd_tree(birth_rate=1.0,
                                                  death_rate=0,
                                                  taxon_namespace=species),
                            range(n_sp_trees)))

to

     sp_trees = dp.TreeList(map(lambda x: bd_tree(birth_rate=1.0,
                                                  death_rate=0,
                                                  taxon_namespace=species),
                            range(n_sp_trees)),
                            taxon_namespace=species) ## here we 
explicitly give the TreeList instance the correct TaxonNamespace object

On an entirely different note, is there any reason you are using maps instead of simple loops here? Perhaps it is just my functional-programming-crippled mind, but IMHO, for what you are trying to do, there is not anything gained, and the logic is obscured and the denseness of the constructs makes it easy for simple errors to creep in! ;) Of course, I totally understand if this flow seems more natural to you .... different strokes for different folks and all!

On 7/5/16 11:03 PM, Niko Yasui wrote:

Bug Description Hi @jeetsukumaran https://github.com/jeetsukumaran,

I'm trying to make a bunch of contained coalescent trees, and when I use a TreeList to hold my species trees I get a weird error (traceback below). If I use a regular list there's no problem. I don't think my use case is that strange, since it seems like we're meant to use TreeLists as an efficient list of trees when their taxon namespaces are all the same.

Sorry for butchering all your function names in the demo code.

Best, Niko

P.S. I figured out how to properly format code in Markdown, so no more code images! Yay!

Traceback screen shot 2016-07-06 at 11 30 10 https://cloud.githubusercontent.com/assets/7184841/16605892/31716018-436e-11e6-8b99-f73420f025f2.png

System Info OS Version: OSX 10.11.5 (15F34) DendroPy version : DendroPy 4.1.0 DendroPy location : /Users/nikoyasui/anaconda3/lib/python3.5/site-packages/dendropy Python version : 3.5.1 |Anaconda custom (x86_64)| (default, Dec 7 2015, 11:24:55) [GCC 4.2.1 (Apple Inc. build 5577)] Python executable : /Users/nikoyasui/anaconda3/bin/python3 Python site packages : ['/Users/nikoyasui/anaconda3/lib/python3.5/site-packages']

Demo Code

import dendropy as dp import string

taxa_map = dp.TaxonNamespaceMapping.create_contained_taxon_mapping bd_tree = dp.simulate.birth_death_tree cc_tree = dp.simulate.treesim.contained_coalescent_tree

c = 1 n_sp_trees = 2 n_gene_trees = 1000 n_sp = 5 n_ind = 2 # can be int or list

Ne = 10000 sp_depth = Ne*c

species = dp.TaxonNamespace(string.ascii_uppercase[:n_sp])

if "Creating a list is OK": sp_trees = list(map(lambda x: bd_tree(birth_rate=1.0, death_rate=0, taxon_namespace=species), range(n_sp_trees)))

elif "Creating a TreeList breaks everything": sp_trees = dp.TreeList(map(lambda x: bd_tree(birth_rate=1.0, death_rate=0, taxon_namespace=species), range(n_sp_trees)))

elif "Creating a TreeList from a list also breaks everything": sp_trees = dp.TreeList(list(map(lambda x: bd_tree(birth_rate=1.0, death_rate=0, taxon_namespace=species), range(n_sp_trees))))

for tree in sp_trees: for edge in tree.postorder_edge_iter(): setattr(edge, 'pop_size', Ne)

si_map = taxa_map(containing_taxon_namespace=species, num_contained=n_ind)

make_ctrees = lambda tree: list(map(lambda x: cc_tree(tree, si_map), range(n_gene_trees)))

gene_trees = list(map(make_ctrees, sp_trees))

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/jeetsukumaran/DendroPy/issues/61, or mute the thread https://github.com/notifications/unsubscribe/AABmR_rkUabvU31sPyjHLsDXxwwwuT8Zks5qSxrxgaJpZM4JFtwB.


Jeet Sukumaran

jeetsukumaran@gmail.com

Blog/Personal Pages: http://jeetworks.org/ GitHub Repositories: http://github.com/jeetsukumaran Photographs (as stream): http://www.flickr.com/photos/jeetsukumaran/ Photographs (by galleries):

http://www.flickr.com/photos/jeetsukumaran/sets/

yasuiniko commented 8 years ago

@jeetsukumaran,

Thanks so much for the help! I must have misread the docs; I thought TreeList just grabbed the taxon_namespace of the first tree it saw.

As for the functional programming, it's easier for me to hold the code in my head using maps. Even though for loops might be faster to read, they take longer for to understand and are more complex than "write a function, apply it to a data structure". I also hear rumors that they are faster than list comprehensions in the latest python...

As for the denseness of the structure, I find that if you throw away your notion of how much time you should be spending staring at X lines of code, and instead think in terms of how much time you should spend thinking about how to accomplish task X, 'dense' becomes 'efficient'.