graemetlloyd / metatree

An R package for compiling meta-analytical trees (phylogenies).

We should compare metatree::PaleobiologyDBTreeBuilder and paleotree::makePBDBtaxonTree #2

Open dwbapst opened 5 years ago

dwbapst commented 5 years ago

Would be good to know how two independently coded attempts at pulling taxon-trees from the PBDB differ (or not) in their behavior. I'm constantly worried that there might be additional taxonomic information my current algorithm is missing...

graemetlloyd commented 5 years ago

Sure! Although as best as I can tell your function requires a separate download first and then builds the tree from that?

Mine builds one directly from the database. This means currently it doesn't do anything more than get a topology - no time-scaling. Although I would like to add this.

dwbapst commented 5 years ago

Not exactly. My function does require pre-existing taxonomic downloads using the /taxa call, so there is that initial step of getting the data (which just means your function wraps what two of my functions do). It can work solely from that downloaded table, but preferably you have live access so you can trace when taxa are (somehow) missing their parent in the taxon download (which shouldn't happen if you downloaded an entire clade, but does).

But yeah we should look at how many taxa are recovered by both, how resolved the trees are, etc...
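As a rough illustration of the comparison proposed above, here is a minimal Python sketch. It assumes each tree has already been exported as a child-to-parent table (a dict), which is not the native output of either package. The function names, the toy labels, and the resolution measure (branching nodes divided by the maximum possible for that number of tips) are all assumptions made for this example.

```python
from collections import Counter


def tips(parent_of):
    """Return the taxa that never appear as anyone's parent."""
    parents = set(parent_of.values())
    return {taxon for taxon in parent_of if taxon not in parents}


def resolution(parent_of):
    """Branching nodes divided by the maximum possible (n_tips - 1).

    Nodes with a single child are ignored, since taxon trees often
    contain monotypic higher taxa that add no resolution.
    """
    n_tips = len(tips(parent_of))
    if n_tips < 2:
        return 1.0
    child_counts = Counter(p for p in parent_of.values() if p is not None)
    branching = sum(1 for n in child_counts.values() if n >= 2)
    return branching / (n_tips - 1)


def compare(tree_a, tree_b):
    """Summarise tip overlap and resolution for two child-to-parent tables."""
    tips_a, tips_b = tips(tree_a), tips(tree_b)
    return {
        "shared": tips_a & tips_b,
        "only_a": tips_a - tips_b,
        "only_b": tips_b - tips_a,
        "resolution_a": resolution(tree_a),
        "resolution_b": resolution(tree_b),
    }


# Toy data with made-up labels: tree_a is fully resolved,
# tree_b is a single polytomy with one extra tip.
tree_a = {"root": None, "n1": "root", "t3": "root", "t1": "n1", "t2": "n1"}
tree_b = {"root": None, "t1": "root", "t2": "root", "t3": "root", "t4": "root"}

report = compare(tree_a, tree_b)
for key, value in report.items():
    print(key, value)
```

Matching tips by name is the simplest choice here; a real comparison would probably need to match on taxon identifiers instead, since the two functions may label the same taxon differently.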

graemetlloyd commented 5 years ago

"preferably you have live access so you can trace when taxa are (somehow) missing their parent in the taxon download (which shouldn't happen if you downloaded an entire clade, but does)"

Oh boy...

So this shouldn't happen AFAIK. There definitely are what I call "orphans" in the database - my other function (metatree::PaleobiologyDBTaxaQuerier) will stop and warn the user if these are found. So I guess you are querying the database in a very different way to me?

dwbapst commented 5 years ago

Well, yes, it does happen, and it happens when a parent node is incorrectly marked invalid, or a taxon is its own parent via synonymization, so if you ask for all the valid taxa in a given clade, you might be missing some taxon's parent node. I think I discuss this at length in the paleotree doc - it was quite a headache until I finally coded the 'live updating' algorithm to pull information even for invalid parent taxa so as to complete the tree.

For what it's worth, the paleotree function is actually three separate algorithms - one that does parent-child mapping on a limited closed-loop dataset, one that does that but can live-update via the API, and one that instead reconstructs the tree from the existing higher-taxon labels in the Linnaean taxon categories given by the PBDB. The preferred one is the live-update one.
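The missing-parent problem and the live-update idea described above can be sketched in a few lines of Python. This is not the paleotree implementation, only an illustration under stated assumptions: the download is reduced to a child-to-parent dict, and `lookup_parent` is a hypothetical callable standing in for a live database query (here faked with a small dict). All names in the block are invented for the example.

```python
def find_orphans(parent_of, root):
    """Taxa whose recorded parent is absent from the table,
    or that are recorded as their own parent."""
    return sorted(
        taxon
        for taxon, parent in parent_of.items()
        if taxon != root and (parent == taxon or parent not in parent_of)
    )


def fill_missing_parents(parent_of, root, lookup_parent, max_steps=50):
    """Return a copy of the table with missing ancestors added.

    lookup_parent(name) is a hypothetical stand-in for a live query:
    it returns the parent's name, or None if nothing is known.
    """
    completed = dict(parent_of)
    for taxon in find_orphans(parent_of, root):
        current = taxon
        for _ in range(max_steps):  # guard against endless climbing
            parent = completed.get(current)
            if parent is None or parent == current:
                # Self-parent (or no parent): ask for the real one.
                parent = lookup_parent(current)
                if parent is None or parent == current:
                    break  # cannot be resolved, stays an orphan
                completed[current] = parent
            if parent in completed:
                break  # reconnected to the existing table
            # Parent is absent (e.g. flagged invalid): fetch its own
            # parent and keep climbing towards the root.
            grandparent = lookup_parent(parent)
            if grandparent is None:
                break
            completed[parent] = grandparent
            current = parent
    return completed


# Toy table of "valid" taxa: b points at x, which was not downloaded,
# and c is listed as its own parent.
table = {"root": None, "a": "root", "b": "x", "c": "c"}
fake_database = {"x": "a", "c": "a"}  # stand-in for live lookups

fixed = fill_missing_parents(table, "root", fake_database.get)
print(find_orphans(table, "root"))
print(find_orphans(fixed, "root"))
```

Without the lookup step, the only options for `b` and `c` are to drop them or to stop with a warning, which is roughly the difference between the closed-loop and live-update approaches.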

graemetlloyd commented 5 years ago

Hmmmm. Well that might be an issue then as my code only returns valid taxa, either as tips or nodes.