Suggestion: sampletree supports 'shortest' and 'longest'

richelbilderbeek commented 6 years ago

Currently, sampletree supports the sampling methods random, oldest and youngest. I suggest adding shortest (sampling the shortest branch lengths) and longest (sampling the longest branch lengths), as these would yield consistently shorter and consistently longer branch lengths, respectively.

Problem

Imagine being interested in the effect of sampling on branch length distributions. The two simplest, equally likely scenarios in which sampling has an effect are P1=(A1, (B, A2)); and P2=((A1, B), A2);. For P1, sampling youngest results in shorter branch lengths, whereas for P2 it results in longer branch lengths.

See below for a detailed example.

richelbilderbeek commented 6 years ago

P.S. I volunteer to add it, test it and port the documentation from LaTeX to roxygen2.

richelbilderbeek commented 6 years ago

I've verified the reasoning below and checked it with @RaphSch. The Rmd and PDF files are here: pbd_sampling.zip. Note that nowadays I suggest the names MRS (Most Recent Sister) and MDS (Most Distant Sister) instead.

This is an incipient phylogeny in which everything works as expected:

P1

          +---+---+---+ 1-4
  +---+---+
  |       +---+---+===+ 3-3
 -+
  |
  |   +---+---+===+===+ 2-2
  +===+
      +===+===+===+===+ 1-1

Using youngest, that is, picking taxon 1-4 to represent species 1, results in the shorter branch length distribution:

P1: YOUNGEST (should have shorter branches)

          +===+===+===+ 1-4 (YOUNGEST)
  +===+===+
  |       +===+===+===+ 3-3
 -+
  |
  +===+===+===+===+===+ 2-2

P1: OLDEST

  +===+===+===+===+===+ 3-3
 -+
  |
  |   +===+===+===+===+ 2-2
  +===+
      +===+===+===+===+ 1-1 (OLDEST)

Now, we reverse the times at which species 2-2 and 3-3 started speciation, that is, when they started being incipient species (note that they finish speciation in the same order):

P2
      +---+---+---+---+ 1-4
  +---+ 
  |   +---+---+---+===+ 3-3
 -+
  |
  |       +---+===+===+ 2-2
  +===+===+
          +===+===+===+ 1-1

Now we see that oldest has the shorter branches:

P2: YOUNGEST (should have shorter branches)

      +===+===+===+===+ 1-4 (YOUNGEST)
  +===+ 
  |   +===+===+===+===+ 3-3
 -+
  |
  +===+===+===+===+===+ 2-2

P2: OLDEST

  +===+===+===+===+===+ 3-3
 -+
  |
  |       +===+===+===+ 2-2
  +===+===+
          +===+===+===+ 1-1 (OLDEST)

This is caused, more or less, by the fact that PBD::sampletree orders taxa by their speciation initiation time (3rd column in the L table).
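To make the comparison concrete, here is a minimal Python sketch that tallies the total branch length of the four sampled trees drawn above. It is not part of PBD: the nested-tuple encoding and the name `total_branch_length` are made up for illustration, and the branch lengths are read off the diagrams under the assumption that one `+===` segment equals one time unit.

```python
# A leaf is (label, branch_length); an internal node is ([children], branch_length).
# Branch lengths are transcribed from the ASCII diagrams, assuming one
# "+===" segment equals one time unit. The root has branch length 0.

def total_branch_length(node):
    """Sum the lengths of all branches at or below this node."""
    first, length = node
    if isinstance(first, str):  # leaf: first element is the taxon label
        return length
    return length + sum(total_branch_length(child) for child in first)

SAMPLED_TREES = {
    ("P1", "youngest"): ([([("1-4", 3), ("3-3", 3)], 2), ("2-2", 5)], 0),
    ("P1", "oldest"): ([("3-3", 5), ([("2-2", 4), ("1-1", 4)], 1)], 0),
    ("P2", "youngest"): ([([("1-4", 4), ("3-3", 4)], 1), ("2-2", 5)], 0),
    ("P2", "oldest"): ([("3-3", 5), ([("2-2", 3), ("1-1", 3)], 2)], 0),
}

TOTALS = {key: total_branch_length(tree) for key, tree in SAMPLED_TREES.items()}

for (phylogeny, method), total in sorted(TOTALS.items()):
    print(phylogeny, method, total)
# P1 oldest 14
# P1 youngest 13
# P2 oldest 13
# P2 youngest 14
```

Under this reading, youngest gives the shorter total in P1 and oldest gives the shorter total in P2, which matches the diagrams.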

To get a consistently shorter branch length distribution, I suggest adding mrca (Most Recent Common Ancestor) and mdca (Most Distant Common Ancestor) as sampling methods to the PBD package:

P1

          +---+---+---+ 1-4
  +---+---+
  |       +---+---+===+ 3-3
 -+
  |
  |   +---+---+===+===+ 2-2
  +===+
      +===+===+===+===+ 1-1

P1: MRCA (shorter branch length distribution)

          +===+===+===+ 1-4
  +===+===+
  |       +===+===+===+ 3-3
 -+
  +===+===+===+===+===+ 2-2

P1: MDCA

  +===+===+===+===+===+ 3-3
 -+
  |   +===+===+===+===+ 2-2
  +===+
      +===+===+===+===+ 1-1

And for the other phylogeny:

P2
      +---+---+---+---+ 1-4
  +---+ 
  |   +---+---+---+===+ 3-3
 -+
  |
  |       +---+===+===+ 2-2
  +===+===+
          +===+===+===+ 1-1

P2: MRCA (shorter branch length distribution)

  +===+===+===+===+===+ 3-3
 -+
  |       +===+===+===+ 2-2
  +===+===+
          +===+===+===+ 1-1

P2: MDCA

      +===+===+===+===+ 1-4
  +===+ 
  |   +===+===+===+===+ 3-3
 -+
  +===+===+===+===+===+ 2-2
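One possible reading of the proposed rule, sketched in Python: for a species with several candidate taxa, mrca picks the candidate that diverged from its sister most recently, and mdca picks the one that diverged longest ago. This is only an illustration, not PBD code. The function `pick_representative` and its input format are hypothetical, and the divergence times are read off the diagrams above, counted in segments from the root.

```python
def pick_representative(split_times, method):
    """Pick the taxon that represents a species.

    split_times maps each candidate taxon to the time (measured from the
    root) at which it diverged from its sister in the sampled tree.
    Ties are not handled specially; the first candidate found wins.
    """
    if method == "mrca":  # most recent divergence, hence shortest tip branch
        return max(split_times, key=split_times.get)
    if method == "mdca":  # most distant divergence, hence longest tip branch
        return min(split_times, key=split_times.get)
    raise ValueError("unknown sampling method: " + repr(method))

# Candidates for species 1, with divergence times taken from the diagrams
# (assumption: one segment equals one time unit).
P1_SPECIES_1 = {"1-1": 1, "1-4": 2}
P2_SPECIES_1 = {"1-1": 2, "1-4": 1}

print(pick_representative(P1_SPECIES_1, "mrca"))  # 1-4
print(pick_representative(P1_SPECIES_1, "mdca"))  # 1-1
print(pick_representative(P2_SPECIES_1, "mrca"))  # 1-1
print(pick_representative(P2_SPECIES_1, "mdca"))  # 1-4
```

Unlike oldest and youngest, this choice depends on where the candidate sits in the sampled tree rather than on its speciation initiation time, which is why it gives the shorter tree in both P1 and P2.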

rsetienne / PBD

Suggestion: sampletree supports 'shortest' and 'longest' #17
