tgvaughan / MASTER

A versatile simulation engine for stochastic population dynamics models.
http://tgvaughan.github.io/MASTER
GNU General Public License v3.0
22 stars 8 forks source link

Sample lineages from different populations #32

Closed sdwfrost closed 10 years ago

sdwfrost commented 10 years ago

Dear @tgvaughan

This is more of a feature request, and hence can be overlooked! I've been looking at structured populations, and I'd like to sample specific numbers of lineages from different populations. I can do this by reading in the full tree, then sampling from that, but my R code to process MASTER trees is very slow (listed below, FYI)

Is it possible to extend LineageSampler to give conditions for the lineages of different populations?

A related issue would be to give conditions such that the number of lineages always exceeds the requested number of lineages. Is there a way to do this?

process.master <- function(nexfilename){
  nexfile <- scan(nexfilename,what=list(character()),quiet=TRUE)
  # find which lines contain "tree " using grepl
  #treeidx <- unlist(lapply(nexfile[[1]],grepl,pattern="tree "))
  treeidx <- (nexfile[[1]])=="tree"
  nlines <- length(treeidx)
  treelabels <- as.integer(unlist(lapply(strsplit(nexfile[[1]][seq(1,nlines)[treeidx]+1],"_"),"[",2)))+1
  # count the number of trees
  numtrees <- sum(treeidx)
  # calculate start and end lines for each tree
  start <- seq(1,nlines)[treeidx]
  end <- c(start[2:length(start)]-1,nlines-1)
  treelist <- list()
  for(i in 1:numtrees){
    enwk <- nexfile[[1]][start[i]:end[i]][4]
    # delete reactions and times
    nwk <- gsub(",reaction(.*?)\\]","",enwk)
    # delete isolated times
    nwk <- gsub(",time(.*?)\\]","",nwk)
    # get rid of type info
    nwk <- gsub("\\[&type=\"","",nwk)
    # get rid of quotes and newlines
    nwk <- gsub("\"","",nwk)
    nwk <- gsub("\n","",nwk)
    tree <- read.newick(text=nwk)
    if(length(tree$tip.label)>1){
    tree <- collapse.singles(tree)
    }else{
      tree <- NA
    }
    treelist[[treelabels[i]]] <- tree
  }
  treelist
}
tgvaughan commented 10 years ago

Hi @sdwfrost

In response to your first question: I really like the idea of the structured tree extension, and will definitely add this. I agree it's best to do this in MASTER; string processing in R is computationally expensive. In the mean-time, perhaps it would be better to use Python/perl to do the processing you're doing above? Or is it APE's parsing of the Newick files that's causing the slow-down?

Regarding your second: the closest related feature that currently exists in MASTER is the leafCountEndCondition, which stops the simulation once a particular number of leaves has been reached. It's currently fairly basic, however, and you can't specify a population that the leaves must belong to or a reaction that they must be related to.

Actual post-simulation conditioning on leaf count doesn't currently exist at all, unfortunately. However, there are a number of people asking questions about this (myself included!) so I definitely want to add this in before the next release.

Thanks for your suggestions!

sdwfrost commented 10 years ago

Hi @tgvaughan

Thanks for considering the feature. I coded up the text processing in Python, which speeded things up considerably. I also have some code to plot output, plus some more example models. If you think these might be useful (once I've worked out all the kinks), I'd be happy to submit a pull request.

tgvaughan commented 10 years ago

Hi @sdwfrost. Yes, definitely send me a pull request if you have some nice examples! I'd prefer to keep the plotting code outside of the main repository, though. (I have one little R script in there, but that's used in an online tutorial.) Maybe in time we can set up a "contributed utilities" repository though?

tgvaughan commented 10 years ago

Hi again @sdwfrost, I've just put out a new release with a modified LineageSampler that I think should do what you want. I've also included a new LeafCountPostSimCondition that allows you to condition the simulation based on the leaf count. (This conditioning is performed after any sampling/filtering.)

I've updated the wiki page for LineageSampler to describe the new version, and there's a new LeafCountPostSimCondition page which can be reached from either of the InheritanceTrajectory or InheritanceEnsemble pages describing the new conditioning. Also, the following example input files (found in examples/) might be helpful:

I think this closes the issue, but feel free to reopen it if this doesn't fully address your feature request.

ekankaka commented 1 year ago

Hi @tgvaughan , is it possible to sample a transmission tree at multiple timepoints using the inheritance post processor, e.g. if i want to sample 10 leaves at sampling times 10 and 20 each, and produce a tree with the sampled leaves and their ancestors?

    <inheritancePostProcessor spec='LineageSampler'
                              nSamples="10"
                              **samplingTime="10,20"**
                              reverseTime="false"
                              noClean="false"/>
tgvaughan commented 1 year ago

Hi @ekankaka, as I mentioned in the issue you raised, I'd try remaster for this instead. The documentation is here.