robjohnnoble / ggmuller

Create Muller Plots of Evolutionary Dynamics

The `get_Muller_df` function fails with large datasets. #12

Closed cdeitrick closed 5 years ago

cdeitrick commented 5 years ago

The `get_Muller_df` function fails when given a dataset with a very large number of timepoints, such as a population from the Long-Term Evolution Experiment. The error reported when using a population with ~170 timepoints is:

> Muller_df <- get_Muller_df(edges, population)
Error in `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels,  : 
  factor level [335] is duplicated

Removing most of the timepoints allows the script to work again, while removing mutations from the source file has no effect. I have also checked for duplicate data points in the source files (attached) but found none.

m5_correct.ggmuller.edges.txt m5_correct.ggmuller.populations.txt
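For anyone who wants to repeat the duplicate check described above, here is a minimal base R sketch. It assumes the population table is in long format with columns named `Generation`, `Identity`, and `Population`; the helper name `find_duplicate_rows` and the toy data are hypothetical and are not taken from the attached files.

```r
# Toy population data in long format (column names are an assumption).
population <- data.frame(
  Generation = c(0, 0, 10, 10, 10),
  Identity   = c("a", "b", "a", "b", "b"),
  Population = c(1, 0, 0.6, 0.4, 0.4)
)

# Hypothetical helper: return every row whose (Generation, Identity)
# pair occurs more than once in the table.
find_duplicate_rows <- function(pop_df) {
  key <- pop_df[, c("Generation", "Identity")]
  # duplicated() flags only the later copies, so scan from both ends
  # to flag every member of a duplicated group.
  dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
  pop_df[dup, , drop = FALSE]
}

dups <- find_duplicate_rows(population)
print(dups)
```

An empty result means no genotype is listed twice for the same generation, which is consistent with the report that no duplicates were found.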

robjohnnoble commented 5 years ago

Thanks for reporting this issue. I'll take a look as soon as I can, though it might take a bit longer than usual as I'm about to leave for a week-long conference. My initial guess is that the problem isn't due to the number of timepoints, or else I would have encountered it with my own data.

robjohnnoble commented 5 years ago

The problem seems to be due to the add_start_points function, which is called by get_Muller_df. If I comment out the line that calls add_start_points then the error disappears.

I've yet to figure out exactly why this happens with your data but I don't think it's due to "large datasets". Rather, I suspect it might be because some population sizes change from positive to zero and then back to positive -- a behaviour I didn't anticipate when I wrote the code.
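One way to test that suspicion is to look for genotypes whose size goes from positive to zero and back to positive. The sketch below uses base R only and assumes columns named `Generation`, `Identity`, and `Population`; the helpers `reappears` and `find_reappearing` and the toy data are hypothetical illustrations, not part of ggmuller.

```r
# Toy data: genotype "a" is always present, genotype "b" vanishes and returns.
population <- data.frame(
  Generation = rep(c(0, 10, 20, 30), times = 2),
  Identity   = rep(c("a", "b"), each = 4),
  Population = c(1, 0.5, 0.5, 0.2,
                 0, 0.5, 0,   0.8)
)

# Hypothetical helper: TRUE if a time-ordered size vector has more than
# one separate run of positive values.
reappears <- function(sizes) {
  runs <- rle(sizes > 0)
  sum(runs$values) > 1
}

# Hypothetical helper: identities whose population reappears after hitting zero.
find_reappearing <- function(pop_df) {
  # Sort by time within each identity so the runs are meaningful.
  pop_df <- pop_df[order(pop_df$Identity, pop_df$Generation), ]
  flags <- tapply(pop_df$Population, pop_df$Identity, reappears)
  names(flags)[flags]
}

reappearing <- find_reappearing(population)
print(reappearing)
```

If this returns no identities for a dataset that still triggers the error, the positive-zero-positive pattern is probably not the cause.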

robjohnnoble commented 5 years ago

It's definitely not due to the number of time points, because the error still occurs after I filter the data to just two time points (using dplyr): `population <- filter(population, Generation %in% c(0, 10000))`.

robjohnnoble commented 5 years ago

Until I figure out the exact cause, I suggest you modify `get_Muller_df` by commenting out the following line: `pop_df <- add_start_points_alt(pop_df, start_positions)`. Then it should work.

robjohnnoble commented 5 years ago

It turns out the error occurs only for population data frames with a very particular characteristic (one or more new populations appear at exactly generation 10,000). The bug was due to how the `add_start_points` function adds new rows to the population data frame. I've fixed it with commit 2a68df7916d500de9020868b6817acea0578cf40.
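To check whether another dataset has the same characteristic, something like the following base R sketch could list the populations that first become positive at a given generation. It assumes columns named `Generation`, `Identity`, and `Population`; the helper `first_appearance` and the toy data are hypothetical, and the sketch only inspects the data without reproducing the bug or the fix.

```r
# Toy data: "a" is present from the start, "b" appears at 5000, "c" at 10000.
population <- data.frame(
  Generation = rep(c(0, 5000, 10000), times = 3),
  Identity   = rep(c("a", "b", "c"), each = 3),
  Population = c(1, 0.7, 0.5,
                 0, 0.3, 0.3,
                 0, 0,   0.2)
)

# Hypothetical helper: earliest generation at which each identity
# has a positive population size.
first_appearance <- function(pop_df) {
  positive <- pop_df[pop_df$Population > 0, ]
  tapply(positive$Generation, positive$Identity, min)
}

first_seen <- first_appearance(population)

# Identities that first appear at exactly generation 10,000.
new_at_10000 <- names(first_seen)[first_seen == 10000]
print(new_at_10000)
```

Any identity listed here is a population of the kind described above.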