rosemckeon / ploidy

How does disturbance on a landscape affect the establishment of new polyploid plant species?
3 stars 0 forks

Merge changes from Rose's branch #29

Closed · rosemckeon closed this 5 years ago

rosemckeon commented 5 years ago

Hi @bradduthie, I think I have reproduction working! It's not very fast, but I've tweaked the default parameters so you can get a simulation with 5 generations in a reasonable time.

bradduthie commented 5 years ago

Hi @rozeykex -- just pulled and ran the defaults. This looks good, but I'm trying to figure out why reproduce is so slow. It might be bind_rows on L39? I think that bind_rows is better than rbind, but I'm not sure if the memory is managed in the same (inefficient) way. Initialising a table of known length and then filling in rows could potentially speed it up a lot. Consider:

start_time <- Sys.time(); # Start the clock
dat <- rep(NA, 1000000)
for(i in 1:1000000){
  dat[i] <- i;
}
Sys.time() - start_time; # I got a third of a second here

# Below takes several minutes, at least.
dat <- NULL;
for(i in 1:1000000){
  dat <- c(dat, i);
}

The c above appends to the vector in a similar way to how rbind (and maybe bind_rows) works. I'll take a closer look here soon!

rosemckeon commented 5 years ago

Thanks @bradduthie, I've used bind_rows quite a lot, as the dplyr docs say it's efficient on big datasets. I'm sure it can be faster though! I got a bit tangled up in the reproduction functions, so there may be a few places where I've gone a long-winded way round things.

bradduthie commented 5 years ago

@rozeykex I suspect bind_rows is relatively efficient when it's called to reorganise data sets, especially relative to rbind. My fear is that, when it's called multiple (potentially hundreds to thousands of) times within a simulation, it is still orders of magnitude slower (no exaggeration) than would be possible if the table could first be built and then filled in. Would it be possible to figure out in advance how many rows need to be built here in pop_out -- e.g., some function that first works out how many gametes are going out (e.g., nrow(pop) * N_gametes)? If you can build pop_out with the correct number of rows, then fill in the relevant rows within the loop as you go, it might cut the time way down (assuming this is the slow bit).
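A rough sketch of the pre-allocation idea in R -- the columns here are placeholders, not the actual schema of pop_out:

```r
# Sketch: build pop_out at its final size, then fill rows in place.
# Assumes pop is a data frame and each row yields N_gametes gametes;
# the columns used here are hypothetical, not the real ones.
N_gametes <- 10
pop <- data.frame(id = 1:100)

n_out   <- nrow(pop) * N_gametes          # total rows known in advance
pop_out <- data.frame(parent = rep(NA_integer_, n_out),
                      gamete = rep(NA_integer_, n_out))

row <- 1
for (i in seq_len(nrow(pop))) {
  for (g in seq_len(N_gametes)) {
    pop_out$parent[row] <- pop$id[i]   # fill in place, no bind_rows call
    pop_out$gamete[row] <- g
    row <- row + 1
  }
}
```

Because pop_out never grows, R allocates its memory once instead of copying the whole table on every append.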

rosemckeon commented 5 years ago

Thanks @bradduthie, that sounds feasible. I was thinking of adding more output messages throughout the reproduction functions to see where it hangs the most, then having a bash at changing how those tables are filled. I'd wondered if it was the way gametes are made, but there are no bind events in there, so I imagine you're right that it's where they are output at L39. At the moment all the gametes are made, and then some are removed to simulate pollen loss. Maybe by deciding how many pollen grains to bother making first, I could save some time there too?
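One way that could look (a sketch only -- pollen_loss and the binomial draw are assumptions, not the simulation's actual parameters): draw how many grains survive first, then generate only those.

```r
# Sketch: instead of making N_gametes pollen grains and discarding some,
# draw the number of survivors up front and make only that many.
# pollen_loss is a hypothetical per-grain loss probability.
set.seed(1)
N_gametes   <- 100
pollen_loss <- 0.8

n_surviving <- rbinom(1, size = N_gametes, prob = 1 - pollen_loss)
# n_surviving is ~20 on average, so ~80% of the gamete-making work is skipped
```

The distribution of survivors is identical to make-then-remove, but the work of building the discarded gametes never happens.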

bradduthie commented 5 years ago

@rozeykex I think this is a good idea! Especially since things just started slowing down a bit; if you can pinpoint most of the inefficiency to a few places in the code, then we can potentially rework the area to make things much faster (worst case scenario, we can add a bit of C or C++). And if you can somehow check to see how much pollen needs to be made before it is removed to simulate pollen loss, this would probably make a difference, especially if a lot of pollen is typically (and randomly) removed.

rosemckeon commented 5 years ago

@bradduthie I've been playing with outputting the time each part of reproduction takes (on the loop_improve branch), and the bulk of it is in sampling the alleles, not storing them as gametes. Putting bind_rows within do.call didn't change the speed noticeably (I've left both methods in so you can flip between them and compare). If you pull that branch and run a couple of generations, you'll see that choosing the alleles takes up most of the time spent creating gametes, and that creating gametes takes up most of the time of reproduction. You'll need to install a package first to see the timing messages.

devtools::install_github("jabiru/tictoc")
library(tictoc)
sim <- disturploidy(generations = 2, N_gametes = 10)

I'll change it to create only those that go on to fertilise each other, but have I gone about sampling the alleles for gametes in a weird/overly complex way? In choose_alleles I gather every row of the genome and sample 1 allele at random from each row to sit at the same locus in the new gamete. So each gamete ends up with a slightly different selection of alleles, each of which stays at its specific locus, and is a mixture of paternal and maternal genes.

I've tried taking out the for loop in create_gametes (here) that calls choose_alleles, and this makes everything 10 times faster. But I can't figure out how to give all my gametes a different sequence of alleles without the for loops.
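For what it's worth, here is one vectorised way to pick one of the two alleles per locus for many gametes in a single matrix lookup, with no per-gamete loop. It's a sketch under assumptions: genome is taken to be a matrix with one row per locus and two columns (maternal, paternal), which may not match the real choose_alleles data structure.

```r
# Sketch: vectorised allele choice across N gametes at once.
# genome: one row per locus, columns = c(maternal, paternal).
set.seed(1)
n_loci <- 5
genome <- matrix(1:(2 * n_loci), nrow = n_loci,
                 dimnames = list(NULL, c("maternal", "paternal")))
N <- 3  # number of gametes to make

# For every (locus, gamete) pair, pick column 1 or 2 independently...
picks <- matrix(sample(1:2, n_loci * N, replace = TRUE), nrow = n_loci)

# ...then pull all chosen alleles in one matrix-index lookup.
gametes <- matrix(genome[cbind(rep(seq_len(n_loci), N), as.vector(picks))],
                  nrow = n_loci)  # one column per gamete
```

Each gamete (column) gets an independent mix of maternal and paternal alleles, with every allele staying at its own locus (row), so the variation should match the looped version.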

rosemckeon commented 5 years ago

I've tried replicate(N, choose_alleles(genome)) and that works, but it put the time back to 10 times slower again.

rosemckeon commented 5 years ago

@bradduthie Never mind the above (unless I have gone about it in a weird way!) -- I think I've fixed it. I've sped up the sampling of alleles by about 100 times while keeping the same method and variation as described above. Choosing N_gametes > 100 is now feasible, so I'll merge the loop_improve branch into Rose.

bradduthie commented 5 years ago

@rozeykex -- nice! Feel free to get rid of loop_improve then.