Run the simulations - Githubissues

rosemckeon commented 5 years ago

May be worth looking at how to schedule these on the server by making a repeating job that sets itself going again once complete.

https://stackoverflow.com/questions/18306362/run-r-script-from-command-line https://linux.die.net/man/1/rscript https://www.interserver.net/tips/kb/task-scheduling-linux-command-line/ https://www.taniarascia.com/how-to-create-and-use-bash-scripts/

https://askubuntu.com/questions/8653/how-to-keep-processes-running-after-ending-ssh-session

bradduthie commented 5 years ago

@rozeykex Should work -- can also use a BASH loop, I think.

rosemckeon commented 5 years ago

Noting here what we discussed today about the simulations we'll run @bradduthie (once I've found the sweet spot for the NULL simulations I'll let you know here too).

[x] Run NULL tests first to see what pattern emerges if there is no advantage to being polyploid and all individuals can mate freely with an equal chance of success.
[ ] Vary ploidy_growth_benefit.
[ ] Vary disturbance_freq.
[ ] Vary disturbance_mortality_prob
[ ] Vary pollen_range
[ ] Vary clonal_growth

Consider varying seedling_selection_constant later to change the strength of selection against size.

rosemckeon commented 5 years ago

Hi @bradduthie, I'm starting to get some data in now. Haven't got as many runs in over the weekend as I wanted, but I think I've got a good stable NULL point now that doesn't take too long to run. I've saved it here as a script you can run with Rscript simulations.R from a unix/BASH prompt. There's an analysis file with some quick plots to check the pop levels in the data once it's finished.

Seed survival seems to be the key parameter I've had to tweak gently to get population levels right. In my tests, over the weekend, I had this a little higher at 0.535 and all 4 sims got to 100 nicely but it looked like some were about to boom and go wild. The set of data I pushed yesterday was with that set a little lower (0.5325) and there was too much extinction. I have some running right now at 0.534 and they seem to be doing well I've got 1 of 2 sims that have reached 200.

bradduthie commented 5 years ago

Hi @rozeykex -- sounds good. I'll start processes of this in parallel and see how it goes!

rosemckeon commented 5 years ago

Oh, @bradduthie! I should also have said: I have runs <- 2:4 being covered. If you do 5+ and let me know your top limit then we can make sure we don't push conflicting data and log files. I've kept sims set as 4 as it's a nice number to have in one data file for facets and filesize.

bradduthie commented 5 years ago

Ah! @rozeykex I was actually just copying the whole directory locally four times and running in parallel to get the data. Should I also try a 5:7 assigned to runs to see how that works?

rosemckeon commented 5 years ago

Yep! @bradduthie I've made it so it outputs everything pretty neatly. Rather than copying the folder many times, to run many in parallel you could just copy the simulations file and have different values assigned to runs. Then all the output will just organise itself within the analysis/data/ folder.

bradduthie commented 5 years ago

@rozeykex Okay, cool! Do I just need to change runs <- 5:7, simulate, and then push then?

The local version is running fine -- Simulation 3 Generation 108.

rosemckeon commented 5 years ago

@bradduthie sorry, been away from the computer! Yep. That will make sure the .rds files and log .txts that you push are all named differently to mine. The data objects reference the logs that go with them.

bradduthie commented 5 years ago

@rozeykex Okay, will start on this tomorrow morning!

rosemckeon commented 5 years ago

@bradduthie I've added runs 8:11 and set 12:15 going with a bit higher seed survival.

rosemckeon commented 5 years ago

Notes from today's meeting with Mario: NULL sims need to model a perennial iteroparous plant system more accurately and remove the seed bank so genomes created in the past are not appearing later to confuse patterns. We'll do this by:

Having seeds germinate or die (no dormancy).
Having adults survive the winter.
Having adults survive at higher rates than juveniles.

From logs, current seedling_selection_constant of 0.25 equates to roughly around 27% seedlings surviving.

rosemckeon commented 5 years ago

Seedlings are now known as juveniles so parameter mentioned above https://github.com/rozeykex/ploidy/issues/61#issuecomment-516507272 now juvenile_selection_constant.

rosemckeon commented 5 years ago

@bradduthie How do these look? https://github.com/rozeykex/ploidy/commit/b1c923a780fb1caa178be1d7c3fd745d83fd1d02
I think they're better. Looks like the population becomes stable when the adults reach the carrying capacity. The juveniles now follow the seeds in number pretty closely. I've run the following in parallel this afternoon to get two pretty identical simulation plots. And it took 3 hrs to complete both to 100 generations.

disturploidy(
    pop_size = 400,
    grid_size = 40,
    juvenile_selection_constant = .1,
    adult_survival_prob = .7,
    inbreeding_sensitivity = 0,
    germination_prob = .5,
    seed_survival_prob = 0,
    ploidy_growth_benefit = 0,
    N_ovules = 25,
    pollen_range = 40,
    fertilisation_prob = .5,
    uneven_matching_prob = .5,
    selfing_diploid_prob = 0,
    selfing_polyploid_prob = 0,
    triploid_mum_prob = .5,
    disturbance_freq = 1000
  )

About 10% of juveniles survive, compared to 70% adults, so every season we have a huge influx of new juveniles. (I've checked in the logs and it seems that the selection constant does equate to rough proportion).

Does this look like NULL yet? Thinking about turning off ploidy_rate for the NULL to save computation time (if ploidy rises in the system allele sampling of seeds takes longer).

I've pushed the data and plots anyway, let me know what you think :)

rosemckeon commented 5 years ago

I may have made a fatal error... Running things in parallel how I suggested with just different simulation scripts means that every generation the data is stored to the same rda file. Damn.

bradduthie commented 5 years ago

I think that the plots look a lot better @rozeykex -- at least things are stabilising (might want to keep the sims going for at least a hundred generations after stabilisation). Three hours is not too bad, all things considered. Running a set of simulations overnight should still give you a reasonable number of replicates. The huge influx of juveniles, most of which do not make it, is probably biologically realistic (@mvallejo6 ?), and 70 per cent of adults seems like a reasonable starting point to me. Turning ploidy off at the start is probably also a good idea.

No worries about overwriting on the same rda file and losing stuff -- happens to me all the time. In the past, I've gotten around this by saving the process at the end of the file name. You can call the process number in R with Sys.getpid(), then paste it between the data output name and the .rda extension, then merge the RDA files later somehow.

rosemckeon commented 5 years ago

OK, that shouldn't have meant that the final saved data was corrupted at all. but it may have been slowing things down while two processes were constantly trying to update the same file. I'm combining our strategies now so that I have lots of clones of the repo but the simulations that are running within them are saving different file names so I can push form all of them without conflicts.

I'll take a loot at that function and come up with a better fix later! Thanks :)

rosemckeon commented 5 years ago

I'm running some now with more seeds at the start so it gets to the stable point faster, and extending the generations to 200 @bradduthie

bradduthie commented 5 years ago

@rozeykex Ah! See what you mean. Okay, that sounds good!

rosemckeon commented 5 years ago

@bradduthie I went a bit crazy over the weekend waiting for these simulations...

It seemed like lots of pointless waiting when all the seeds were just being killed and there were problems with trying to speed things up by predetermining germination (I lost actual seed output numbers and couldn't discern anything about ploidy arising in the missing seed output AND it was going to mess up anything I did to knock out germinations due to adult presence later).

SO, I changed a bunch of things:

Seeds no longer have genomes. These are filled after germination (I realised I have all the possible parents to fill genomes with the main generation data we store).
Seeds no longer germinate if adults are present.
Disturbance is properly random and affects the whole landscape (I realised we can compare sets of populations). Instead of being applied after survival, it reduces the survival chances so mortality only happens once.
I fixed an error with growth rates and ploidy benefit not being passed to the calculation function.
The data output is all rearranged again, as now we have seeds that have a different data structure, so I've separated the output by life-stage. And I've separated seeds that survive because of a seed bank from seed output due to reproduction - then F can still be determined if there is dormancy turned on.

It's been worth it because the model runs much faster (simulation below took 1.5 hrs). BUT now I've lost that nice null it looked like we had. I wasn't expecting inhibiting the germination due to adults to really change that much, but it did.

Best looking plots I have at the moment are looking like this (and I've been running sims all weekend).

Fiddling with variables to try and get the adults nearer K. Is that what we need? Or is it OK if it ticks along near the baseline?

bradduthie commented 5 years ago

Hi @rozeykex -- thanks for the update! Some thoughts on each of your five bullets:

I think this is a good idea, assuming that there is no way to have a seed whose genome cannot be determined due to parent death (i.e., that parents necessarily are alive at the latest time a seed can germinate).
I think this is good -- if we're modelling a landscape cell as being the size occupied by an adult, then I think that it is reasonable to assume that seeds landing on the cell are doomed (could be before or after germination -- doesn't matter so much; the point is that they can't become adults on that cell).
This could work, though it might make it more difficult to investigate patterns of ploidy on a landscape (e.g., is polyploidy spatially correlated with distance to disturbance?). I do think that increasing density-independent mortality is fine.
Nice!
Sounds good!

It's great that the model runs much faster now. This will help out a lot! But I'm confused by what you mean when you say that the nice null has been lost. The plots that you link to look good to me, in that the simulation has hit a point at which there don't appear to be any real long-term trends. I assume that the grey line is the technical K (1600 on a 40 by 40 landscape) for adults? If so, then no worries that the population hasn't hit K. The actual population size is also going to be decreased by density dependent and density independent sources of mortality. And effects of density dependent mortality (competition) on population growth will increase as the population size also increases, so it might well be that the adult population size is lower than K because it hits zero expected population growth due to competition at pop_size < K.

You can play around to see if the above makes sense by increasing the population growth rate (e.g., N_ovules), or dispersal distance, or decreasing density-independent adult mortality. Any of these should make the adult population size go up.

rosemckeon commented 5 years ago

@bradduthie phew! At first I just got lots of extinction so I was really hoping you'd think this looked ok for a null plot. I was a bit concerned that I can't seem to get the adults to rise to 1600 (yes, that is the grey line) but what you've said above makes sense.

I've run a bunch of simulations that gradually increase ovules, and fertilisation probabilities; they affect the adult population a little but the overall trend is very much the same. I don't get any continued growth. So I've turned ovules back down to a more efficient number.

I don't have a parameter for seed dispersal distance. But I have tried a couple of different runs changing the density-independent adult mortality. And also, the threshold size for transitioning to adulthood.

rosemckeon commented 5 years ago

Also, pretty sure I've thought through the genome filling ok.

Every generation the adults that have managed to reproduce are stored.
So any time a seed manages to germinate that dataframe of past reproductive adults can be checked for genomes to sample.

It won't really matter at the moment, but I wanted to make sure a seedbank could be turned on later if required.

rosemckeon commented 5 years ago

hey @bradduthie. Here's a - hopefully final - set of null tests with the density-independent mortality of adults varied in replicates of 4: 73 had survival very low at 0.2 (double that of seedlings) while 74 had it much higher all the way up at 0.9. In the plot, I showed you yesterday it was 0.5.

There's definitely a difference going on. I guess the question now is where do I want that figure to be for the actual null repeats? Do we want more/less new adults reproducing each year? Am I right in thinking that will affect the speed of evolution; if mostly new adults contribute to the seedoutput each year then selection will act faster?

bradduthie commented 5 years ago

Sounds like the genome filling works then @rozeykex -- excellent!

I'm surprised that increasing ovule and fertilisation probability does not affect the adult population more. I'm wondering if this is mainly due to seed dispersal then? Are individuals just not spreading across the landscape quickly enough, and limiting the population size due to high competition over a small number of cells? You might be able to test this by adding a line of code that randomises the location of a seed after it is made.

You don't need the population to hit the theoretical K, certainly, but I think that you'll want more of the landscape to be filled (maybe half of the cells). You're correct in thinking that the number of adults contributing to the next generation will affect the rate of (adaptive) evolution, since selection is stronger at larger effective population sizes. When population size is low, allele frequencies will be more affected by drift.

rosemckeon commented 5 years ago

I think I can quite quickly add in a parameter for dispersal range too. I'll do that @bradduthie and see if I can get the population size up more. You'd think it would still spread steadily all around the edges of the occupied space though...

We're nowhere near half the cells at the moment. mean adult population size is 100 give or take.

rosemckeon commented 5 years ago

@bradduthie When are you back in the UK?

After checking over the last possible problem with pollen range this morning I'm finally running my null sims. Doing 1 through 30 over 3 processes in parallel (I've upped my server specs for a week or so so I can get all this data together!). 30 will be enough of each just to get some results for the workshop right? Maybe even less?

In the null system:

No one can self (inbreeding cost is also 0 but not relevant as no-one can self anyway).

Pollen range and seed dispersal are maximum to the grid size.

No genome duplication occurs.

There is no dormancy.

Adult density-independent survival is 0.5

There is no disturbance.

Ploidy growth benefit is 0 (which I've realised actually effects diploids too. When this is 0 they get the rate based on a mean allele value taken from 2 alleles, but if this is at 1 then they get double).

My plan is to set some of the next tests going on my other computer too. SO....

Do I need to run a set with all the settings the same as NULL but with disturbance varied before I start fiddling with the costs and benefits of being polyploid?
Do I need to check each parameter that controls a cost or a benefit one at a time? And do I do this sequentially? So they kind of pile up?
Or, more simply, do I add both costs and benefits altogether, then run sets of replicates with and without disturbance on that whole scenario?

Turn on genome duplication and set both costs and benefits based on literature/theory.
Run replicates with disturbance_freq still at 0, then again with this value decreased incrementally from 200: say 100, 50, 25, 10, 5. (it roughly represents the number of generations likely to occur between disturbance events).

Then afterwards, the costs and benefits could be varied too to see which ones are most instrumental? Or, we could change the pollen and seed dispersal ranges?

Am I on the right track?

bradduthie commented 5 years ago

Hi @rozeykex -- I'll be back in the UK on Monday.

The 1 through 30 over three processes sound good. Let me know if I can start some on Monday that would make it in time. I can try overnight here on my laptop if you specify the arguments, but I'm not sure if it will work (laptop is old and sometimes overheats and reboots).

The null model looks good, as far as I can tell.

I would try to only vary one thing at a time, if you can. The reason is that if you vary two things (e.g., disturbance and polyploidy costs/benefits), then any pattern that differs from the null will be hard to interpret (e.g., was the pattern caused more by the disturbance, or more by the polyploidy benefits?).
Ideally check each parameter sequentially, but if you only have a limited amount of time, then I would zero in on one parameter of interest and pick a handful of values that are likely to span across the range of possible conditions (e.g., from zero polyploidy benefit, up to one that is unrealistically high).

Setting something to values from the literature might even be better, if you can do it. Then you could, for example, set costs and benefits equal to that in the literature, then contrast disturbance versus no disturbance (in other words, there is less need to try a lot of costs and benefits if your goal is to see how disturbance affects a realistic system of costs/benefits).
I think that the disturbance frequencies you chose sound good.

You're on the right track! It's now just a matter of figuring out what you want to learn from the model -- what to hold constant (and at what parameter values), and what to vary to compare.

rosemckeon commented 5 years ago

@bradduthie Some help running the simulations when you're back will be great, don't worry while you're away! They seem to be taking maybe 26 hrs to get to 200 gens. I think I will have to do less replicates so I have at least something for the workshop, then carry on running them to make whatever we see more solid.

I will have a go at coming up with parameter values from the literature to replicate realistic costs/benefits. We'll see. Just to make sure I've understood you, failing that, I...

Incrementally increase one benefit/cost at a time.
Apply a range of disturbances to each variant of every single cost and benefit.

So, results would perhaps show that disturbance is important when a particular cost is at a specific value or a particular benefit at a specific value.

Then in a system of many different known costs and benefits, I should be able to predict how much disturbance would impact it. Or, perhaps, even in some part the other way around? So if I knew how much disturbance there was and how many polyploids could I make educated guesses about the cost-benefit system in action?

Are you free to meet up on Monday?

rosemckeon commented 5 years ago

Benefits:

[x] ploidy_growth_benefit above 0 will increase growth rates and lead to polyploids being more successful in the competition for landscape cells.
[x] inbreeding_cost above 0 will give polyploids better winter survival chances over their diploid counterparts.
[x] selfing_polyploid_prob above fertilisation_prob or above selfing_diploid_prob, especially when pollen_range is reduced. Will increase fecundity, especially in the face of limited mate-choice.

Costs:

[x] uneven_matching_prob below fertilisation_prob will reduce the fecundity of polyploids, especially when diploid density is high, and when pollen_range is high.
[x] triploid_mum_prob below fertilisaion_prob will reduce the fecundity of triploids so less diploid gametes are released into the system.

Then I guess also check those costs/benefits in the other directions too for completeness. So, a less realistic cost would be selfing_polyploid_prob below fertilisation_prob or below selfing_diploid_prob, especially when pollen_range is reduced. inbreeding_cost and ploidy_growth_benefit can't be configured as costs, but uneven_matching_prob and triploid_mum_prob can be configured as benefits.

rosemckeon commented 5 years ago

@bradduthie I've reduced the landscape grid in above simulation scripts to 30. I hope this is OK. Going off what you said when it was 40, that we needed at least half that landscape filled with adults (so N ~ 800) to prevent too much drift. I'll have 900 adults reproducing with grid size = 30, but should dramatically increase the number of tests I can run as filesizes will be loads smaller. So far running these it looks like it's halving the simulation time - so more like 13 hrs.

bradduthie commented 5 years ago

@rozeykex I think that this should work!

rosemckeon commented 5 years ago

Hi @bradduthie, welcome back. I'm starting to get some data on the sims that test everything individually. I've been away for the weekend so getting my head properly in the literature for the realistic system now too. Please feel free to run any of the scripts in the costs/benefits folder in /simulations. If you set runs > 5 then the files we push shouldn't overlap.

I think costs/triploid_mum_prob/sterility-200.R has the most tests in and I might be running a risk of not completing all of them. It might be even better if you run tests in reverse order? - let me know if you want me to set up some scripts like that.

Also, FYI I've bitten a bullet and updated my github username to something perhaps more professional!

bradduthie commented 5 years ago

Hi @rosemckeon -- good to be back! Okay, I've pulled the branch 'rose' and will start running some simulations. I'll set runs == 10 to start with so there is no overlap. I'll happily run whatever you want in parallel on my office and home machines, if you can just send a quick script to source? I should have at least 5 cores open on my office computer, and one on my home computer.

rosemckeon commented 5 years ago

@bradduthie great! I'll push some scripts for you now, with the sims all running back to front, and tag you in the commit. I'll set up the run numbering too. I've found when the processes get killed early it's usually due to lack of memory rather than processing power. This whole thing has got me looking at 32GB RAM upgrades....

bradduthie commented 5 years ago

@rosemckeon Whoa! That is quite memory heavy. Too late for this now I suspect, but if you build on these simulations for a future project, it might make sense to get rid of unused memory during the simulation (e.g., by printing prior generations to a file, then removing them from the R environment). The gc() function might be able to free up a bit by garbage collection.

I've got 16 GB to work with here. I'll start with one run to see how things look, then start trying everything in parallel.

rosemckeon commented 5 years ago

@bradduthie I was wondering if there was something I could do differently like splitting the files up into smaller chunks! I don't really want to risk breaking things by attempting last minute speed improvements but it's tempting... I'll definitely take a look at gc() if I get to expand on this for another project.

rosemckeon commented 5 years ago

@bradduthie I've got 16GB on the server and it manages 4 in parallel most of the time, with occasional process deaths.

bradduthie commented 5 years ago

@rosemckeon Yeah, splitting files into chunks is often a good solution. I typically try to avoid storing information (e.g., old individuals) that is no longer needed in the working memory. I agree that it's probably not worth the work and risk at this point, so no worries. We should be good with some simulations on both your server and my desktop(s)!

The gc() is actually easy to work with. It doesn't require any additional arguments, and can be expressed wherever you want to free up some unused memory (e.g., at the end of each generation. It's unlikely to free up a ton of memory, but it might make things just a bit more manageable for not a whole lot of time loss.

rosemckeon commented 5 years ago

@bradduthie So I can just do gc() whenever I clear the current generation object? the reset parameter doesn't need to be TRUE?

bradduthie commented 5 years ago

@rosemckeon Yeah, I think that this should work. I think that R does this automatically to some extent, but it might help a bit.

rosemckeon commented 5 years ago

@bradduthie done :) Can you let me know if it helps enough to interrupt my simulations and start them again? I don't really have the free memory at the moment to test it.

bradduthie commented 5 years ago

@rosemckeon Will do! Just pulled and am now running 'inbreeding-200-brad.R'. Will give it a few minutes, then run the other two scripts.

bradduthie commented 5 years ago

@rosemckeon Now running growth-200-brad.R and selfing-200-brad.R. The simulation inbreeding-200-brad.R is currently on generation 66.

rosemckeon commented 5 years ago

These are happily ticking along now and managed in another repo.

rosemckeon / ploidy

Run the simulations #61