Closed juliendiot42 closed 4 months ago
Great!
@timflutre
This PR is rather big sorry.
I don't expect a full code review, however, I would appreciate if you could check mainly the part about the phenotype simulation.
Especially the general documentation about the generative model ("Quantitative traits simulation (T1 and T2)"). Here I changed a bit the notation in regards of what is in the "setup report": $\sigmay$ instead of $\sigma\alpha$ as I found it more straightforward to understand, Y for Year.
Also I would have 2 questions
First do you think the data generation for the box plot are correct ?
The data of the box-plots on the right are generated as follow: (cf. src/fun/module_gameInit_params.R L454):
rnorm(n_inds, 0, sqrt(sig_a))
rnorm(1, 0, sqrt(sig_y))
rnorm(n_inds, 0, sqrt(sig))
With the default parameters we have (for trait 1):
However, in the setup reports section "7.7 Simulate traits 1 and 2" it shows:
## var(G.A[,1]) = 181
I know that the "observed" variance of the genetic effects will depend on the actual values of the simulated genotypes and may not be exactly the same as $\sigma_a^2 = 100$ but wouldn't this difference be too big ? (+80%)
Also, to give an visual idea of the effect of the "pleiotropy parameters" I wanted to show a simulation of this plot we have in the setup report:
I tried to find a way to calculate $Cov(G{Trait 1}; G{Trait 2})$ (that I could use to generate jointly the Gs with a multi-normal distribution) from only the information we have but without success... Do you know if there is a formula that exists that can give this value based on the proportion of pleiotropy and the pleiotropy correlation ?
What I did (which is wrong) was:
$$Cov(G_1; G_2) = Cov( {\sumi}^{n{snp}} g_i {\beta_i}^{t1}; {\sumj}^{n{snp}} g_j {\beta_j}^{t2} )$$
I don't consider the $g$ as random variables (I am thinking that I want this covariance for the initial population so their genetic value are already fixed... I may be wrong here)
$$Cov(G_1; G_2) = {\sumi}^{n{snp}}{\sumj}^{n{snp}} g_i g_j Cov({\beta_i}^{t1}; {\beta_j}^{t2}) $$
But $Cov({\beta_i}^{t1}; {\beta_j}^{t2}) = 0$ if $i \neq j$
So: $$Cov(G_1; G_2) = {\sumi}^{n{snp}} g_i^2 Cov({\beta_i}^{t1}; {\beta_i}^{t2})$$
$Cov({\beta_i}^{t1}; {\betai}^{t2}) = Cov{pleio}$ for $Prop_{pleio}$ % of the markers else it is 0.
Since I don't know the $g_i$ actual values, I was thinking to use $g_i = 1$ as I guess it would be its average value... (I am not very confident on that).
So finally:
$$Cov(G_1; G2) = n{snp} Prop{pleio} Cov{pleio}$$
$$Cor(G_1; G2) = \frac{n{snp} Prop{pleio} Cov{pleio}}{{\sigma_a}^{t1} {\sigma_a}^{t2}} $$
But when I test with the values in the setup report I found a correlation < -1
Also the (co)variances of the $\beta$ depends on the allele frequencies ($\sigma_\beta^2 = \frac{\sigma_a^2}{2 \sum_p f_p (1 - f_p)}$) that I don't know when the user setup the game... Do you know if I can replace the frequencies (or related) with particular "expected values" ?
You are probably quite busy and I may ask too much, but I would appreciate if I could get some directions, maybe what could help me would be a bit more information about how we can show that:
It can be shown that:
$$\mathbb{E}[X X^T] = A \left( 2 \sum_p f_p (1 - f_p) \right) + \boldsymbol{1} \boldsymbol{1}^T 4 \sum_p f_p^2$$
In section "7.4.3 Genotypic models" of the setup report.
I could ask Jacques too.
Thank you very much.
Ok, I'll try to look at it
"I found it more straightforward to understand, Y for Year." -> But then you are using $y$ for yield and $y$ for year...
Q1: "First do you think the data generation for the box plot are correct ?"
-> Shouldn't T1 and T2 be negatively correlated? In my initial code, I used my function simulSnpEffectsTraits12
which takes cor.pleio
as argument.
-> Moreover, when you multiply the SNP genotypes with the SNP allelic effects, did you center the SNP genotypes as I did here?
"I know that the "observed" variance of the genetic effects will depend on the actual values of the simulated genotypes and may not be exactly the same as but wouldn't this difference be too big ? (+80%)"
-> Did you check it, say, 100 times? Is this positive bias systematic? Did you center the SNP genotypes (see above)?
Moreover, how where the SNP genotypes generated? Should they be at HWE or did you put some pop structure? Compute the VanRaden estimator of the additive genetic relatinships. Under HWE, the diagonal should be centered on 1 and the off-diagonal on 0. But if you simulated with a bit of structure, then the mean of the diagonal may be different than 1 and the mean of the off-diagonal may be different than 0. So try with another estimator, namely the NOIA, because it does not require HWE (I added it in my function estimGenRel
). In another project, I simulated SNPs with the coalescent allowing for structure (low migration between pops) and I had to use the NOIA estimator to avoid small issues like that.
I will answer your 2nd question later.
But then you are using for yield and for year...
I rather speak about "trait 1" and "trait 2" but indeed this could be confusing too.
In my initial code, I used my function
simulSnpEffectsTraits12
Moreover, when you multiply the SNP genotypes with the SNP allelic effects, did you center the SNP genotypes as I did here?
My idea here is to be able to show how the phenotypes will look like (or what the user could expect) dynamically when the user "play" with the parameters, so I need something that can be executed quickly (~0.1s, <<0.5s). Therefore I generate the genetic values directly rather than simulating some genotypes, some SNP allelic effects and calculate the genetic values (it would be too long, especially the genotype simulation, it takes ~20s in the setup script).
Shouldn't T1 and T2 be negatively correlated? Currently (ie. to show the box plots for each trait separately), it is not necessary as the correlation will not be visible in those plots. However, it would be better indeed! That's the purpose of my second question, I would like to be able to generate the correlated genetic effects but from a multi-normal distribution directly (eg. with
MASS::mvrnorm
) and then show a scatter plots of the genetics values for each traits.Did you check it, say, 100 times? ...
I just noticed that from the the "game setup report" (that takes about 1min 30s to execute). I didn't write any code that could check that automatically. I can do it but it may not be a priority over other improvement of the game. I will at least check it manually few times.
For the other questions, I didn't touched to the game initialisation setup script so it is as you have designed it (a long time ago ^^). However, I use rutilstimflutre
at commit 42660be.
Ok, I understand better now. it's just to help visualize. So forget what I wrote, and what you did ("I generate the genetic values directly") is fine by me.
About Q2, if again you just want to show some pedagogical plot, then the easiest would simply to draw from MASS::mvrnorm with a covariance matrix computed from cor.pleio. I think it's fine since it is only an example and not the real data.
simply to draw from MASS::mvrnorm with a covariance matrix computed from cor.pleio. I think it's fine since it is only an example and not the real data.
We just have to figure out how to compute this covariance matrix from cor.pleio
and prop.pleio
because this will have an effect too.
I'm thinking out loud, here.
$Y = X_0 B_0 + X_p B_p + E$
For a given individual $i \in {1,...,N}$, we want $cov(Y{i1}, Y{i2})$ where $Y_{it}$ is the phenotype of individual $i$ for trait $t$.
Let's ignore the residual error, thus: $cov(Y{i1}, Y{i2}) = cov(X{0,i} B{0,1} + X{p,i} B{p,1}, X{0,i} B{0,2} + X{p,i} B{p,2})$
We then have the formula of a covariance for a linear combination and we also use the facts that only the effects of the $Pp$ SNPs have a non-zero covariance between traits: $cov(Y{i1}, Y{i2}) = cov(X{p,i} B{p,1}, X{p,i} B_{p,2})$
We also assume that, for two different pleiotropic SNPs, there is no covar btw their allele effects, thus: $cov(Y{i1}, Y{i2}) = \sum_{j=1}^{Pp} cov(X{p,ij} B{p,j1}, X{p,ij} B_{p,j2})$
Now we use the fact that the X's are considered fixed: $cov(Y{i1}, Y{i2}) = \sum_{j=1}^{Pp} X{p,ij}^2 cov(B{p,j1}, B{p,j2})$
We can compute $P_p$ using cor.p
and we choose the value of the covar btw allele effects of a given SNP j on both traits, so we can draw $P_p$ such values.
Oh wait, we should also draw $P0$ allele effects with no covar btw traits and not forget to add them to $cov(Y{i1}, Y{i2})$ because their cov will be zero only on average, hence they will still contribute a little bit to $cov(Y{i1}, Y_{i2})$, notably to dilute the influence of the cov of the $P_p$ SNPs. The smaller cor.p
, the larger the dilution.
The only remaining difficulty here I think is to quickly generate the values for $X{p,ij}^2$. Start by looking at their distribution on the coalescent-simulated data (add a plot as a the comment), and maybe you can approximate it quickly. That is, maybe you can find a quick way to draw $X{p,ij}$ values $P_p$ times from whatever distribution (maybe using a Beta?) so that the vector of $X_{p,i}^2$ will look like the values generated by the coalescent.
Does it help?
p.s.: the formula in section 7.4.3 comes from Habier et al (2007); you can also look at the course by A. Legarra
Does it help?
Yes thank you very much !
Start by looking at their distribution on the coalescent-simulated data (add a plot as a the comment)
For the population generated with by coalescence 1000 individuals and 36414 markers we have:
For the centered values we have this distribution:
This is quite stable as we have a lot of individuals and markers.
And for the initial population given to the breeders (1000 hapo-diploidisations from the above mentioned population) we have:
For the centered values we have mostly a similar distribution (visually).
Also, for what I said:
However, in the setup reports section "7.7 Simulate traits 1 and 2" it shows:
## var(G.A[,1]) = 181
And your reply
Did you check it
I tested a bit and found that for the population generated with coalescence we have indeed their genetic variance approximately equal the the theoretical values. But for the initial population given to breeders ("Coll"), which come from haplodiploidization of the "coalecence population", their genetic variance is approximately equal to 2 times the theoretical values.
From your comment, I have made an implementation in https://github.com/timflutre/PlantBreedGame/commit/c3f3bec8ed80850fc6e85aadc7d340783742fd62
However, since we would still have to generate the genotypes for such calculation, I opted for an approach where I generate the genotypes (this is the long calculation), the marker effects (this is rather fast) and calculate the genetic values from that (rather fast).
To quickly generate the genotype I first draw some alleles frequency (AF) with a beta distribution and then deduced the genotypes. But I could only do it for the "Coll" population. Since they are homozygous, (so their genotypes for each markers is only 0 or 2) given the number of individuals and the AF I can deduce the number of individuals with genotype 0 or 2. (When I write this message, I realise that maybe I could have done it for the previous population too by simply "create a big urn" with the 2 alleles in their correct proportion and let each individuals draw 2 of these alleles. I didn't thought of that when I was coding, but since the breeders will start with the "Coll" population I think it is ok to show how this "Coll" population will look like rather that their parent population).
For the parametrisation of the beta distribution, I looked on the AF distribution in the setup report. For the "Coll population" the estimated beta distribution was almost with parameters 0.5/0.5 so I used this parametrisation:
quick_afs_simul <- function(n_marker_sim = 15000, shape1 = 0.5, shape2 = 0.5, seed = 42) {
saved_seed <- .GlobalEnv$.Random.seed
set.seed(seed)
sim.afs <- rbeta(n_marker_sim, shape1, shape2)
.GlobalEnv$.Random.seed <- saved_seed
sim.afs
}
Then I simulate the genotypes with:
quick_geno_simul <- function(afs, n_inds = 1000, seed = 43) {
saved_seed <- .GlobalEnv$.Random.seed
set.seed(seed)
geno_centered <- sapply(afs, function(af) {
n <- round(n_inds * af)
sample(c(rep(2 - 2 * af, n), rep(-2 * af, n_inds - n)))
})
.GlobalEnv$.Random.seed <- saved_seed
geno_centered
}
This function is still a bit long to run (~3s on my hardware). The main problem is that I loop (with the sapply
) over each markers (any optimisation suggestion is welcome ^^). However this function is called only once, since it do not depend on any UI parameters. So the plots will takes ~3s to appear the first time. Imho, this is acceptable but I don't know how it will be on other hardware (especially cloud servers).
And the marker effects are simulated with:
quick_g0_simul <- function(T1_sig_a2,
T2_sig_a2,
afs,
prop_pleio,
cor_pleio,
geno,
seed = 44) {
if (!is.null(valid_variance(T1_sig_a2))
|| !is.null(valid_variance(T2_sig_a2))) {
return(NULL)
}
saved_seed <- .GlobalEnv$.Random.seed
set.seed(seed)
sigma.beta2 <- c(T1_sig_a2, T2_sig_a2) / (2 * sum(afs * (1 - afs)))
n_snp <- length(afs)
n_pleio <- round(n_snp * prop_pleio)
n_non_pleio <- n_snp - n_pleio
Sigma.beta.nopleio <- matrix(c(sigma.beta2[1], 0,
0, sigma.beta2[2]),
nrow = 2, ncol = 2)
cov.pleio <- cor_pleio * sqrt(sigma.beta2[1] * sigma.beta2[2])
Sigma.beta.pleio <- matrix(c(sigma.beta2[1], cov.pleio,
cov.pleio, sigma.beta2[2]),
nrow = 2, ncol = 2)
# I don't use MASS::mvrnorm and use a "manual approach" here so that the
# generated values remain the same more or less the covariance structure
# whatever the provided parameters. This will help the visualisation as
# the cloud point will be deformed.
base_random_values <- matrix(rnorm(n_snp * 2), 2)
if (n_non_pleio != 0) {
R_nopleio <- chol(Sigma.beta.nopleio)
beta_nopleio <- t(R_nopleio) %*% base_random_values[, 1:n_non_pleio]
} else {
beta_nopleio <- c()
}
if (n_pleio != 0) {
R_pleio <- chol(Sigma.beta.pleio)
beta_pleio <- t(R_pleio) %*% base_random_values[, (n_non_pleio+1):n_snp]
} else {
beta_pleio <- c()
}
beta <- t(cbind(beta_nopleio, beta_pleio))
.GlobalEnv$.Random.seed <- saved_seed
geno %*% beta
}
(I don't use simulSnpEffectsTraits12
for that because this function make some tests and have some requirement that are not necessary in this context, and also because of my comment about MASS::mvrnorm
)
With this, I visually do not see much differences from the code in the setup.
Thank you for your help ! :bow:
"With this, I visually do not see much differences from the code in the setup." -> ok
I don't have the time now to check all this, but I read through what you did and it looks fine.
The most important is to use this shortcut only for visualization purposes. The actual SNP genotypes should still be generated with the sequential coalescent with recombination, as it also generates LD btw SNPs, which is important for genomic prediction.
This PR implements the ability to initialise the game with different parameters than the default ones.
The game initialisation script have a lot of parameters and not all of them have been implemented yet. This PR implements:
To have a rather concise UI, each group of parameters are encapsulated in collapsible sections that extend when clicked:
The documentation about the role of each parameters is done by providing some general guidelines at the beginning of each section (once expanded), and by clicking on the labels of each parameter, for example (for the RNG seed):
Moreover, some information related to the current set of parameters are shown on the right hand side in order to help the user in their choices (cf. the section showing the implementation of each group of parameters below).
In order to avoid crashes of the game initialisation script as much as possible, each parameter is validated in real time by the application. In case some parameters are invalid, they will appears in red along with a feedback message explaining what is wrong.
If any parameters is invalid, it will be impossible for the user to trigger the game initialisation (the button is automatically disabled)
Cost related parameters:
The right part show a summary of the provided cost in terms of "phenotyping plot" and in "Mendels" (the game's money).
Phenotyping simulation parameters:
The right part show the values of different variances of the generative model calculated with the current set of inputs. Along with 2 box plots (one for each trait) that show what the phenotype of the initial population can look like. (similar to one of the plot in section "7.7 Simulate traits 1 and 2" of the setup report).
These values are quickly generated by the application following the current parameters but do not correspond to what will be actually generated by the initialisation script (as mentioned below the plots).
One important things to keep in mind is the game initialisation script is rather complex and it is possible to have a set of apparently "valid" parameters (eg. heritability is between 0 and 1 etc...) that leads to invalid "simulations parameters" (ie. other values calculated by the initialisation script and used for the simulation, for example the variances of the generative model in this case).
Here, I check that all those variances are acceptable (ie. stictly positive). Those that are not appears in red and the initialisation can not be triggered.