"Should predict(..) corresponds to Eq. (2) and train_disc(...)?" and a second concern regarding handling "missing genes"

rvinas / GTEx-imputation

Gene Expression Imputation with Generative Adversarial Imputation Nets

MIT License

11 stars 3 forks source link

"Should predict(..) corresponds to Eq. (2) and train_disc(...)?" and a second concern regarding handling "missing genes" #5

Closed yezhengli-Mr9 closed 4 years ago

yezhengli-Mr9 commented 4 years ago

Should predict(...) be consistent with hat x in Eq. (2) of GAIN and x_gen_ = x_ + x_gen * (1 - mask) in train_disc(...)?

def predict(x, cc, nc, mask, gen, z=None, training=False):
    nb_samples = cc.shape[0]
    if z is None:
        z_dim = gen.input[0].shape[-1]
        z = tf.random.normal([nb_samples, z_dim])
    x_ = x * mask
    z_ = z * (1 - mask)
    out = gen([x_, z_, cc, nc, mask], training=training)
++. out =x_+out*(1 - mask)
    if not training:
        return out.numpy()
    return out

It is at least used in score_fn(...) although you may not actually use it for prediction/ inference for test/validation setx_test.

yezhengli-Mr9 commented 4 years ago

OK, out =x_+out*(1 - mask) will help me to present but I think I understand why you stop at Eq. (1) (I mean without that line).

By the way, to be honest, I have question on your setting np.nan zero (and subsequently the way you generate mask) --- it might seems missing np.nan are treated with ground truth zero.

yezhengli-Mr9 commented 4 years ago

While (1) not replacing nan by zeros, I mean: (a) input contains nan (b) enforce nan entry in mask to be zero; and

my MSE/R2 only take into account entries with gene expression value (that is, not `nan`) excluding ones with `nan`as input;

This seems more reasonable than your mask not only in def get_test_extended(...) in gtex_gain_analysis.ipynb but also in gtex_gain.py.

(2) train/test split is slightly different from yours: I have 75%838 individuals to train and 25%838 individuals to test; (3) after one-hour running one loop in for epoch in tqdm.tqdm(range(epochs)) and then if patience == 0: break. (4) NEITHER do I have the line in predict(...) mentioned above.

I can obtain (although not as good as what you showed in Table 1, but I think might still be good) boxplot_r2_WriteCSVdarwin_save_inference_ticker_020 copy and a zoom-in (with ax.set_ylim([0, 1]) ) is

rvinas commented 4 years ago

Should predict(...) be consistent with hat x in Eq. (2) of GAIN and x_gen_ = x_ + x_gen * (1 - mask) in train_disc(...)?
def predict(x, cc, nc, mask, gen, z=None, training=False):
    nb_samples = cc.shape[0]
    if z is None:
        z_dim = gen.input[0].shape[-1]
        z = tf.random.normal([nb_samples, z_dim])
    x_ = x * mask
    z_ = z * (1 - mask)
    out = gen([x_, z_, cc, nc, mask], training=training)
++. out =x_+out*(1 - mask)
    if not training:
        return out.numpy()
    return out
It is at least used in score_fn(...) although you may not actually use it for prediction/ inference for test/validation setx_test.

The line you highlight is perfectly consistent with Eq. (2) of the manuscript - the generator imputes the missing components (that's the out*(1 - mask) part) while the observed components are kept (that's the x * mask part). In other words, out corresponds to x_hat in the manuscript.

rvinas commented 4 years ago

OK, out =x_+out*(1 - mask) will help me to present but I think I understand why you stop at Eq. (1) (I mean without that line).

By the way, to be honest, I have question on your setting np.nan zero (and subsequently the way you generate mask) --- it might seems missing np.nan are treated with ground truth zero.

Could you please point me to the exact line of the code?

yezhengli-Mr9 commented 4 years ago

OK, out =x_+out*(1 - mask) will help me to present but I think I understand why you stop at Eq. (1) (I mean without that line). By the way, to be honest, I have question on your setting np.nan zero (and subsequently the way you generate mask) --- it might seems missing np.nan are treated with ground truth zero.

Could you please point me to the exact line of the code?

Thanks for your response: (1) OK, I think that exact line is a relatively small issue now. (well, I actually think your def predict(...) does not include that exact line but def predict(...) stopping at Eq. (1) still somehow make sense to me -- a relatively small issue now).

(2) now I doubt more on your "missing components are set to zero" mentioned in this issue and the way of generating the mask in def get_mask_hint_b(...) in gtex_gain.py: in other word, your Fig. 2 is good but perhaps, by large percentage, due to effect of imputation of zeros: imputation of zeros to be good (that is, missing components that do not even have ground-truth gene expressions).

"by large percentage", I mean the boxplots I show above has removed the effect of imputation of zeros (that is, missing components that do not even have ground truth gene expressions) -- I remove it not only in final boxplots, but adjusting def get_mask_hint_b(...) for training procedure as well.

rvinas commented 4 years ago

(1) I do not understand what is the issue in that line. (2) During training, missing components are set to zero here and here.

What is not clear about the way we generate the mask and hint vectors in get_mask_hint_b?

To obtain the R2 scores, in compute_r2_score from this notebook, note that we only take into account the missing components, that is, the components for which the mask is 0. In that function, by setting mask_r[mask_r == 1] = np.nan, we are leaving out the observed components.

I am not sure I fully understand what you mean by "the effect of imputation of zeros". Just to clarify, we do not impute the unexpressed genes (e.g. those with gene expression 0), but rather the genes whose mask component is 0.

yezhengli-Mr9 commented 4 years ago

get_mask_hint_b

(1) I do not understand what is the issue in that line. (2) During training, missing components are set to zero here and here.

What is not clear about the way we generate the mask and hint vectors in get_mask_hint_b?

To obtain the R2 scores, in compute_r2_score from this notebook, note that we only take into account the missing components, that is, the components for which the mask is 0. In that function, by setting mask_r[mask_r == 1] = np.nan, we are leaving out the observed components.

I am not sure I fully understand what you mean by "the effect of imputation of zeros". Just to clarify, we do not impute the unexpressed genes (e.g. those with gene expression 0), but rather the genes whose mask component is 0.

Sorry for my confusing description of my question -- but I think I understand your response: I know mask_r[mask_r == 1] = np.nan in compute_r2_score of the notebook; I know you are all the way "imputing the genes whose mask component is 0".

For (2), the part I doubt is here: (a) I know you are all the way imputing the genes whose mask component is 0, but I think you will impute (some of) unexpressed genes as well -- and then in loss functions supervised_loss(...), generator_loss(...), score_fn(...), etc. , you seems treating "unexpressed genes (e.g. those with gene expression 0)" as genes with ground truth gene expression zero, NOT unexpressed genes.

(b) I think your mask has nothing to do with "unexpressed genes". However, I think you should remove effect of "unexpressed genes" in any loss function mentioned above, any evaluation procedure (compute_r2_score(...) and compute_mse_score(...) in gtex_gain_analysis.ipynb)

yezhengli-Mr9 commented 4 years ago

(1) I do not understand what is the issue in that line. (2) During training, missing components are set to zero here and here.

What is not clear about the way we generate the mask and hint vectors in get_mask_hint_b?

To obtain the R2 scores, in compute_r2_score from this notebook, note that we only take into account the missing components, that is, the components for which the mask is 0. In that function, by setting mask_r[mask_r == 1] = np.nan, we are leaving out the observed components.

I am not sure I fully understand what you mean by "the effect of imputation of zeros". Just to clarify, we do not impute the unexpressed genes (e.g. those with gene expression 0), but rather the genes whose mask component is 0.

To better illustrate my doubt, it worth mentioning that I prefer generating mask as following: (step 1) in the input load_gtex, unexpressed gene expressions are set np.nan rather than zero; (step 2) every time after calling get_mask_hint_b(...), I correct the mask. For example, in line 255

            mask, hint, b = get_mask_hint_b(bs, nb_genes, b_low=b_low, b_high=b_high)
++          mask_nonnan = x==x
++          x[~mask_nonnan] = .0
++          mask = mask* mask_nonnan
            z = tf.random.normal([bs, z_dim])

            disc_loss = train_disc(x, z, cc, nc, mask, hint, b, gen, disc, disc_opt)

To conclude, I think when imputing genes whose mask component is 0, (1a) the training procedure should avoid imputing ANY unexpressed genes; (2a) for inference/testing procedure, you can definitely provide imputation (for ANY genes including ANY unexpressed genes); however, evaluation procedure (of inference/testing result) should avoid ANY unexpressed genes (compute_r2_score(...) and compute_mse_score(...) in gtex_gain_analysis.ipynb.

Correspondingly, my this correction of mask plays roles in (1b) loss functions supervised_loss(...), generator_loss(...), def score_fn(...), etc. (2b) evaluation procedure (compute_r2_score(...) and compute_mse_score(...) in gtex_gain_analysis.ipynb.

yezhengli-Mr9 commented 4 years ago

(1) I do not understand what is the issue in that line. (2) During training, missing components are set to zero here and here. What is not clear about the way we generate the mask and hint vectors in get_mask_hint_b? To obtain the R2 scores, in compute_r2_score from this notebook, note that we only take into account the missing components, that is, the components for which the mask is 0. In that function, by setting mask_r[mask_r == 1] = np.nan, we are leaving out the observed components. I am not sure I fully understand what you mean by "the effect of imputation of zeros". Just to clarify, we do not impute the unexpressed genes (e.g. those with gene expression 0), but rather the genes whose mask component is 0.

To better illustrate my doubt, it worth mentioning that I prefer generating mask as following: (step 1) in the input load_gtex, unexpressed gene expressions are set np.nan rather than zero; (step 2) every time after calling get_mask_hint_b(...), I correct the mask. For example, in line 255
            mask, hint, b = get_mask_hint_b(bs, nb_genes, b_low=b_low, b_high=b_high)
++          mask_nonnan = x==x
++          x[~mask_nonnan] = .0
++          mask = mask* mask_nonnan
            z = tf.random.normal([bs, z_dim])

            disc_loss = train_disc(x, z, cc, nc, mask, hint, b, gen, disc, disc_opt)
To conclude, I think when imputing genes whose mask component is 0, (1a) the training procedure should avoid imputing ANY unexpressed genes; (2a) for inference/ testing procedure, you can provide imputation (for ANY genes including ANY unexpressed genes), but evaluation procedure (of testing result) should avoid ANY unexpressed genes (compute_r2_score(...) and compute_mse_score(...) in gtex_gain_analysis.ipynb -- correspondingly, my this correction of mask plays roles in (1b) loss functions supervised_loss(...), generator_loss(...), def score_fn(...), etc. (2b) evaluation procedure (compute_r2_score(...) and compute_mse_score(...) in gtex_gain_analysis.ipynb.

An extreme case of not making such correction is that: your boxplots in Fig. 2 are good, but they are due to (I am assuming "extreme case" to illustrate my point of view)

imputation of zeros for (large amount of) unexpressed genes to be good -- since you treat unexpressed genes actually have ground truth zero, not "unexpressed" -- they play roles In both loss functions and evaluation procedures.
NOT imputation of observed genes (i.e. with ground truth but) with mask component 0, to be good. -- this should be the only focus of attention during evaluation procedures.

rvinas commented 4 years ago

I think I understand your concern. Let me provide some comments:

When I talked about "unexpressed genes", I meant genes with expression 0. This is different from, say, "missing genes", e.g. genes whose expression is unobserved.
If I understand correctly, your point is that we should assume that the training data has missing values, e.g. for certain samples, the expression of certain genes is not measured. I believe this is what the original GAIN implementation does and it is a fair point.
However, our approach is different. Instead of evaluating the performance on the "missing components of the train set" (this is done in the original GAIN), we evaluate the results on an unobserved test set. In particular, we use the train set without missing values to train the model and we dynamically sample random masks at train time.

Our approach is advantageous in the sense of data augmentation, which mitigates the problem of having a scarce number of high-dimensional transcriptomics samples to train a deep model. In other words, our model gets to see a higher variety of samples at training time, e.g. several masks are applied to each training sample, effectively increasing the number of training samples. This is in contrast to the other approach where a single, static mask is applied to each training sample.

Then, we evaluate the performance on an unseen test set once the model is trained (in contrast to the other approach). By doing so, we aim to test the ability of our model to generalise to datasets with missing values. We hope that our model trained on GTEx is general enough to perform well on data collected in independent studies (e.g. typically studies where the number of samples is low), without having to train the model on those small datasets [*]. That said, the generalisation error is exactly what we report for TCGA, e.g. we used the model trained on GTEx and evaluated its performance on TCGA without any training or fine-tuning.

To sum up, we do not expect users to train the model from scratch on small, independent datasets with missing values (although this is possible with the approach that you mention), but rather use the model trained on GTEx to infer the missing values. I hope this clarifies your concern.

[*] This is why we decided to train the model on GTEx - the dataset is very comprehensive and it has samples from a broad variety of tissue types. It would be interesting to investigate whether the model benefits from fine-tuning on independent datasets, potentially using the approach of having static masks.

yezhengli-Mr9 commented 4 years ago

I think I understand your concern. Let me provide some comments:

When I talked about "unexpressed genes", I meant genes with expression 0. This is different from, say, "missing genes", e.g. genes whose expression is unobserved.

If I understand correctly, your point is that we should assume that the training data has missing values, e.g. for certain samples, the expression of certain genes is not measured. I believe this is what the original GAIN implementation does and it is a fair point.

Glad that helps! I was afraid that my explanation is too confusing (esp. when the point (2) is different from the subject/ title of this issue "Should predict(..) corresponds to Eq. (2) and train_disc(...)?". OK, I tried to use your words "unexpressed gens", etc. to elaborate my opinions. I think you understand my concern now~

However, our approach is different. Instead of evaluating the performance on the "missing components of the train set" (this is done in the original GAIN), we evaluate the results on an unobserved test set. In particular, we use the train set without missing values to train the model and we dynamically sample random masks at train time.

Yes, I think you "evaluate the results on an unobserved test set" and you "dynamically sample random masks at train time".

Our approach is advantageous in the sense of data augmentation, which mitigates the problem of having a scarce number of high-dimensional transcriptomics samples to train a deep model. In other words, our model gets to see a higher variety of samples at training time, e.g. several masks are applied to each training sample, effectively increasing the number of training samples. This is in contrast to the other approach where a single, static mask is applied to each training sample.

OK, "deep model" definitely have advantageous (whatever how different you and I preprocess the data and whether or not masks are corrected). OK, I did not check other methods (the two you mentioned were not implemented and on my side, we compared with some regression-based methods) and let me verify whether they do "static mask". Anyway, in GAN, "missing at random" in the mask makes sense.

Then, we evaluate the performance on an unseen test set once the model is trained (in contrast to the other approach). By doing so, we aim to test the ability of our model to generalise to datasets with missing values. We hope that our model trained on GTEx is general enough to perform well on data collected in independent studies (e.g. typically studies where the number of samples is low), without having to train the model on those small datasets [*].

Yes, I understand your point of "_unseen_test", not strange (although "in contrast to other approaches"). I ran through you codes and believe this is acceptable to me.

"generalise to datasets" is also the aim/ goal of these GTEx publications as well. Yep, but as I explained (and I believe you understand now),

both (1) training procedure and (2) evaluation procedure have to avoid treating "missing genes" as "genes" with ground truth zero.

That said, the generalisation error is exactly what we report for TCGA, e.g. we used the model trained on GTEx and evaluated its performance on TCGA without any training or fine-tuning.

Well, actually I used "normalized" data and I do not know much about biology (only a little bit for checking other teammates' data preprocessing). Anyway, this should not affect our data second-stage preprocessing or mask issue or neural network architecture. "fine-tuning" is a different issue (I also did various DL projects in time series, NLP, CV and think "fine-tuning" is just "fine-tuning").

To sum up, we do not expect users to train the model from scratch on small, independent datasets with missing values (although this is possible with the approach that you mention), but rather use the model trained on GTEx to infer the missing values. I hope this clarifies your concern.

Yep, I think you understand my concern (2) now although it is different from what the subject/ title of this issue ("Should predict(..) corresponds to Eq. (2) and train_disc(...)?") is. I think my training/validation datasets should be same as yours, I only have that correction of mask (as I said, mask should be corrected and the correction is actually very easy).

[*] This is why we decided to train the model on GTEx - the dataset is very comprehensive and it has samples from a broad variety of tissue types. It would be interesting to investigate whether the model benefits from fine-tuning on independent datasets, potentially using the approach of having static masks.

yezhengli-Mr9 commented 4 years ago

While (1) not replacing nan by zeros, I mean: (a) input contains nan (b) enforce nan entry in mask to be zero; and

my MSE/R2 only take into account entries with gene expression value (that is, not nan) excluding ones with nanas input;

This seems more reasonable than your mask not only in def get_test_extended(...) in gtex_gain_analysis.ipynb but also in gtex_gain.py.

(2) train/test split is slightly different from yours: I have 75%838 individuals to train and 25%838 individuals to test; (3) after one-hour running one loop in for epoch in tqdm.tqdm(range(epochs)) and then if patience == 0: break. (4) NEITHER do I have the line in predict(...) mentioned above.

I can obtain (although not as good as what you showed in Table 1, but I think might still be good) and a zoom-in (with ax.set_ylim([0, 1]) ) is

Oh, in case you might correct your paper's corresponding results, have to mention (5) only consider chromosome 1 (only 1732 protein-coding genes, not 12,557).

rvinas commented 4 years ago

I still do not understand what is the problem with the mask and genes with zero expression. Also, why should we only consider chromosome 1 (e.g. 1732 protein-coding genes) when you can use all the information? This doesn't make sense to me.

yezhengli-Mr9 commented 4 years ago

I still do not understand what is the problem with the mask and genes with zero expression.

Let me find codes of "neural network on GTEx" to show/convince you that: your loss function in score_fn(...) and in generator_loss(...) and in evaluation procedure and in supervised_loss(...) should be

$\sum_{g}\sum_{t}\ell_{gt:\{m_{gt}=0 \text{ and the gth gene in tth tissue has ground truth (I mean not intrisincally missing)} \} } (\hat{x}_{gt},x_{gt})$

= $\sum_{g}\sum_{t}\ell_{gt:\{m_{gt}=0 , x_{gt}\ne 0/nan \} } (\hat{x}_{gt},x_{gt}) =\sum_{g,t:\{m_{gt}=0, x_{gt}\ne 0/nan \} }\ell_{gt}(\hat{x}_{gt},x_{gt})$

instead of just $\sum_{g}\sum_{t} \ell_{gt:\{m_{gt}=0\}}(\hat{x}_{gt},x_{gt})=\sum_{g,t:\{m_{gt}=0/nan\}} \ell_{gt}(\hat{x}_{gt},x_{gt})$

= $\sum_{g,t:\{m_{gt}=0, x_{gt} \ne 0/nan \}} \ell_{gt}(\hat{x}_{gt},x_{gt}) \oplus \sum_{g,t:\{m_{gt}=0, x_{gt} =0/nan \}} \ell_{gt}(\hat{x}_{gt},0)$ . where the second term cannot be explained.

Actually I think this should be standard (removal of second term) in all machine learning algorithms; just in case you zero_padding the input and you might forget it. This will likely make results in GAIN-GTEx too good to be true?

Also, why should we only consider chromosome 1 (e.g. 1732 protein-coding genes) when you can use all the information? This doesn't make sense to me.

This is just a demo on my side (on chromosome 1 with 1732).

yezhengli-Mr9 commented 4 years ago

I still do not understand what is the problem with the mask and genes with zero expression.

Let me find codes "of neural network on GTEx" to show: your loss function (in score_fn(...) and in generator_loss(...) and in evaluation procedure and in supervised_loss(...)) should be
$\sum_{g}\sum_{t}\ell_{gt:\{m_{gt}=0 \text{ and the gth gene in tth tissue has ground truth (I mean not intrisincally missing)} \} } (\hat{x}_{gt},x_{gt})$
= $\sum_{g}\sum_{t}\ell_{gt:\{m_{gt}=0, x_{gt}\ne 0/nan \} } (\hat{x}_{gt},x_{gt}) =\sum_{g,t:\{m_{gt}=0, x_{gt}\ne 0/nan \} }\ell_{gt}(\hat{x}_{gt},x_{gt})$

instead of just
$\sum_{g}\sum_{t} \ell_{gt:\{m_{gt}=0\}}(\hat{x}_{gt},x_{gt})=\sum_{g,t:\{m_{gt}=0\}} \ell_{gt}(\hat{x}_{gt},x_{gt})$
= $\sum_{g,t:\{m_{gt}=0, x_{gt} \ne 0/nan \}} \ell_{gt}(\hat{x}_{gt},x_{gt}) \oplus \sum_{g,t:\{m_{gt}=0, x_{gt} =0/nan \}} \ell_{gt}(\hat{x}_{gt},0)$ .

where the second term cannot be explained.

Actually I think this should be standard (removal of second term) in all machine learning algorithms; just in case you zero_padding the input and you might forget it. This will probably makes results in GAIN-GTEx too good?

Also, why should we only consider chromosome 1 (e.g. 1732 protein-coding genes) when you can use all the information? This doesn't make sense to me.

This is just a demo on my side (on chromosome 1 with 1732).

Will try to find a simple and convincing codes (for peer-viewed published work) handling intrinsically missing values like GTEx, NOT just with complete data like MINIST or CelebA although not for sure (1) whether or not you will ring a bell by the mathematical formula, (2) how much time I can search for simple and convincing codes today -- this takes time because

(a) Some GAN work/ codes for "missing imputation" actually analyze on compete data (for example, MNIST, etc. that is they do not have input with intrinsic missing values) although they do pay attention to my second to last issue.
(b) Some codes are not simple enough, or not talking about missing imputation.
(c) Some are not peer-reviewed
(d) ...

yezhengli-Mr9 commented 4 years ago

I still do not understand what is the problem with the mask and genes with zero expression.

Let me find codes "of neural network on GTEx" to show: you loss function (in score_fn(...) and in generator_loss(...) and in evaluation procedure and in supervised_loss(...)) should be
$\sum_{g}\sum_{t}\ell_{gt:\{m_{gt}=0 \text{ and the gth gene in tth tissue has ground truth (I mean not intrisincally missing)} \} } (\hat{x}_{gt},x_{gt})$
= $\sum_{g}\sum_{t}\ell_{gt:\{m_{gt}=0, x_{gt}\ne 0/nan \} } (\hat{x}_{gt},x_{gt}) =\sum_{g,t:\{m_{gt}=0, x_{gt}\ne 0/nan \} }\ell_{gt}(\hat{x}_{gt},x_{gt})$ But actually I think this should be standard in all machine learning algorithms; just in case you zero_padding the input and you might forget it. This will probably makes results in GAIN-GTEx too good?

Also, why should we only consider chromosome 1 (e.g. 1732 protein-coding genes) when you can use all the information? This doesn't make sense to me.

This is just a demo on my side (on chromosome 1 with 1732).

Will try to find a simple and convincing codes (for peer-viewed published work) handling intrinsically missing values like GTEx, NOT just with complete data like MINIST or CelebA although not for sure (1) whether or not you will ring a bell by the mathematical formula, (2) how much time I can search for simple and convincing codes today -- this takes time because

(a) Some GAN work/ codes for "missing imputation" actually analyze on compete data (for example, MNIST, etc. that is they do not have input with intrinsic missing values) although they do pay attention to my second to last issue.

(b) Some codes are not simple enough, or not talking about missing imputation.

(c) Some are not peer-reviewed

(d) ...

Before I finializing an example from neural network, let me provide a classical machine learning example: Eq. (5) In Tensor factorization for missing data imputation in medical questionnaires 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2012. handling input data with loss function removing unobserved genes/ intrinsic missing value part (although explaining their complicated codes takes time and effort). However, not removed in your loss function in score_fn(...) and in generator_loss(...) and in evaluation procedure and in supervised_loss(...). I believe this removal is standard for all responsible work handling input data with intrinsic missing values. Can you explain why this does not appear in your implementation?

I will come up with a neural network example soon if you still do not understand. To emphasize, this is not a trivial correction: this will likely make results in GAIN-GTEx too good to be true?

yezhengli-Mr9 commented 4 years ago

I still do not understand what is the problem with the mask and genes with zero expression.

Let me find codes "of neural network on GTEx" to show: you loss function (in score_fn(...) and in generator_loss(...) and in evaluation procedure and in supervised_loss(...)) should be
$\sum_{g}\sum_{t}\ell_{gt:\{m_{gt}=0 \text{ and the gth gene in tth tissue has ground truth (I mean not intrisincally missing)} \} } (\hat{x}_{gt},x_{gt})$
= $\sum_{g}\sum_{t}\ell_{gt:\{m_{gt}=0, x_{gt}\ne 0/nan \} } (\hat{x}_{gt},x_{gt}) =\sum_{g,t:\{m_{gt}=0, x_{gt}\ne 0/nan \} }\ell_{gt}(\hat{x}_{gt},x_{gt})$

instead of just
$\sum_{g}\sum_{t} \ell_{gt:\{m_{gt}=0\}}(\hat{x}_{gt},x_{gt})=\sum_{g,t:\{m_{gt}=0\}} \ell_{gt}(\hat{x}_{gt},x_{gt})$
= $\sum_{g,t:\{m_{gt}=0, x_{gt} \ne 0/nan \}} \ell_{gt}(\hat{x}_{gt},x_{gt}) \oplus \sum_{g,t:\{m_{gt}=0, x_{gt} =0/nan \}} \ell_{gt}(\hat{x}_{gt},0)$ .

But actually I think this should be standard in all machine learning algorithms; just in case you zero_padding the input and you might forget it. This will probably makes results in GAIN-GTEx too good?

Also, why should we only consider chromosome 1 (e.g. 1732 protein-coding genes) when you can use all the information? This doesn't make sense to me.

This is just a demo on my side (on chromosome 1 with 1732).

Will try to find a simple and convincing codes (for peer-viewed published work) handling intrinsically missing values like GTEx, NOT just with complete data like MINIST or CelebA although not for sure (1) whether or not you will ring a bell by the mathematical formula, (2) how much time I can search for simple and convincing codes today -- this takes time because

(a) Some GAN work/ codes for "missing imputation" actually analyze on compete data (for example, MNIST, etc. that is they do not have input with intrinsic missing values) although they do pay attention to my second to last issue.

(b) Some codes are not simple enough, or not talking about missing imputation.

(c) Some are not peer-reviewed

(d) ...

Before I finializing an example from neural network, let me provide a classical machine learning example: Eq. (5) In Tensor factorization for missing data imputation in medical questionnaires 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2012. handling input data with loss function removing unobserved genes/ intrinsic missing value part (although explaining their complicated codes takes time and effort). However, not removed in your loss function in score_fn(...) and in generator_loss(...) and in evaluation procedure and in supervised_loss(...). I believe this removal is standard for all responsible work handling input data with intrinsic missing values. Can you explain why this does not appear in your implementation?

I will come up with a neural network example soon if you still do not understand.

Let me provide you a baby neural network example (their randomness is different, but both their main body and their code following this standard quite well): McCoy, John T., Steve Kroon, and Lidia Auret. ["Variational autoencoders for missing data imputation with application to a simulated milling circuit."]() IFAC-PapersOnLine 51.21 (2018): 141-146.

In their section 3.5 "replacing a randomly selected variable in each of these rows with NaN (not a number)" and "(not a number)" implies they already treat input data with unobserved values seriously. In their code, line 276:next_batch(Xdata,batch_size, MissingVals = False) show that they only pick samples with no unobserved values. Indeed, they have the option MissingVals = True but to my understanding (and they emphasize "(not a number)" in section 3.5), option MissingVals = True does not make sense:

if the loss function contains $\ell_{gt:\{m_{gt}=0 \} } (\hat{x}_{gt}, 0)$ where 0 is from nan/ unobserved value, how do you explain loss function $\sum_{g,t}\ell_{gt:\{m_{gt}=0 \} } (\hat{x}_{gt}, x_{gt})= \sum_{g,t}\ell_{gt:\{m_{gt}=0, x_{gt}\ne 0 \} } (\hat{x}_{gt}, x_{gt}) \oplus \sum_{g,t}\ell_{gt:\{m_{gt}=0 , x_{gt} = 0\} } (\hat{x}_{gt}, 0)$ in your implementation?

I am finding better/ closer/ more similar code examples for published results to explain this if you still do not get the doubt/ confusing point... To emphasize again, this is not a trivial correction: this will likely make results in GAIN-GTEx too good to be true?

rvinas commented 4 years ago

Some comments:

The fact that the mask is 0 for a particular gene g means that the expression of g is missing. Therefore, the loss functions that you mention earlier (where the sum is over the missing components, e.g. m_{gt}=0) cannot be used, as the assumption is that we do not have the ground truth for the missing components. Our loss function cannot be written in the form that you suggest.
Instead, here (and also in GAIN) we divide the loss function into two parts:
1. The supervised loss: It computes the MSE between the observed genes (those for which the ground truth is available, e.g. with mask=1) and the reconstructed expressions. This is similar to the reconstruction loss of an autoencoder.
2. The adversarial loss (consisting of the generator and the discriminator losses). This loss is used for the missing components (e.g. those with mask=0). Since in this case we do not know the ground truth expression, the generator loss for these components is based on the ability of the discriminator to distinguish whether each missing component has been observed or imputed by the generator. In no case this uses information about the ground truth expression of the missing components.
It is true that we set expression values of the missing genes to zero, e.g. x_g = 0 for each missing gene g. However, as discussed in the previous point, we never use these zero values as ground truth or labels to train the network (which wouldn't make any sense). This is done similarly in GAIN. This doesn't inflate our results because (1) we compute the R2 scores only on the imputed components (as opposed to the reconstructed components) and (2) even if using zeros for the missing components somehow biases the results (e.g. towards 0), the R2 score would penalise this as it measures the proportion of explained variance.
We do not select a subset of genes in advance and predict the expression of those genes via supervised, multivariate regression.
We are performing imputation within a sample from a single tissue. That is, we do not leverage information about samples from other tissues within the same patient. This is another reason why the equations that you suggest do not apply here. That said, predicting the expression of a tissue a from the expressions of tissues b and c (e.g., multi-tissue imputation) is an extremely interesting problem that we may want to address in the future.

yezhengli-Mr9 commented 4 years ago

Some comments:

The fact that the mask is 0 for a particular gene g means that the expression of g is missing. Therefore, the loss functions that you mention earlier (where the sum is over the missing components, e.g. m_{gt}=0) cannot be used, as the assumption is that we do not have the ground truth for the missing components. Our loss function cannot be written in the form that you suggest.

Glad to get the point! I think this is my only confusing point -- you saved me tons of time (actually from last 12 hours ago, I am busy finding closer/better codes for you). I think you understand my point but according to your implementation (as well as zero padding for intrinsic missing values), your generator_loss(...) and score_fn(...) and supervised_loss are implmementing the loss functions as

$\sum_{g}\sum_{t} \ell_{gt:\{m_{gt}=0\}}(\hat{x}_{gt},x_{gt})=\sum_{g,t:\{m_{gt}=0\}} \ell_{gt}(\hat{x}_{gt},x_{gt})$

= $\sum_{g,t:\{m_{gt}=0, x_{gt} \ne 0/nan \}} \ell_{gt}(\hat{x}_{gt},x_{gt}) \oplus \sum_{g,t:\{m_{gt}=0, x_{gt} =0/nan \}} \ell_{gt}(\hat{x}_{gt},0)$ . instead of correct ones:

$\sum_{g}\sum_{t}\ell_{gt:\{m_{gt}=0 \text{ and the gth gene in tth tissue has ground truth (I mean not intrisincally missing)} \} } (\hat{x}_{gt},x_{gt})$

= $\sum_{g}\sum_{t}\ell_{gt:\{m_{gt}=0, x_{gt}\ne 0/nan \} } (\hat{x}_{gt},x_{gt}) =\sum_{g,t:\{m_{gt}=0, x_{gt}\ne 0/nan \} }\ell_{gt}(\hat{x}_{gt},x_{gt})$ At least, after my re-running your code, I believe there is no part you handle this. Below I mention in detail why this is different from GAIN -- I checked at least three of their five real datasets, their raw data does not have intrinsic missing values, but GTEx have a large amount.

Instead, here (and also in GAIN) we divide the loss function into two parts:

The supervised loss: It computes the MSE between the observed genes (those for which the ground truth is available, e.g. with mask=1) and the reconstructed expressions. This is similar to the reconstruction loss of an autoencoder.

The adversarial loss (consisting of the generator and the discriminator losses). This loss is used for the missing components (e.g. those with mask=0). Since in this case we do not know the ground truth expression, the generator loss for these components is based on the ability of the discriminator to distinguish whether each missing component has been observed or imputed by the generator. In no case this uses information about the ground truth expression of the missing components.

It is true that we set expression values of the missing genes to zero, e.g. x_g = 0 for each missing gene g. However, as discussed in the previous point, we never use these zero values as ground truth or labels to train the network (which wouldn't make any sense). This is done similarly in GAIN.

I think I checked GAIN's tables of course: but they do not have intrinsic missing values at least in dataset Breast, Letter, Credit(; for , "Spam" and "News" I do not even not which dataset or their preprocessing procedures).

This doesn't inflate our results because (1) we compute the R2 scores only on the imputed components (as opposed to the reconstructed components) and (2) even if using zeros for the missing components somehow biases the results (e.g. towards 0), the R2 score would penalise this as it measures the proportion of explained variance.

Yes, glad you get my point -- "somehow biases the results (e.g. towards 0)" is the only doubt I have. I think GTEx's raw data has intrinsic missing values, as a result should be handled more carefully than GAIN. I still think "somehow biases the results (e.g. towards 0)" exists in your code, (1) this could be avoided without much difficulty? I did not see the reason why you should keep them there; (2) Let me think how "the R2 score would penalise this as it measures the proportion of explained variance"...However, after one-hour mull over your this sentence just now, I do not think this explained anything.

I think you have such "somehow biases the results (e.g. towards 0)" in your

(a) training procedure: generator_loss(...) and score_fn(...) and supervised_loss (b) evaluation procedure: compute_r2_score(...) and compute_mse_score(...) in gtex_gain_analysis.ipynb.

We do not select a subset of genes in advance and predict the expression of those genes via supervised, multivariate regression.

Sorry about this point. I mentioned that my boxplots are demo. Do not worry about this part -- were it not for the case I have to explain my doubt by looking for other better/closer codes, I will have already provided you updated version of boxplots on all relevant genes.

We are performing imputation within a sample from a single tissue. That is, we do not leverage information about samples from other tissues within the same patient. This is another reason why the equations that you suggest do not apply here. That said, predicting the expression of a tissue a from the expressions of tissues b and c (e.g., multi-tissue imputation) is an extremely interesting problem that we may want to address in the future.

yezhengli-Mr9 commented 4 years ago

We are performing imputation within a sample from a single tissue. That is, we do not leverage information about samples from other tissues within the same patient. This is another reason why the equations that you suggest do not apply here. That said, predicting the expression of a tissue a from the expressions of tissues b and c (e.g., multi-tissue imputation) is an extremely interesting problem that we may want to address in the future.

yezhengli-Mr9 commented 4 years ago

Oh, sorry, I mistakenly "closed this issue" just now...

yezhengli-Mr9 commented 4 years ago

Some comments:

The fact that the mask is 0 for a particular gene g means that the expression of g is missing. Therefore, the loss functions that you mention earlier (where the sum is over the missing components, e.g. m_{gt}=0) cannot be used, as the assumption is that we do not have the ground truth for the missing components. Our loss function cannot be written in the form that you suggest.

Instead, here (and also in GAIN) we divide the loss function into two parts:

The supervised loss: It computes the MSE between the observed genes (those for which the ground truth is available, e.g. with mask=1) and the reconstructed expressions. This is similar to the reconstruction loss of an autoencoder.

The adversarial loss (consisting of the generator and the discriminator losses). This loss is used for the missing components (e.g. those with mask=0). Since in this case we do not know the ground truth expression, the generator loss for these components is based on the ability of the discriminator to distinguish whether each missing component has been observed or imputed by the generator. In no case this uses information about the ground truth expression of the missing components.

It is true that we set expression values of the missing genes to zero, e.g. x_g = 0 for each missing gene g. However, as discussed in the previous point, we never use these zero values as ground truth or labels to train the network (which wouldn't make any sense). This is done similarly in GAIN. This doesn't inflate our results because (1) we compute the R2 scores only on the imputed components (as opposed to the reconstructed components) and (2) even if using zeros for the missing components somehow biases the results (e.g. towards 0), the R2 score would penalise this as it measures the proportion of explained variance.

Hi Ramon, in case I provided detailed response to your comment just now and the case: (1) I finally make you understand my confusing words/expression of my this doubt; (2) I re-run and read every line of your code,

How about let us simplify the last question: would you mind pointing out

how you avoiding "using zeros for the missing components somehow biases the results" in your algorithm

(while "missing components" might be confusing, I have to emphasize, I am talking about unobserved genes/ intrinsic missing values) where you set zero padding there)

since I believe the algorithm "inflate our results" by having generator_loss(...) and score_fn(...) and supervised_loss implemented as

$\sum_{g}\sum_{t} \ell_{gt:\{m_{gt}=0\}}(\hat{x}_{gt},x_{gt})=\sum_{g,t:\{m_{gt}=0\}} \ell_{gt}(\hat{x}_{gt},x_{gt})$

= $\sum_{g,t:\{m_{gt}=0, x_{gt} \ne 0/nan \}} \ell_{gt}(\hat{x}_{gt},x_{gt}) \oplus \sum_{g,t:\{m_{gt}=0, x_{gt} =0/nan \}} \ell_{gt}(\hat{x}_{gt},0)$ . instead of correct ones:

$\sum_{g}\sum_{t}\ell_{gt:\{m_{gt}=0 \text{ and the gth gene in tth tissue has ground truth (I mean not intrisincally missing)} \} } (\hat{x}_{gt},x_{gt})$

= $\sum_{g}\sum_{t}\ell_{gt:\{m_{gt}=0, x_{gt}\ne 0/nan \} } (\hat{x}_{gt},x_{gt}) =\sum_{g,t:\{m_{gt}=0, x_{gt}\ne 0/nan \} }\ell_{gt}(\hat{x}_{gt},x_{gt})$

Well, to avoid irritating you (and indeed thanks very much for your patience to understand my doubt), (1) I refer to your lines of codes since last time you mentioned and in above lines as well; (2) let me emphasize that your mask from get_mask_hint_b(...) has nothing to do with unobserved/ intrinsic missing values (I mean get_mask_hint_b(...) only involves $m_{gt}=0$ , not $x_{gt}=0/nan, x_{gt}\ne 0/nan$

We do not select a subset of genes in advance and predict the expression of those genes via supervised, multivariate regression.

We are performing imputation within a sample from a single tissue. That is, we do not leverage information about samples from other tissues within the same patient. This is another reason why the equations that you suggest do not apply here. That said, predicting the expression of a tissue a from the expressions of tissues b and c (e.g., multi-tissue imputation) is an extremely interesting problem that we may want to address in the future.

rvinas commented 4 years ago

As discussed in my last comment (second point), I do not agree that we are implementing the loss functions that you mention. If I understand your notation correctly, your equation of the "correct" loss is implementing a supervised loss, whereas our approach is fully unsupervised.

I am also not sure what you mean by intrinsic values, but our raw dataset doesn't have any nan. In the preprocessing code, you will see that we take the intersection of genes across all the 49 tissue types. After preprocessing, there isn't any nan in any sample of the dataset.

I feel I have already addressed all your concerns in my previous comments. I hope this helps.

yezhengli-Mr9 commented 4 years ago

Get it, ok I checked your writeup it indeed mentions "We also select the intersection of all the protein-coding genes among these tissues, yielding 12,557 unique human genes". Well, if it is the case (let me check with other teammate who preprocess data, they seems have not waken up and ready for work now),

please forgive me for taking you too much time to explain since in issue 2 I asked `set "missing components" to be zero` and you answered "yes" -- then I indeed think you put `nan` to be zero for input (I was NOT asking "set `mask` to be zero") -- now I think you just refers to x_ = x * mask in gtex_gain.py.

But let me verify with my teammates if it is indeed to have about 12,557 genes left.

As discussed in my last comment (second point), I do not agree that we are implementing the loss functions that you mention.

If your input has no zero padding for nan values since your input has no nan because of "intersection" (I think I misunderstand your "yes" to my question set "missing components" to be zero in issue 2 -- I am to severely blamed), then I agree that your losses are correct.

If I understand your notation correctly, your equation of the "correct" loss is implementing a supervised loss, whereas our approach is fully unsupervised.

I am also not sure what you mean by intrinsic values, but our raw dataset doesn't have any nan. In the preprocessing code, you will see that we take the intersection of genes across all the 49 tissue types. After preprocessing, there isn't any nan in any sample of the dataset.

I feel I have already addressed all your concerns in my previous comments. I hope this helps.

Yes, if your input has no zero_padding for nan values since your input has no nan because of "intersection" (which I misunderstood from "yes" to my question set "missing components" to be zero in issue 2(https://github.com/rvinas/GAIN-GTEx/issues/2), then actually you have already addressed all my concerns and very sorry for taking your time explaining this (I believe it is due to my misunderstanding from "yes" to my question set "missing components" to be zero in issue 2 in gtex_gain.py.

yezhengli-Mr9 commented 4 years ago

As discussed in my last comment (second point), I do not agree that we are implementing the loss functions that you mention. If I understand your notation correctly, your equation of the "correct" loss is implementing a supervised loss, whereas our approach is fully unsupervised.

I am also not sure what you mean by intrinsic values, but our raw dataset doesn't have any nan. In the preprocessing code, you will see that we take the intersection of genes across all the 49 tissue types. After preprocessing, there isn't any nan in any sample of the dataset.

I feel I have already addressed all your concerns in my previous comments. I hope this helps.

Thanks very much for your patience~ I think I understand this issue correctly now.

yezhengli-Mr9 commented 4 years ago

As discussed in my last comment (second point), I do not agree that we are implementing the loss functions that you mention. If I understand your notation correctly, your equation of the "correct" loss is implementing a supervised loss, whereas our approach is fully unsupervised. I am also not sure what you mean by intrinsic values, but our raw dataset doesn't have any nan. In the preprocessing code, you will see that we take the intersection of genes across all the 49 tissue types. After preprocessing, there isn't any nan in any sample of the dataset. I feel I have already addressed all your concerns in my previous comments. I hope this helps.

Thanks very much for your patience~ I think I understand this issue correctly now.

Yeah, I mull over your last two responses and think you clarifies everything to me now; let me check about "12,557 protein-codin genes" then. On my side, it is important for me report to my PhD advisor and teammates, and you provide very decent explanation.

Again, very sorry for misunderstanding your "yes" on set "missing components" to be zero in issue #2 -- in issue #2, I was initially just talking about "any sample of dataset?" (preprocessing dataset/ input), not recognizing that your "yes" just refers to x_ = x * mask in gtex_gain.py (seems to me NOT talking about "any sample of dataset", that is, NOT preprocessing dataset).

rvinas commented 4 years ago

No problem, Yezheng. I am happy that the last two answers were useful :)

yezhengli-Mr9 commented 4 years ago

No problem, Yezheng. I am happy that the last two answers were useful :)

Last one is super useful. I rechecked your previous responses and think actually my that misunderstanding furtherly made me mistunderstand many of your responses. On my side, I have to (1) reproduce your result, or (2) give explanation to my PhD advisor (he initially gave the team here the task with nan in the input so I messed up everything here), why I cannot reproduce.

On the other hand, while "We also select the intersection of all the protein-coding genes among these tissues, yielding 12,557 unique human genes" clarifies that input has no nan, I am still verifying here with the team: in my mind there might still have various preprocessing ways (I mean, ways to throw away nan data) -- seems getting back to issue #2. But let me verify first -- to emphasize, my goal is to (1) reproduce your result, or (2) give explanation to my PhD advisor why I cannot reproduce.

rvinas / GTEx-imputation

"Should predict(..) corresponds to Eq. (2) and train_disc(...)?" and a second concern regarding handling "missing genes" #5

my MSE/R2 only take into account entries with gene expression value (that is, not nan) excluding ones with nanas input;

both (1) training procedure and (2) evaluation procedure have to avoid treating "missing genes" as "genes" with ground truth zero.

my MSE/R2 only take into account entries with gene expression value (that is, not nan) excluding ones with nanas input;

I think you have such "somehow biases the results (e.g. towards 0)" in your

how you avoiding "using zeros for the missing components somehow biases the results" in your algorithm

(while "missing components" might be confusing, I have to emphasize, I am talking about unobserved genes/ intrinsic missing values) where you set zero padding there)

my MSE/R2 only take into account entries with gene expression value (that is, not `nan`) excluding ones with `nan`as input;

my MSE/R2 only take into account entries with gene expression value (that is, not `nan`) excluding ones with `nan`as input;