rvinas / GTEx-imputation

Gene Expression Imputation with Generative Adversarial Imputation Nets
MIT License
11 stars · 3 forks

any sample of dataset? #2

Closed yezhengli-Mr9 closed 4 years ago

yezhengli-Mr9 commented 4 years ago

Hi Ramon,

(a) x, symbols, sampl_ids, tissues = load_gtex(corrected=CORRECTED) in gtex_gain.py, as well as return pd.read_csv(file, index_col=0) # nrows=1000 in utils.py, seem to read metadata as well?

(b) From the several df = pd.read_csv('{}/{}{}.csv'.format(data_dir, prefix, tissue)) calls in utils.py, I think you have a matrix for each tissue (like a matrix of genes and sample_ids), correct?

I wish I could get samples of the dataset (only a sample would be fine in case the CSVs are large...) in order to run python gtex_gain.py...

Best, Yezheng

rvinas commented 4 years ago

Hi Yezheng,

a) That's exactly right:

  • x is an m x n matrix containing the expression values, where m is the number of samples and n the number of genes.
  • symbols are the n gene names.
  • sampl_ids are the m sample identifiers.
  • tissues is an array of m values indicating which tissue each sample belongs to.

The second return that you mention just loads the expression data, i.e. x (no metadata).

b) Correct. We initially processed the data separately for each tissue and the function that you mention is merging the tissue-specific data into a single dataframe. This function might not be needed in many situations.
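For concreteness, the four return values fit together as sketched below. `load_gtex_toy` is a hypothetical stand-in for the repository's loader, which reads the processed GTEx CSVs; only the shapes matter here:

```python
import numpy as np

# Hypothetical stand-in for load_gtex, just to illustrate the shapes;
# the real function reads the processed GTEx CSVs instead.
def load_gtex_toy(m=4, n=3):
    x = np.random.rand(m, n)                                    # m x n expression matrix
    symbols = np.array([f'GENE{j}' for j in range(n)])          # n gene names
    sampl_ids = np.array([f'SAMPLE-{i}' for i in range(m)])     # m sample identifiers
    tissues = np.array(['Lung', 'Liver', 'Lung', 'Brain'][:m])  # one tissue label per sample
    return x, symbols, sampl_ids, tissues

x, symbols, sampl_ids, tissues = load_gtex_toy()
assert x.shape == (len(sampl_ids), len(symbols))  # rows = samples, columns = genes
assert len(tissues) == x.shape[0]                 # one tissue label per sample
```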

Regarding the samples of the dataset, I am afraid I cannot upload them to the Github repo. However, GTEx is publicly available and all the data can be downloaded from the GTEx portal.

Best wishes, Ramon

yezhengli-Mr9 commented 4 years ago


OK, it seems you directly use the raw GTEx_Analysis_v8_eQTL_expression_matrices.tar; let me try getting it directly from the raw data. I thought there might be some preprocessing: I saved a CSV for each sample/individual, a matrix of n_tissues by n_genes (I thought you had a similar preprocessing step, referring to your compute_r2_score(x_test_extended, x_gen_extended, mask, nb_tissues=49) in gtex_gain_analysis.ipynb). Thanks,

rvinas commented 4 years ago

Hi Yezheng,

No, we don't use the raw values directly. We processed the data as follows:

a. Genes were selected based on expression thresholds of >=0.1 TPM in >=20% of samples and >=6 reads (unnormalized) in >=20% of samples.
b. Read counts were normalized between samples using TMM (Robinson, M. D. & Oshlack, A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology 11, R25 (2010)).
c. Expression values for each gene were inverse normal transformed.
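For readers who want a rough Python analogue of steps a and c, here is a sketch (my illustration, not the repository's code; the TMM normalization in step b is typically done with edgeR's calcNormFactors in R and is not reimplemented here):

```python
import numpy as np
from statistics import NormalDist

def filter_genes(tpm, counts, tpm_thr=0.1, count_thr=6, frac=0.2):
    """Step a: keep genes with >= tpm_thr TPM and >= count_thr reads in >= frac of samples.
    Both tpm and counts are samples x genes arrays."""
    m = tpm.shape[0]
    keep_tpm = (tpm >= tpm_thr).sum(axis=0) >= frac * m
    keep_counts = (counts >= count_thr).sum(axis=0) >= frac * m
    return keep_tpm & keep_counts

def inverse_normal_transform(values):
    """Step c: rank-based inverse normal transform of one gene's values
    (ties broken arbitrarily)."""
    ranks = np.argsort(np.argsort(values)) + 1   # ranks 1..m
    quantiles = (ranks - 0.5) / len(values)      # map ranks into (0, 1)
    return np.array([NormalDist().inv_cdf(q) for q in quantiles])
```

In this setup the transform would be applied column by column, i.e. once per gene across all samples of a tissue.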

Best wishes, Ramon

yezhengli-Mr9 commented 4 years ago


Got it, thanks~ Let me have a try. On my side, I am actually not in charge of data pre-processing, but they do mention there is a difference between the "normalized data" GTEx_Analysis_v8_eQTL_expression_matrices.tar and the "rawer" count data. Let me have another try. Thanks

yezhengli-Mr9 commented 4 years ago

Hi Ramon, have you by any chance set the "missing components" to zero?

rvinas commented 4 years ago

Hi Yezheng,

Yes, missing components are set to zero.
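In GAIN-style imputation this zero-filling goes hand in hand with a binary mask, so the model knows which entries were actually observed. A minimal sketch (my illustration, not the repository's exact code):

```python
import numpy as np

# Tiny example matrix with missing entries encoded as NaN
x = np.array([[1.2, np.nan, 0.7],
              [np.nan, 2.1, np.nan]])

mask = (~np.isnan(x)).astype(np.float32)  # 1 = observed, 0 = missing
x_filled = np.nan_to_num(x, nan=0.0)      # missing components set to zero
```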


yezhengli-Mr9 commented 4 years ago


Is there a maximum number of samples one can plug into load_gtex(...) as well as gtex_main.py?

(1) By "number of samples" I mean len(sampl_id) == len(tissues) == x.shape[0] == len(symbols). I saw the comment # nrows=1000 in utils.py. When I filter out completely-NaN rows (rather than setting them to zero), I have at most 14,442 rows.
(2) I am not an expert in TensorFlow; in PyTorch, with a much smaller neural network, I can plug in 14,442 rows (actually, I can tensorize all 14,442 rows and feed them into the neural network) without any difficulty...
(3) Here I can plug in at most 118 rows for 1,732 "protein-coding" genes in chrom-1 only (rather than 12,557 unique human genes). Otherwise I see bugs like the following (no intention to trouble you with bugs, just for the completeness of my question itself):

[gtex_gain] x.shape (119, 1732) len(symbols) 119 len(sampl_ids) 119 len(tissues) 119
            SEX    AGE  DTHHRDY
SUBJID
GTEX-1117F    2  60-69      4.0
GTEX-111CU    1  50-59      0.0
GTEX-111FC    1  60-69      1.0
GTEX-111VG    1  60-69      3.0
GTEX-111YS    1  60-69      0.0
Cat covs:  (119, 3)
Vocab sizes:  [2, 5, 32]
2020-07-21 15:48:42.285272: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-07-21 15:48:42.350561: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fb464ed1db0 executing computations on platform Host. Devices:
2020-07-21 15:48:42.350583: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
[gtex_gain] np.sum(np.sum(x_test)) nan np.sum(np.sum(cat_covs_test)) 439 np.sum(np.sum(num_covs_test)) -37.57211
  0%|                                                         | 0/10000 [00:00<?, ?it/s]2020-07-21 15:48:45.535110: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Invalid argument: indices[4] = -1 is not in [0, 5)
     [[{{node model_1/embedding_1/embedding_lookup}}]]
  0%|                                                         | 0/10000 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "gtex_gain.py", line 492, in <module>
    save_fn=save_fn)
  File "gtex_gain.py", line 287, in train
    disc_loss = train_disc(x, z, cc, nc, mask, hint, b, gen, disc, disc_opt)
  File "/Users/yezheng/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 457, in __call__
    result = self._call(*args, **kwds)
  File "/Users/yezheng/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 487, in _call
    return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
  File "/Users/yezheng/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1823, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/Users/yezheng/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1141, in _filtered_call
    self.captured_inputs)
  File "/Users/yezheng/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1224, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager)
  File "/Users/yezheng/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 511, in call
    ctx=ctx)
  File "/Users/yezheng/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError:  indices[4] = -1 is not in [0, 5)
     [[node model_1/embedding_1/embedding_lookup (defined at /Users/yezheng/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1751) ]] [Op:__inference_train_disc_1782]

Function call stack:
train_disc

With <=118 samples, I can run your code smoothly, with output:

[gtex_gain] x.shape (118, 1732) len(symbols) 118 len(sampl_ids) 118 len(tissues) 118
            SEX    AGE  DTHHRDY
SUBJID
GTEX-1117F    2  60-69      4.0
GTEX-111CU    1  50-59      0.0
GTEX-111FC    1  60-69      1.0
GTEX-111VG    1  60-69      3.0
GTEX-111YS    1  60-69      0.0
Cat covs:  (118, 3)
Vocab sizes:  [2, 5, 32]
2020-07-21 15:49:10.820407: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-07-21 15:49:10.862897: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fa9cb0dd390 executing computations on platform Host. Devices:
2020-07-21 15:49:10.862917: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
[gtex_gain] np.sum(np.sum(x_test)) nan np.sum(np.sum(cat_covs_test)) 452 np.sum(np.sum(num_covs_test)) -37.335236
  0%|                                                         | 0/10000 [00:00<?, ?it/s]Score: 0.451
Epoch 1. Gen loss: 1.71. Sup loss: 1.06. Disc loss: 0.62
  0%|                                              | 10/10000 [00:07<1:07:19,  2.47it/s]Score: 8.228
Epoch 11. Gen loss: 2.81. Sup loss: 0.81. Disc loss: 0.42
  0%|                                                | 20/10000 [00:09<35:23,  4.70it/s]Score: 2.513
Epoch 21. Gen loss: 3.16. Sup loss: 0.74. Disc loss: 0.39
  0%|▏                                               | 30/10000 [00:11<33:22,  4.98it/s]Score: 0.792
Epoch 31. Gen loss: 3.60. Sup loss: 0.99. Disc loss: 0.39
  0%|▏                                               | 40/10000 [00:13<31:56,  5.20it/s]Score: 1.371
Epoch 41. Gen loss: 4.33. Sup loss: 1.14.
yezhengli-Mr9 commented 4 years ago


Resolved. This was due to np.nan values appearing in the new column DTHHRDY of the new GTEx_Analysis_v8_Annotations_SubjectPhenotypesDS.txt (GTEx portal). np.nan needs to be handled better than merely df_metadata[cat_cols] = df_metadata[cat_cols].astype('category') (which leaves np.nan uncoded, producing category code -1).
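For anyone hitting the same indices[...] = -1 embedding error: pandas encodes NaN in a categorical column as code -1, which is an invalid index for an embedding lookup. A sketch of one workaround (my illustration; df, bad_codes, good_codes and the -999 sentinel are arbitrary names/choices, not the repository's code):

```python
import numpy as np
import pandas as pd

# Toy metadata column with a missing DTHHRDY value, as in the new
# GTEx_Analysis_v8_Annotations_SubjectPhenotypesDS.txt
df = pd.DataFrame({'DTHHRDY': [4.0, 0.0, np.nan, 1.0]})

# astype('category') alone maps NaN to code -1, which crashes embedding_lookup
bad_codes = df['DTHHRDY'].astype('category').cat.codes
assert (bad_codes == -1).any()

# Fix: give NaN its own explicit category so every code is a valid embedding index
good_codes = df['DTHHRDY'].fillna(-999).astype('category').cat.codes
assert (good_codes >= 0).all()
```

With the sentinel category added, the vocabulary size passed to the embedding layer simply grows by one.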