yezhengli-Mr9 closed this issue 4 years ago
Hi Yezheng,
a) That's exactly right: `x` is an m x n matrix containing the expression values, where m is the number of samples and n the number of genes. `symbols` are the n gene names. `sampl_ids` are the m sample identifiers. `tissues` is an array of m values that indicate which tissue each sample belongs to. The second return that you mention just loads the expression data, e.g. `x` (no metadata).
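The shapes described above can be sketched as follows (a minimal illustration; the variable names follow the discussion, and the values are made up, not taken from `load_gtex`):

```python
import numpy as np

# Minimal sketch of the shapes described above. Names follow the
# discussion; the values are invented for illustration only.
m, n = 4, 3                                                # m samples, n genes
x = np.random.rand(m, n)                                   # expression matrix, m x n
symbols = np.array(['GENE_A', 'GENE_B', 'GENE_C'])         # n gene names
sampl_ids = np.array(['S1', 'S2', 'S3', 'S4'])             # m sample identifiers
tissues = np.array(['Lung', 'Liver', 'Lung', 'Blood'])     # tissue of each sample

assert x.shape == (m, n)
assert len(symbols) == n
assert len(sampl_ids) == len(tissues) == m
```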
b) Correct. We initially processed the data separately for each tissue, and the function that you mention merges the tissue-specific data into a single dataframe. This function might not be needed in many situations.
Regarding samples of the dataset, I am afraid I cannot upload them to the GitHub repo. However, GTEx is publicly available and all the data can be downloaded from the GTEx portal.
Best wishes, Ramon
OK, it seems you directly use the raw `GTEx_Analysis_v8_eQTL_expression_matrices.tar` -- let me try getting it directly from the raw data. I thought there might be some preprocessing: I saved a CSV for each sample/individual, a matrix of n_tissues by n_genes (I thought you also had such a preprocessing step, referring to your `compute_r2_score(x_test_extended, x_gen_extended, mask, nb_tissues=49)` in gtex_gain_analysis.ipynb).
Thanks,
Hi Yezheng,
No, we don't use the raw values directly. We processed the data as follows:
a. Genes were selected based on expression thresholds of >=0.1 TPM in >=20% of samples and >=6 reads (unnormalized) in >=20% of samples.
b. Read counts were normalized between samples using TMM (Robinson, M. D. & Oshlack, A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology 11, R25 (2010)).
c. Expression values for each gene were inverse normal transformed.
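Step c, the rank-based inverse normal transform, can be sketched as follows (this is one common variant; the exact rank-offset convention used for the paper is an assumption, not confirmed in the thread):

```python
import numpy as np
from scipy.stats import norm, rankdata

def inverse_normal_transform(values):
    """Rank-based inverse normal transform (one common variant;
    the exact offset convention used in the paper is assumed)."""
    ranks = rankdata(values)                   # ranks 1..N
    quantiles = (ranks - 0.5) / len(values)    # map ranks into (0, 1)
    return norm.ppf(quantiles)                 # standard normal quantiles

expr = np.array([0.2, 5.1, 1.3, 0.0, 9.9])
transformed = inverse_normal_transform(expr)
# The transform preserves the ordering of the values while reshaping
# their distribution toward a standard normal.
assert np.argmax(transformed) == np.argmax(expr)
```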
Best wishes, Ramon
Got it, thanks! Let me try. On my side, I am actually not in charge of data pre-processing, but they do mention there is a difference between the "normalized data" `GTEx_Analysis_v8_eQTL_expression_matrices.tar` and the "rawer" count data. Let me have another try.
Thanks
Hi Ramon, have you, by any chance, set "missing components" to be zero?
Hi Yezheng,
Yes, missing components are set to zero.
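A minimal sketch of how this is typically done in GAIN-style imputation (variable names are assumed, not taken from the repo): missing entries are zeroed, and a binary mask records which entries were observed.

```python
import numpy as np

# Hypothetical sketch of GAIN-style input preparation: missing entries
# are set to zero and a binary mask marks the observed positions.
x = np.array([[1.5, np.nan],
              [np.nan, 2.0]])
mask = (~np.isnan(x)).astype(np.float32)   # 1 = observed, 0 = missing
x_input = np.nan_to_num(x, nan=0.0)        # missing components set to zero

assert np.array_equal(mask, [[1.0, 0.0], [0.0, 1.0]])
assert x_input[0, 1] == 0.0
```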
> Hi Yezheng, Yes, missing components are set to zero. …

Is there any max number of samples one can plug into `load_gtex(...)` as well as `gtex_main.py`?
(1) By "number of samples", I mean `len(sampl_id) == len(tissues) == x.shape[0] == len(symbols)`. (I saw the comment `# nrows=1000` in utils.py.) When I filter out completely-NaN rows (rather than setting them to zero), I have at most 14,442 rows.
(2) I am not an expert in TensorFlow; in PyTorch, with a much smaller neural network, I can plug in 14,442 rows (actually, I can tensorize all 14,442 rows and feed them into the neural network in PyTorch) without any difficulty...
(3) I can plug in at most 118 rows for 1,732 "protein-coding" genes in chromosome 1 only (rather than 12,557 unique human genes). Otherwise I see errors like the following (no intention to trouble you with bugs, just for the completeness of the question itself):
```
[gtex_gain] x.shape (119, 1732) len(symbols) 119 len(sampl_ids) 119 len(tissues) 119
            SEX    AGE  DTHHRDY
SUBJID
GTEX-1117F    2  60-69      4.0
GTEX-111CU    1  50-59      0.0
GTEX-111FC    1  60-69      1.0
GTEX-111VG    1  60-69      3.0
GTEX-111YS    1  60-69      0.0
Cat covs: (119, 3)
Vocab sizes: [2, 5, 32]
2020-07-21 15:48:42.285272: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-07-21 15:48:42.350561: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fb464ed1db0 executing computations on platform Host. Devices:
2020-07-21 15:48:42.350583: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Host, Default Version
[gtex_gain] np.sum(np.sum(x_test)) nan np.sum(np.sum(cat_covs_test)) 439 np.sum(np.sum(num_covs_test)) -37.57211
0%| | 0/10000 [00:00<?, ?it/s]2020-07-21 15:48:45.535110: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Invalid argument: indices[4] = -1 is not in [0, 5)
[[{{node model_1/embedding_1/embedding_lookup}}]]
0%| | 0/10000 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "gtex_gain.py", line 492, in <module>
    save_fn=save_fn)
  File "gtex_gain.py", line 287, in train
    disc_loss = train_disc(x, z, cc, nc, mask, hint, b, gen, disc, disc_opt)
  File "/Users/yezheng/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 457, in __call__
    result = self._call(*args, **kwds)
  File "/Users/yezheng/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 487, in _call
    return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
  File "/Users/yezheng/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1823, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/Users/yezheng/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1141, in _filtered_call
    self.captured_inputs)
  File "/Users/yezheng/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1224, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager)
  File "/Users/yezheng/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 511, in call
    ctx=ctx)
  File "/Users/yezheng/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[4] = -1 is not in [0, 5)
[[node model_1/embedding_1/embedding_lookup (defined at /Users/yezheng/miniconda3/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1751) ]] [Op:__inference_train_disc_1782]
Function call stack:
train_disc
```
With <=118 samples, I can run your code smoothly, with output:
```
[gtex_gain] x.shape (118, 1732) len(symbols) 118 len(sampl_ids) 118 len(tissues) 118
            SEX    AGE  DTHHRDY
SUBJID
GTEX-1117F    2  60-69      4.0
GTEX-111CU    1  50-59      0.0
GTEX-111FC    1  60-69      1.0
GTEX-111VG    1  60-69      3.0
GTEX-111YS    1  60-69      0.0
Cat covs: (118, 3)
Vocab sizes: [2, 5, 32]
2020-07-21 15:49:10.820407: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-07-21 15:49:10.862897: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fa9cb0dd390 executing computations on platform Host. Devices:
2020-07-21 15:49:10.862917: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Host, Default Version
[gtex_gain] np.sum(np.sum(x_test)) nan np.sum(np.sum(cat_covs_test)) 452 np.sum(np.sum(num_covs_test)) -37.335236
0%| | 0/10000 [00:00<?, ?it/s]Score: 0.451
Epoch 1. Gen loss: 1.71. Sup loss: 1.06. Disc loss: 0.62
0%| | 10/10000 [00:07<1:07:19, 2.47it/s]Score: 8.228
Epoch 11. Gen loss: 2.81. Sup loss: 0.81. Disc loss: 0.42
0%| | 20/10000 [00:09<35:23, 4.70it/s]Score: 2.513
Epoch 21. Gen loss: 3.16. Sup loss: 0.74. Disc loss: 0.39
0%|▏ | 30/10000 [00:11<33:22, 4.98it/s]Score: 0.792
Epoch 31. Gen loss: 3.60. Sup loss: 0.99. Disc loss: 0.39
0%|▏ | 40/10000 [00:13<31:56, 5.20it/s]Score: 1.371
Epoch 41. Gen loss: 4.33. Sup loss: 1.14.
```
Resolved. This was due to `np.nan` values appearing in the new column `DTHHRDY` of the new `GTEx_Analysis_v8_Annotations_SubjectPhenotypesDS.txt` (GTEx portal). `np.nan` needs to be handled explicitly, rather than merely running `df_metadata[cat_cols] = df_metadata[cat_cols].astype('category')` (which ignores `np.nan`, encoding it as category code -1).
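For illustration, a hedged sketch of this failure mode and one possible fix (the column name comes from the thread; the fill value is an assumption): pandas encodes `np.nan` as category code -1, which is exactly the invalid index seen in the `embedding_lookup` error above.

```python
import numpy as np
import pandas as pd

# pandas encodes NaN as category code -1, which is an invalid index for
# an embedding layer expecting values in [0, vocab_size).
df_metadata = pd.DataFrame({'DTHHRDY': [4.0, 0.0, np.nan, 1.0]})

codes_naive = df_metadata['DTHHRDY'].astype('category').cat.codes.to_numpy()
# codes_naive contains -1 for the NaN row

# One possible fix: make NaN an explicit category before encoding
# (note this grows the vocabulary size by one).
filled = df_metadata['DTHHRDY'].fillna(-1.0).astype('category')
codes_fixed = filled.cat.codes.to_numpy()
# codes_fixed contains only valid, non-negative indices
```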
Hi Ramon,
(a) `x, symbols, sampl_ids, tissues = load_gtex(corrected=CORRECTED)` in gtex_gain.py, as well as `return pd.read_csv(file, index_col=0)  # nrows=1000` in utils.py, seems to read metadata as well?
(b) I think, from the several `df = pd.read_csv('{}/{}{}.csv'.format(data_dir, prefix, tissue))` calls in utils.py, you have a matrix for each tissue (like a matrix of genes and sample_ids), correct?
I wish I could get samples of the dataset (only a small sample would be fine, in case the CSVs are large...) in order to run through `python gtex_gain.py`...
Best, Yezheng