melobio / LOGO

There is non-GPU devices in `tf.distribute.Strategy`, not using nccl allreduce #1

Open Shenying71 opened 2 years ago

Shenying71 commented 2 years ago

The following have been reloaded with a version change: 1) gcc/7.2.0 => gcc/7.1.0

Using TensorFlow backend.
WARNING:tensorflow:There is non-GPU devices in tf.distribute.Strategy, not using nccl allreduce.
WARNING:tensorflow:From ../bgi/bert4keras/models.py:179: The name tf.keras.initializers.TruncatedNormal is deprecated. Please use tf.compat.v1.keras.initializers.TruncatedNormal instead.

WARNING:tensorflow:From /risapps/rhel7/python/3.7.3/lib/python3.7/site-packages/tensorflow_core/python/keras/initializers.py:94: calling TruncatedNormal.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
WARNING:tensorflow:From /risapps/rhel7/python/3.7.3/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
WARNING:tensorflow:From /risapps/rhel7/python/3.7.3/lib/python3.7/site-packages/tensorflow_core/python/ops/math_grad.py:1424: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Traceback (most recent call last):
  File "02_train_gene_transformer_lm_hg_bert4keras_tfrecord.py", line 339, in <module>
    verbose=1
  File "/risapps/rhel7/python/3.7.3/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 727, in fit
    use_multiprocessing=use_multiprocessing)
  File "/risapps/rhel7/python/3.7.3/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_distributed.py", line 685, in fit
    steps_name='steps_per_epoch')
  File "/risapps/rhel7/python/3.7.3/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 300, in model_iteration
    batch_outs = f(actual_inputs)
  File "/risapps/rhel7/python/3.7.3/lib/python3.7/site-packages/tensorflow_core/python/keras/backend.py", line 3476, in __call__
    run_metadata=self.run_metadata)
  File "/risapps/rhel7/python/3.7.3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1472, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node __inference_Dataset_map_parse_function_2564}} Key: sequence. Can't parse serialized Example.
  [[{{node ParseSingleExample/ParseSingleExample}}]]
  [[MultiDeviceIteratorGetNextFromShard]]
  [[RemoteCall]]
  [[IteratorGetNext]]

Licko0909 commented 2 years ago

Hi Shenying71. I suggest that you: a) check the format of the input file (TFRecord) to see whether it contains the "sequence" key, because the error says "Key: sequence. Can't parse serialized Example."; b) check whether the environments are consistent (tensorflow-gpu == 2.0) and whether the devices support GPUs.

Shenying71 commented 2 years ago

Hi Licko0909, Thanks so much for your reply! I don't think the issue I reported before was due to the two reasons you suggested. Are there any other reasons?

a) Actually I printed out the TFRecord and it contains both "sequence" and "masked_sequence".

import tensorflow as tf

filenames = ["hg19_seq_gram_5_stride_1_slice_200000_0_1_2_3_4.tfrecord"]
raw_dataset = tf.data.TFRecordDataset(filenames)

# Decode the first 100 records and print them as text-format protobufs.
for raw_record in raw_dataset.take(100):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    print(example)

features {
  feature {
    key: "masked_sequence"
    value {
      int64_list {
        value: 3107
        value: 2
        value: 2082
        value: 2127
        ...
        value: 2
        value: 2634
        value: 1763
        value: 2383
        value: 2631
        value: 1809
        value: 1773
        value: 1320
        value: 2784
        value: 2892
        value: 1262
        value: 1259
        value: 1519
        value: 2289
        value: 3130
      }
    }
  }
  feature {
    key: "sequence"
    value {
      int64_list {
        value: 3107
        value: 1269
        value: 2082
        ...
        value: 3030
        value: 3063
        value: 2634
        value: 1763
        value: 2383
        value: 2631
        value: 1809
        value: 1773
        value: 1320
        value: 2784
        value: 2892
        value: 1262
        value: 1259
        value: 1519
        value: 2289
        value: 3130
      }
    }
  }
}
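
The dump above shows that both keys are present, so a missing key looks unlikely. A commonly reported cause of "Can't parse serialized Example" is a `tf.io.FixedLenFeature` whose declared shape or dtype does not match what is actually stored in the record. Whether that applies here depends on how the training script's parse function is written, which this thread does not show. The sketch below tallies how many values each key holds per record, so the counts can be compared with what the parser declares. The helper names `tally_lengths` and `read_int64_features` are made up for illustration, and the reader assumes all features are int64 lists, as in the dump.

```python
from collections import Counter


def tally_lengths(records):
    """Count how often each (key, number of values) pair occurs.

    `records` is an iterable of dicts mapping feature key -> list of values.
    """
    counts = Counter()
    for rec in records:
        for key, values in rec.items():
            counts[(key, len(values))] += 1
    return counts


def read_int64_features(path, limit=100):
    """Yield {key: [int, ...]} for the first `limit` records of a TFRecord file.

    Assumes eager execution (the TF 2.x default) and int64_list features.
    """
    import tensorflow as tf  # imported lazily so tally_lengths works without TF

    for raw in tf.data.TFRecordDataset([path]).take(limit):
        example = tf.train.Example()
        example.ParseFromString(raw.numpy())
        yield {
            key: list(feat.int64_list.value)
            for key, feat in example.features.feature.items()
        }


# Toy in-memory records (invented values) showing the shape of the result.
toy = [
    {"sequence": [3107, 1269, 2082], "masked_sequence": [3107, 2, 2082]},
    {"sequence": [3107, 1269], "masked_sequence": [3107, 2]},
]
toy_counts = tally_lengths(toy)
for (key, n_values), hits in sorted(toy_counts.items()):
    print(key, n_values, hits)

# With TensorFlow installed and the file on disk, the same tally would be:
#   counts = tally_lengths(read_int64_features(
#       "hg19_seq_gram_5_stride_1_slice_200000_0_1_2_3_4.tfrecord"))
```

If the count for `sequence` differs from the length the parse function declares, or varies from record to record, that would be consistent with the error. This is a hypothesis to check, not a confirmed diagnosis.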

b) Our Singularity container for TensorFlow does have version 2.0.0:

  $ module load tensorflow
  Tensorflow is set up as a singularity container.
  Use the variable $tf to run our python script.
  For example:
  $tf /full/path/to/your/pythonsrcipt
  at this moment only your /rsrch3/home directory is mounted.

  [emsisson@ldragon1 ~]$ module list

  Currently Loaded Modules:
    1) cuda11.1/toolkit/11.1.1   3) singularity/3.7.0
    2) gcc/7.1.0                 4) tensorflow/20191220-gpu

Licko0909 commented 2 years ago

First

System

Ubuntu 18.04
gcc 7.5.0

Conda environment

cudatoolkit            10.0.130   0                    defaults
cudnn                  7.6.5      cuda10.0_0           defaults
...
keras                  2.3.1      0                    defaults
keras-applications     1.0.8      py_1                 defaults
keras-base             2.3.1      py36_0               defaults
keras-preprocessing    1.1.2      pyhd3eb1b0_0         defaults
pandas                 1.1.5      py36ha9443f7_0       defaults
python                 3.6.9      h265db76_0           defaults
...
tensorflow             2.0.0      gpu_py36h6b29c10_0   defaults
tensorflow-base        2.0.0      gpu_py36h0ec5d1f_0   defaults
tensorflow-estimator   2.0.0      pyh2649769_0         defaults
tensorflow-gpu         2.0.0      h0d30ee6_0           defaults

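Second

1) Disable eager execution in the backend by making the following change:
   file: LOGO/bgi/bert4keras/backend.py
   change: [image: https://user-images.githubusercontent.com/27897166/136648306-a237dae1-b624-4cd5-a325-6121a2e5c1c8.png]

2) Make the following change:
   file: LOGO/01_Pre-training_Model/02_run_genebert_bert4keras_tfrecord_train.sh
   change: stride=1 -> stride=5

The following is an example of a run: [image: https://user-images.githubusercontent.com/27897166/136648587-5dfa2890-aea7-494b-8aee-296485ce1cb7.png]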

Shenying71 commented 2 years ago

Great, thanks so much, Licko0909! The program works now. As far as I know, most DNA transformer projects use stride=1 for k-mers (k = 3, 4, 5, 6, 7). If we use stride=5 for 5-mers (ngram=5), might we capture less DNA sequence information than with stride=1? Can you fix the code so that it also works with stride=1? Shenying
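
For readers weighing the stride question: with window size k and stride s, a sequence of L bases yields floor((L - k) / s) + 1 tokens. The sketch below makes that concrete. It is not LOGO's tokenizer. The function `kmer_tokens` and the 20-base sequence are invented for illustration.

```python
def kmer_tokens(seq, k=5, stride=1):
    """Split `seq` into k-mers, advancing `stride` bases between windows."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


seq = "ACGTACGTACGTACGTACGT"  # 20 bases, made up for illustration

overlapping = kmer_tokens(seq, k=5, stride=1)  # neighbours share 4 bases
disjoint = kmer_tokens(seq, k=5, stride=5)     # neighbours share no bases

print(len(overlapping), overlapping[:3])  # 16 ['ACGTA', 'CGTAC', 'GTACG']
print(len(disjoint), disjoint)            # 4 ['ACGTA', 'CGTAC', 'GTACG', 'TACGT']
```

With stride=5 and k=5 every base still appears in exactly one token, but only in a single frame, and a long sequence produces roughly five times fewer tokens than with stride=1. The number of tokens per record therefore changes with the stride, which is worth keeping in mind if any downstream step expects a fixed sequence length.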

Shenying71 commented 2 years ago

Hello

Could you please address the following questions (in red text)?

In your LOGO_Variant_Prioritization file:

Can you provide code to streamline the procedure? 1. Generate the .npz file from the CADD VCF sequence data; 2. generate the .tfrecord files and models; 3. run classification based on the models.

  1. Data download, see Data_URL.txt, Save to /data/CADD/

  2. Generate vcf sequence (This step only generated .tfrecord files and no .npz files. Were those files fed directly into step 4?)

  3. Generate tfrecord, about 30 G (The program has no .npz file to use as input)

  4. Perform training and prediction (Where does the .hdf5 file come from? A pre-trained model, the 01 pre-training model, or step 3 above?)
    • python 02_cadd_classification_transformer_tfrecord.py

Shenying71 commented 2 years ago

Hello,

When I ran "02_cadd_classification_transformer_tfrecord.py" on the CADD data with the inputs below, I got "Can't parse serialized Example" errors. Can you help?

train_slice_files = [
    '/rsrch3/home/surgonc_rsrch/sfang/LOGO/LOGO-master/05_LOGO_Variant_Prioritization/CADD/GRCh37/SNVS1/humanDerived_SNVs_gram_5_stride_1_slice_100000_100000_train.tfrecord',
    '/rsrch3/home/surgonc_rsrch/sfang/LOGO/LOGO-master/05_LOGO_Variant_Prioritization/CADD/GRCh37/SNVS1/humanDerived_SNVs_gram_5_stride_1_slice_200000_200000_train.tfrecord',
]
valid_slice_files = [
    '/rsrch3/home/surgonc_rsrch/sfang/LOGO/LOGO-master/05_LOGO_Variant_Prioritization/CADD/GRCh37/SNVS1/humanDerived_SNVs_gram_5_stride_1_slice_100000_100000_valid.tfrecord',
]
test_slice_files = [
    '/rsrch3/home/surgonc_rsrch/sfang/LOGO/LOGO-master/05_LOGO_Variant_Prioritization/CADD/GRCh37/SNVS1/humanDerived_SNVs_gram_5_stride_1_slice_100000_100000_test.tfrecord',
]

train_total_size = 180000
valid_total_size = 5000
test_total_size = 5000

I ran into the following errors:

Train on 175 steps, validate on 2 steps
Epoch 1/100
2021-10-25 21:11:09.434153: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: alt_seq. Can't parse serialized Example.
2021-10-25 21:11:09.434270: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: alt_seq. Can't parse serialized Example.
Traceback (most recent call last):
  File "02_cadd_classification_transformer_tfrecord.py", line 278, in <module>
    verbose=1)
  File "/risapps/rhel7/python/3.7.3/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 727, in fit
2021-10-25 21:11:09.434995: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: alt_type. Can't parse serialized Example.
2021-10-25 21:11:09.435078: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: seq. Can't parse serialized Example.
2021-10-25 21:11:09.435134: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: alt_seq. Can't parse serialized Example.
2021-10-25 21:11:09.435209: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: alt_type. Can't parse serialized Example.
2021-10-25 21:11:09.435261: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: alt_seq. Can't parse serialized Example.
    use_multiprocessing=use_multiprocessing)
  File "/risapps/rhel7/python/3.7.3/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 675, in fit
2021-10-25 21:11:09.435613: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: alt_type. Can't parse serialized Example.
2021-10-25 21:11:09.435697: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: seq. Can't parse serialized Example.
    steps_name='steps_per_epoch')
  File "/risapps/rhel7/python/3.7.3/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 300, in model_iteration
2021-10-25 21:11:09.436015: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: alt_seq. Can't parse serialized Example.
    batch_outs = f(actual_inputs)
  File "/risapps/rhel7/python/3.7.3/lib/python3.7/site-packages/tensorflow_core/python/keras/backend.py", line 3476, in __call__
2021-10-25 21:11:09.436259: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: alt_seq. Can't parse serialized Example.
2021-10-25 21:11:09.436336: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: alt_seq. Can't parse serialized Example.
    run_metadata=self.run_metadata)
  File "/risapps/rhel7/python/3.7.3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1472, in __call__
2021-10-25 21:11:09.437045: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: alt_type. Can't parse serialized Example.
2021-10-25 21:11:09.437105: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: alt_type. Can't parse serialized Example.
2021-10-25 21:11:09.437173: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: alt_seq. Can't parse serialized Example.
2021-10-25 21:11:09.437225: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: seq. Can't parse serialized Example.
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node __inference_Dataset_map_single_example_parser_1272}} Key: alt_seq. Can't parse serialized Example.
  [[{{node ParseSingleExample/ParseSingleExample}}]]
  [[IteratorGetNext]]
2021-10-25 21:11:09.452867: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: seq. Can't parse serialized Example.
2021-10-25 21:11:09.452934: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at example_parsing_ops.cc:240 : Invalid argument: Key: alt_type. Can't parse serialized Example.
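
All three keys named in the warnings (seq, alt_seq, alt_type) fail, which hints that the parser's declared feature shapes may disagree with the files as a whole rather than with one corrupt record. That is a guess, since the parse function in 02_cadd_classification_transformer_tfrecord.py is not shown in this thread. A minimal sketch of the comparison follows. The function `find_shape_mismatches` and every number in the example are hypothetical. The real declared lengths would come from the script's feature spec, and the observed ones from decoding a few records with `tf.train.Example`, as done earlier in this thread.

```python
def find_shape_mismatches(declared, observed):
    """Return {key: (declared_len, sorted observed lens)} for disagreeing keys.

    declared: {key: number of values the parser asks for}
    observed: {key: set of value counts actually seen in the records}
    A key that never appears in the records is reported with an empty list.
    """
    problems = {}
    for key, want in declared.items():
        seen = observed.get(key, set())
        if seen != {want}:
            problems[key] = (want, sorted(seen))
    return problems


# Hypothetical numbers, for illustration only.
declared = {"seq": 1000, "alt_seq": 1000, "alt_type": 1}
observed = {"seq": {996}, "alt_seq": {996}, "alt_type": {1}}

problems = find_shape_mismatches(declared, observed)
for key, (want, seen) in sorted(problems.items()):
    print(f"{key}: parser expects {want} values, records hold {seen}")
```

An empty result would rule out a length mismatch and point toward other causes, such as a dtype mismatch or files written by a different version of the preprocessing step.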
