google-research / tapas

End-to-end neural table-text understanding models.
Apache License 2.0

incremental training/fine tuning #78

Closed gowthamvenkatsairam closed 4 years ago

gowthamvenkatsairam commented 4 years ago

Hi there, when I try to fine-tune the SQA base model with custom data, the data is created successfully, but while fine-tuning I get the following error.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating: non-resource variables are not supported in the long term
is_built_with_cuda: True
I1016 05:53:26.072952 140684024567680 run_task_main.py:152] is_built_with_cuda: True
WARNING:tensorflow:From tapas/tapas/run_task_main.py:729: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating: Use tf.config.list_physical_devices('GPU') instead.
W1016 05:53:26.073178 140684024567680 deprecation.py:323] From tapas/tapas/run_task_main.py:729: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating: Use tf.config.list_physical_devices('GPU') instead.
2020-10-16 05:53:26.073460: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-10-16 05:53:26.079266: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2300000000 Hz
2020-10-16 05:53:26.079491: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3001480 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-10-16 05:53:26.079530: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-10-16 05:53:26.081782: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-10-16 05:53:26.159756: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-16 05:53:26.160638: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7c1f500 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-10-16 05:53:26.160679: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2020-10-16 05:53:26.160895: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-16 05:53:26.161609: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: pciBusID: 0000:00:04.0 name: Tesla K80 computeCapability: 3.7 coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-10-16 05:53:26.161952: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-10-16 05:53:26.163851: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-10-16 05:53:26.165499: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-10-16 05:53:26.165862: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-10-16 05:53:26.167822: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-10-16 05:53:26.169243: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-10-16 05:53:26.173416: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-10-16 05:53:26.173564: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-16 05:53:26.174436: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-16 05:53:26.175205: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-10-16 05:53:26.175275: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-10-16 05:53:26.176865: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-10-16 05:53:26.176903: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108] 0
2020-10-16 05:53:26.176945: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0: N
2020-10-16 05:53:26.177126: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-16 05:53:26.177883: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-16 05:53:26.178654: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2020-10-16 05:53:26.178721: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/device:GPU:0 with 10691 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
is_gpu_available: True
I1016 05:53:26.179650 140684024567680 run_task_main.py:152] is_gpu_available: True
2020-10-16 05:53:26.180133: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-16 05:53:26.180882: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: pciBusID: 0000:00:04.0 name: Tesla K80 computeCapability: 3.7 coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-10-16 05:53:26.180959: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-10-16 05:53:26.181001: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-10-16 05:53:26.181042: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-10-16 05:53:26.181083: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-10-16 05:53:26.181130: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-10-16 05:53:26.181167: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-10-16 05:53:26.181233: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-10-16 05:53:26.181397: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-16 05:53:26.182184: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-16 05:53:26.182841: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
GPUs: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
I1016 05:53:26.182999 140684024567680 run_task_main.py:152] GPUs: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Training or predicting ...
I1016 05:53:26.183153 140684024567680 run_task_main.py:152] Training or predicting ...
WARNING:tensorflow:From tapas/tapas/run_task_main.py:399: The name tf.estimator.tpu.InputPipelineConfig is deprecated. Please use tf.compat.v1.estimator.tpu.InputPipelineConfig instead.

W1016 05:53:26.231838 140684024567680 module_wrapper.py:138] From tapas/tapas/run_task_main.py:399: The name tf.estimator.tpu.InputPipelineConfig is deprecated. Please use tf.compat.v1.estimator.tpu.InputPipelineConfig instead.

WARNING:tensorflow:From tapas/tapas/run_task_main.py:409: The name tf.estimator.tpu.RunConfig is deprecated. Please use tf.compat.v1.estimator.tpu.RunConfig instead.

W1016 05:53:26.232146 140684024567680 module_wrapper.py:138] From tapas/tapas/run_task_main.py:409: The name tf.estimator.tpu.RunConfig is deprecated. Please use tf.compat.v1.estimator.tpu.RunConfig instead.

WARNING:tensorflow:From tapas/tapas/run_task_main.py:417: The name tf.estimator.tpu.TPUConfig is deprecated. Please use tf.compat.v1.estimator.tpu.TPUConfig instead.

W1016 05:53:26.232349 140684024567680 module_wrapper.py:138] From tapas/tapas/run_task_main.py:417: The name tf.estimator.tpu.TPUConfig is deprecated. Please use tf.compat.v1.estimator.tpu.TPUConfig instead.

WARNING:tensorflow:From tapas/tapas/run_task_main.py:423: The name tf.estimator.tpu.TPUEstimator is deprecated. Please use tf.compat.v1.estimator.tpu.TPUEstimator instead.

W1016 05:53:26.232740 140684024567680 module_wrapper.py:138] From tapas/tapas/run_task_main.py:423: The name tf.estimator.tpu.TPUEstimator is deprecated. Please use tf.compat.v1.estimator.tpu.TPUEstimator instead.

INFO:tensorflow:Using config: {'_model_dir': 'results/sqa/model', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 1000, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } } , '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 4.0, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=1000, num_shards=8, num_cores_per_replica=None, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1), '_cluster': None}
I1016 05:53:26.233565 140684024567680 estimator.py:191] Using config: {'_model_dir': 'results/sqa/model', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 1000, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } } , '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 4.0, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=1000, num_shards=8, num_cores_per_replica=None, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1), '_cluster': None}
INFO:tensorflow:_TPUContext: eval_on_tpu True
I1016 05:53:26.234669 140684024567680 tpu_context.py:216] _TPUContext: eval_on_tpu True
WARNING:tensorflow:eval_on_tpu ignored because use_tpu is False.
W1016 05:53:26.235175 140684024567680 tpu_context.py:218] eval_on_tpu ignored because use_tpu is False.
Training
I1016 05:53:26.235362 140684024567680 run_task_main.py:152] Training
INFO:tensorflow:Skipping training since max_steps has already saved.
I1016 05:53:26.239876 140684024567680 estimator.py:342] Skipping training since max_steps has already saved.
INFO:tensorflow:training_loop marked as finished
I1016 05:53:26.240107 140684024567680 error_handling.py:115] training_loop marked as finished

gowthamvenkatsairam commented 4 years ago

Training is being skipped due to max_steps. What could be the reason?

muelletm commented 4 years ago

I would think this means there is already a checkpoint with max_steps steps in the model directory.

Can you check that max_steps is > 0 and that the model directory is empty?
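For instance, here is a minimal sketch (not part of the repo) for checking whether a checkpoint at or beyond max_steps is already sitting in the model directory; the path is a placeholder and should be set to whatever the estimator reports as _model_dir (results/sqa/model in the log above):

import tensorflow.compat.v1 as tf

# Placeholder: the directory the estimator actually uses as its model dir.
model_dir = "results/sqa/model"

latest = tf.train.latest_checkpoint(model_dir)
if latest is None:
    print("No checkpoint found in", model_dir, "- training should start from step 0.")
else:
    # Estimator checkpoints are named model.ckpt-<global_step>.
    step = int(latest.rsplit("-", 1)[-1])
    print("Latest checkpoint:", latest, "at step", step)
    # If this step is already >= max_steps, the estimator skips training.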

gowthamvenkatsairam commented 4 years ago

Yes, the model directory is empty, and max_steps = tapas_config.num_train_steps. Here are the config file and the command I used to fine-tune.

tapas_config.json

{ "agg_temperature": 1.0, "aggregation_loss_importance": 1.0, "allow_empty_column_selection": false, "answer_loss_cutoff": null, "answer_loss_importance": 1.0, "average_approximation_function": "ratio", "average_logits_per_cell": false, "bert_config": { "attention_probs_dropout_prob": 0.1, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 768, "initializer_range": 0.02, "intermediate_size": 3072, "max_position_embeddings": 1024, "num_attention_heads": 12, "num_hidden_layers": 12, "softmax_temperature": 1.0, "type_vocab_size": [ 3, 256, 256, 2, 256, 256, 10 ], "vocab_size": 30522 }, "cell_select_pref": null, "disable_per_token_loss": false, "disable_position_embeddings": false, "disabled_features": [], "grad_clipping": null, "huber_loss_delta": null, "init_cell_selection_weights_to_zero": false, "init_checkpoint": "tapas_sqa_base/model.ckpt", "learning_rate": 1.25e-05, "max_num_columns": 32, "max_num_rows": 64, "num_aggregation_labels": 0, "num_classification_labels": 0, "num_train_steps": 200000, "num_warmup_steps": 2000, "positive_weight": 10.0, "reset_position_index_per_cell": false, "select_one_column": true, "span_prediction": "none", "temperature": 1.0, "use_answer_as_supervision": null, "use_gumbel_for_agg": false, "use_gumbel_for_cells": false, "use_normalized_answer_loss": false, "use_tpu": false }

fine-tuning command

! python tapas/tapas/run_task_main.py \
  --task="SQA" \
  --output_dir="results" \
  --model_dir="ckpoints" \
  --init_checkpoint="tapas_sqa_base/model.ckpt" \
  --bert_config_file="tapas_sqa_base/bert_config.json" \
  --mode="train"
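As a quick sanity check that max_steps really is non-zero, one could also print num_train_steps straight from the config file shared above (a minimal sketch, assuming the file is named tapas_config.json as in this comment):

import json

# Load the TAPAS config shared above and print the value used for max_steps.
with open("tapas_config.json") as f:
    config = json.load(f)

print("num_train_steps:", config["num_train_steps"])  # expected: 200000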

ghost commented 4 years ago

Sorry for the late reply, but this is hard to debug without more details.

Can you share the file structure of the results and ckpoints folders (e.g. the output of "! find results")?

gowthamvenkatsairam commented 4 years ago

I found the solution for that, thanks.

ghost commented 4 years ago

Could you update the issue with the solution? In case someone else runs into a similar issue ...