google-research / text-to-text-transfer-transformer

Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"
https://arxiv.org/abs/1910.10683
Apache License 2.0

Example script for WT5 hangs #205

Closed: poset closed this issue 4 years ago

poset commented 4 years ago

The example script for WT5 is hanging for me, and I'm not sure why. It connects to my TPU and downloads/extracts a dataset, but then appears to do nothing after the extraction finishes (the TPU shows 0.1% usage, and there are no further prints).

I set a 2x2 TPU topology on a v3-8, with the batch size at 2048 tokens.

The script is shown first, then the console output, then the output after hitting Ctrl+C. If there's a better way to format this, please let me know. Note that T5 works for me with the default config in the T5 README.

Script (based on the script provided in the WT5 README):

```sh
export PROJECT=my-project-name
export ZONE=us-central1-b
export BUCKET=gs://my-bucket-name
export TPU=node-2

ctpu up --name=$TPU --project=$PROJECT --zone=$ZONE --tpu-size=v3-8 --tpu-only --noconf

TASK=movie_rationales_explanations_take1000_v010
PRETRAINED_DIR=gs://t5-data/pretrained_models/small
PRETRAINED_STEPS=1000000
FINETUNE_STEPS=20000
MODEL_DIR="${BUCKET}/${TASK}"

t5_mesh_transformer \
  --tpu="${TPU}" \
  --gcp_project="${PROJECT}" \
  --tpu_zone="${ZONE}" \
  --model_dir="${MODEL_DIR}" \
  --gin_file="dataset.gin" \
  --gin_file="${PRETRAINED_DIR}/operative_config.gin" \
  --gin_file="wt5/gin/sequence_lengths/movie_rationales_v010.gin" \
  --gin_param="utils.tpu_mesh_shape.tpu_topology = '2x4'" \
  --gin_param="MIXTURE_NAME = '${TASK}'" \
  --gin_param="mesh_train_dataset_fn.use_cached=False" \
  --gin_param="utils.run.save_checkpoints_steps=100" \
  --gin_param="utils.run.batch_size=('tokens_per_batch', 16384)" \
  --gin_param="utils.run.train_steps=$((PRETRAINED_STEPS+FINETUNE_STEPS))" \
  --gin_param="utils.run.init_checkpoint='${PRETRAINED_DIR}/model.ckpt-${PRETRAINED_STEPS}'" \
  --gin_param="utils.run.learning_rate_schedule=@learning_rate_schedules.constant_learning_rate" \
  --gin_param="constant_learning_rate.learning_rate=1e-3" \
  --t5_tfds_data_dir="${BUCKET}/t5-tfds" \
  --module_import="wt5.tasks" \
  --module_import="wt5.mixtures" \
  --gin_location_prefix="wt5/wt5/gin/"
```

Console output:

(Note that it shows no sign of activity at the end and doesn't return to bash.)

```
:~/google-research/wt5$ ./testingWT5scriptGCP_TPU 2020/05/03 02:05:37 TPU already running. Operation success; not ssh-ing to GCE VM due to --tpu-only flag. WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow_core/python/compat/v2_compat.py:88: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version. Instructions for updating: non-resource variables are not supported in the long term INFO:tensorflow:model_type=bitransformer I0503 02:05:43.398267 140562852878144 utils.py:1685] model_type=bitransformer INFO:tensorflow:mode=train I0503 02:05:43.398564 140562852878144 utils.py:1686] mode=train INFO:tensorflow:sequence_length={'inputs': 2048, 'targets': 512} I0503 02:05:43.398667 140562852878144 utils.py:1687] sequence_length={'inputs': 2048, 'targets': 512} INFO:tensorflow:batch_size=8 I0503 02:05:43.398747 140562852878144 utils.py:1688] batch_size=8 INFO:tensorflow:train_steps=1020000 I0503 02:05:43.398822 140562852878144 utils.py:1689] train_steps=1020000 INFO:tensorflow:mesh_shape=Shape[batch=16] I0503 02:05:43.398901 140562852878144 utils.py:1690] mesh_shape=Shape[batch=16] INFO:tensorflow:layout_rules=ensemble:ensemble,batch:batch,d_ff:model,heads:model,vocab:model,experts:batch I0503 02:05:43.398976 140562852878144 utils.py:1691] layout_rules=ensemble:ensemble,batch:batch,d_ff:model,heads:model,vocab:model,experts:batch INFO:tensorflow:Building TPUConfig with tpu_job_name=None I0503 02:05:43.403095 140562852878144 utils.py:1706] Building TPUConfig with tpu_job_name=None I0503 02:05:43.406050 140562852878144 discovery.py:280] URL being requested: GET https://www.googleapis.com/discovery/v1/apis/tpu/v1/rest I0503 02:05:43.441212 140562852878144 discovery.py:911] URL being requested: GET https://tpu.googleapis.com/v1/projects/t5-ignition/locations/us-central1-b/nodes/node-2?alt=json I0503 02:05:43.441443 140562852878144 transport.py:151] Attempting refresh to obtain initial access_token I0503 02:05:43.503969 140562852878144 discovery.py:280] URL being requested: GET https://www.googleapis.com/discovery/v1/apis/tpu/v1/rest I0503 02:05:43.534800 140562852878144 discovery.py:911] URL being requested: GET https://tpu.googleapis.com/v1/projects/t5-ignition/locations/us-central1-b/nodes/node-2?alt=json I0503 02:05:43.535020 140562852878144 transport.py:151] Attempting refresh to obtain initial access_token INFO:tensorflow:Using config: {'_model_dir': 'gs://t5-ignition-bucket/movie_rationales_explanations_take1000_v010', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true cluster_def { job { name: "worker" tasks { key: 0 value: "10.9.81.234:8470" } } } isolate_session_state: true , '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({'worker': ['10.9.81.234:8470']}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://10.9.81.234:8470', '_evaluation_master': 'grpc://10.9.81.234:8470', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=None, num_cores_per_replica=1, 
per_host_input_for_training=4, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1), '_cluster': <tensorflow.python.distribute.cluster_resolver.tpu_cluster_resolver.TPUClusterResolver object at 0x7fd6bce7b908>} I0503 02:05:43.584227 140562852878144 estimator.py:216] Using config: {'_model_dir': 'gs://t5-ignition-bucket/movie_rationales_explanations_take1000_v010', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true cluster_def { job { name: "worker" tasks { key: 0 value: "10.9.81.234:8470" } } } isolate_session_state: true , '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({'worker': ['10.9.81.234:8470']}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://10.9.81.234:8470', '_evaluation_master': 'grpc://10.9.81.234:8470', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=None, num_cores_per_replica=1, per_host_input_for_training=4, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1), '_cluster': <tensorflow.python.distribute.cluster_resolver.tpu_cluster_resolver.TPUClusterResolver object at 0x7fd6bce7b908>} INFO:tensorflow:_TPUContext: eval_on_tpu True I0503 02:05:43.584667 140562852878144 tpu_context.py:221] _TPUContext: eval_on_tpu True INFO:tensorflow:Querying Tensorflow master (grpc://10.9.81.234:8470) for TPU system metadata. I0503 02:05:43.709766 140562852878144 tpu_system_metadata.py:72] Querying Tensorflow master (grpc://10.9.81.234:8470) for TPU system metadata. 2020-05-03 02:05:43.711139: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:373] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created. INFO:tensorflow:Initializing TPU system (master: grpc://10.9.81.234:8470) to fetch topology for model parallelism. This might take a while. I0503 02:05:43.716908 140562852878144 tpu_system_metadata.py:157] Initializing TPU system (master: grpc://10.9.81.234:8470) to fetch topology for model parallelism. This might take a while. 
INFO:tensorflow:Found TPU system: I0503 02:05:49.398839 140562852878144 tpu_system_metadata.py:140] Found TPU system: INFO:tensorflow:*** Num TPU Cores: 8 I0503 02:05:49.399126 140562852878144 tpu_system_metadata.py:141] *** Num TPU Cores: 8 INFO:tensorflow:*** Num TPU Workers: 1 I0503 02:05:49.399248 140562852878144 tpu_system_metadata.py:142] *** Num TPU Workers: 1 INFO:tensorflow:*** Num TPU Cores Per Worker: 8 I0503 02:05:49.399342 140562852878144 tpu_system_metadata.py:144] *** Num TPU Cores Per Worker: 8 INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, -1, 543705933573447767) I0503 02:05:49.399434 140562852878144 tpu_system_metadata.py:146] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, -1, 543705933573447767) INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 9459511554632651815) I0503 02:05:49.399722 140562852878144 tpu_system_metadata.py:146] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 9459511554632651815) INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 10405845867630879762) I0503 02:05:49.399811 140562852878144 tpu_system_metadata.py:146] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 10405845867630879762) INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, 1519635725069975015) I0503 02:05:49.399931 140562852878144 tpu_system_metadata.py:146] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, 1519635725069975015) INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 17179869184, 16532467575488418548) I0503 02:05:49.400012 140562852878144 tpu_system_metadata.py:146] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 17179869184, 16532467575488418548) INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 17179869184, 9988226954971163265) I0503 02:05:49.400092 140562852878144 tpu_system_metadata.py:146] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 17179869184, 9988226954971163265) INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 17179869184, 2095320529623114246) I0503 02:05:49.400168 140562852878144 tpu_system_metadata.py:146] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 17179869184, 2095320529623114246) INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 17179869184, 5421484898729298314) I0503 02:05:49.400383 140562852878144 tpu_system_metadata.py:146] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 17179869184, 5421484898729298314) INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 17179869184, 10057256119691496601) I0503 02:05:49.400545 140562852878144 tpu_system_metadata.py:146] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 17179869184, 10057256119691496601) INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 
8589934592, 9720856656554682660) I0503 02:05:49.400689 140562852878144 tpu_system_metadata.py:146] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 8589934592, 9720856656554682660) INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 16735225736129679856) I0503 02:05:49.400792 140562852878144 tpu_system_metadata.py:146] *** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 16735225736129679856) WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1635: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version. Instructions for updating: If using Keras pass *_constraint arguments to layers. W0503 02:05:49.405328 140562852878144 deprecation.py:506] From /usr/local/lib/python3.7/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1635: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version. Instructions for updating: If using Keras pass *_constraint arguments to layers. WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version. Instructions for updating: Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts. W0503 02:05:49.405760 140562852878144 deprecation.py:323] From /usr/local/lib/python3.7/dist-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version. Instructions for updating: Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts. INFO:tensorflow:Calling model_fn. I0503 02:05:49.412624 140562852878144 estimator.py:1151] Calling model_fn. I0503 02:05:49.850056 140562852878144 dataset_info.py:426] Load pre-computed DatasetInfo (eg: splits, num examples,...) from GCS: movie_rationales/0.1.0 I0503 02:05:49.879839 140562852878144 dataset_info.py:357] Load dataset info from /tmp/tmplb_43_hwtfds I0503 02:05:49.881754 140562852878144 dataset_info.py:397] Field info.description from disk and from code do not match. Keeping the one from code. I0503 02:05:49.881891 140562852878144 dataset_info.py:397] Field info.citation from disk and from code do not match. Keeping the one from code. I0503 02:05:50.003428 140562852878144 dataset_builder.py:333] Generating dataset movie_rationales (gs://t5-ignition-bucket/t5-tfds/movie_rationales/0.1.0) Downloading and preparing dataset movie_rationales/0.1.0 (download: 3.72 MiB, generated: Unknown size, total: 3.72 MiB) to gs://t5-ignition-bucket/t5-tfds/movie_rationales/0.1.0... Dl Completed...: 0 url [00:00, ? url/s] I0503 02:05:51.025692 140562852878144 download_manager.py:291] URL http://www.eraserbenchmark.com/zipped/movies.tar.gz already downloaded: reusing gs://t5-ignition-bucket/t5-tfds/downloads/eraserbenchmark.com_zipped_moviesZuGNTmyd-en1VEVysL_pKjlnP3Tsv8OFm0bO2y9bLe4.tar.gz. Dl Completed...: 0 url [00:00, ? url/s] ? file/s] Dl Size...: 0 MiB [00:00, ?MiB/s]`
```

If I kill the process, it prints this:

```
INFO:tensorflow:training_loop marked as finished I0503 02:43:25.864410 140402965063488 error_handling.py:108] training_loop marked as finished Traceback (most recent call last): File "/home/eligray2/.local/bin/t5_mesh_transformer", line 8, in <module> sys.exit(console_entry_point()) File "/home/eligray2/.local/lib/python3.7/site-packages/t5/models/mesh_transformer_main.py", line 222, in console_entry_point app.run(main) File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 299, in run _run_main(main, args) File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 250, in _run_main sys.exit(main(argv)) File "/home/eligray2/.local/lib/python3.7/site-packages/t5/models/mesh_transformer_main.py", line 216, in main model_dir=FLAGS.model_dir) File "/usr/local/lib/python3.7/dist-packages/gin/config.py", line 1055, in gin_wrapper return fn(*new_args, **new_kwargs) File "/home/eligray2/.local/lib/python3.7/site-packages/mesh_tensorflow/transformer/utils.py", line 1738, in run train_dataset_fn, train_steps, ensemble_inputs) File "/home/eligray2/.local/lib/python3.7/site-packages/mesh_tensorflow/transformer/utils.py", line 1132, in train_model estimator.train(input_fn=input_fn, max_steps=train_steps) File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train saving_listeners=saving_listeners) File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 374, in train loss = self._train_model(input_fn, hooks, saving_listeners) File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1164, in _train_model return self._train_model_default(input_fn, hooks, saving_listeners) File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1194, in _train_model_default features, labels, ModeKeys.TRAIN, self.config) File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2857, in _call_model_fn config) File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1152, in _call_model_fn model_fn_results = self._model_fn(features=features, **kwargs) File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3148, in _model_fn input_holders.generate_infeed_enqueue_ops_and_dequeue_fn()) File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1428, in generate_infeed_enqueue_ops_and_dequeue_f n self._invoke_input_fn_and_record_structure()) File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1483, in _invoke_input_fn_and_record_structure num_hosts)) File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1090, in generate_broadcast_enqueue_ops_fn inputs = _Inputs.from_input_fn(input_fn(user_context)) File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3001, in _input_fn return input_fn(**kwargs) File "/home/eligray2/.local/lib/python3.7/site-packages/mesh_tensorflow/transformer/utils.py", line 1126, in input_fn dataset_split=dataset_split) File "/usr/local/lib/python3.7/dist-packages/gin/config.py", line 1055, in gin_wrapper return fn(*new_args, **new_kwargs) File 
"/home/eligray2/.local/lib/python3.7/site-packages/t5/models/mesh_transformer.py", line 66, in mesh_train_dataset_fn sequence_length, split=dataset_split, use_cached=use_cached, shuffle=True) File "/home/eligray2/.local/lib/python3.7/site-packages/t5/data/utils.py", line 637, in get_dataset ds = self._dataset_fn(split=split, shuffle_files=shuffle) File "/home/eligray2/.local/lib/python3.7/site-packages/t5/data/utils.py", line 739, in dataset_fn return self._tfds_dataset.load(split, shuffle_files) File "/home/eligray2/.local/lib/python3.7/site-packages/t5/data/utils.py", line 217, in load try_gcs=True) File "/home/eligray2/.local/lib/python3.7/site-packages/tensorflow_datasets/core/api_utils.py", line 69, in disallow_positional_args_dec return fn(*args, **kwargs) File "/home/eligray2/.local/lib/python3.7/site-packages/tensorflow_datasets/core/registered.py", line 369, in load dbuilder.download_and_prepare(**download_and_prepare_kwargs) File "/home/eligray2/.local/lib/python3.7/site-packages/tensorflow_datasets/core/api_utils.py", line 69, in disallow_positional_args_dec return fn(*args, **kwargs) File "/home/eligray2/.local/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 363, in download_and_prepare download_config=download_config) File "/home/eligray2/.local/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 1004, in _download_and_prepare max_examples_per_split=download_config.max_examples_per_split, File "/home/eligray2/.local/lib/python3.7/site-packages/tensorflow_datasets/core/dataset_builder.py", line 924, in _download_and_prepare dl_manager, **split_generators_kwargs): File "/home/eligray2/.local/lib/python3.7/site-packages/tensorflow_datasets/text/movie_rationales.py", line 71, in _split_generators dl_dir = dl_manager.download_and_extract(_DOWNLOAD_URL) File "/home/eligray2/.local/lib/python3.7/site-packages/tensorflow_datasets/core/download/download_manager.py", line 419, in download_and_extract return _map_promise(self._download_extract, url_or_urls) File "/home/eligray2/.local/lib/python3.7/site-packages/tensorflow_datasets/core/download/download_manager.py", line 462, in _map_promise res = utils.map_nested(_wait_on_promise, all_promises) File "/home/eligray2/.local/lib/python3.7/site-packages/tensorflow_datasets/core/utils/py_utils.py", line 169, in map_nested return function(data_struct) File "/home/eligray2/.local/lib/python3.7/site-packages/tensorflow_datasets/core/download/download_manager.py", line 446, in _wait_on_promise return p.get() File "/usr/local/lib/python3.7/dist-packages/promise/promise.py", line 511, in get self._wait(timeout or DEFAULT_TIMEOUT) File "/usr/local/lib/python3.7/dist-packages/promise/promise.py", line 506, in _wait self.wait(self, timeout) File "/usr/local/lib/python3.7/dist-packages/promise/promise.py", line 502, in wait async_instance.wait(promise, timeout) File "/usr/local/lib/python3.7/dist-packages/promise/async_.py", line 117, in wait target.scheduler.wait(target, timeout) File "/usr/local/lib/python3.7/dist-packages/promise/schedulers/immediate.py", line 25, in wait waited = e.wait(timeout) File "/usr/lib/python3.7/threading.py", line 552, in wait signaled = self._cond.wait(timeout) File "/usr/lib/python3.7/threading.py", line 296, in wait waiter.acquire() KeyboardInterrupt
```

adarob commented 4 years ago

Quick thought: it looks like it's getting stuck in TFDS, most likely trying to extract (untar/gunzip) the ERASER data. How long are you waiting? It should only have to extract the data once, but it will block training until it's done.
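
One way to avoid the apparent hang is to pre-build the dataset with TFDS before launching training, so the download/extract step doesn't block `t5_mesh_transformer`. A minimal sketch, assuming the same `gs://my-bucket-name/t5-tfds` data dir that the script above passes to `--t5_tfds_data_dir`:

```python
import tensorflow_datasets as tfds

# Download and extract movie_rationales into the data dir that the training
# job will read from; download_and_prepare() is a no-op on later runs once
# the prepared files already exist there.
builder = tfds.builder("movie_rationales", data_dir="gs://my-bucket-name/t5-tfds")
builder.download_and_prepare()
```

Once this finishes, relaunching the training command should skip past the "Downloading and preparing dataset" step and go straight to compiling the training graph.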

poset commented 4 years ago

Yep, that's what it was: unexpected behavior from TFDS. It tricked me by being a roughly 2 MB download (0.1 s) that somehow takes my VM/TPU instance a full 15 minutes or so to unpack, all the while showing no updates on the progress bar. I had given up on it early because I'm paying for the TPU out of pocket. Thanks! Got it training.
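
If the slow part is specifically the extraction into the GCS bucket, a possible workaround (a sketch only, not something tried in this thread) is to let TFDS prepare the dataset on local disk and then copy the prepared files into the bucket; `/tmp/t5-tfds` is just a placeholder path:

```python
import tensorflow_datasets as tfds

# Prepare movie_rationales on local disk first; extracting the small tarball
# locally avoids writing each extracted file to gs:// one object at a time,
# which is likely what made the step take ~15 minutes with no visible progress.
builder = tfds.builder("movie_rationales", data_dir="/tmp/t5-tfds")
builder.download_and_prepare()
```

The prepared directory can then be copied into the bucket, e.g. with `gsutil -m cp -r /tmp/t5-tfds/movie_rationales gs://my-bucket-name/t5-tfds/`, so the path passed to `--t5_tfds_data_dir` already contains the dataset when training starts.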