Setup:
I am attempting to run my prototype on 3 machines in AWS, of type m5.4xlarge.
The TQDM progress bar never moves, and execution still appears to hang even with the progress-bar code removed.
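For context, the cluster is wired up via TF_CONFIG in the usual multi-worker way. The snippet below is a simplified reconstruction based on the worker addresses that appear in the gRPC log further down, not the exact code from the prototype:

import json
import os

# Hypothetical reconstruction of TF_CONFIG on worker 0; the addresses match
# the GrpcChannelCache line in the log below, and "index" differs on each of
# the three machines.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": [
            "10.2.249.213:2121",
            "10.2.252.56:2121",
            "10.2.252.97:2121",
        ]
    },
    "task": {"type": "worker", "index": 0},
})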
Things attempted:
I was able to run the MNIST example with the same MultiWorkerMirroredStrategy approach.
Compiled with Adagrad vs. Adam, and tried a small batch size such as 192 vs. a larger one such as 8192. Neither made a difference for my prototype (a simplified sketch of the training setup follows below).
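Roughly, the training setup looks like the sketch below. The helper names (build_tfrs_model, load_events_dataset) are placeholders for the actual TfrsModelMaker code, so treat this as illustrative rather than the exact prototype:

import tensorflow as tf

# Same MultiWorkerMirroredStrategy approach that worked for the MNIST example.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = build_tfrs_model()  # placeholder for the model built by TfrsModelMaker
    # Tried both Adagrad and Adam here; no difference.
    model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))

train_ds = load_events_dataset()  # placeholder for the dataset read from S3
# Tried global batch sizes of 192 and 8192; the TQDM callback is omitted here.
model.fit(train_ds.batch(8192), epochs=3, verbose=1)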
Can I get any output out of TFRS at this point to tell what might be going wrong? Any other things to try?
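For reference, these are the generic TensorFlow logging knobs I'm aware of (not TFRS-specific); what I'm hoping for is something beyond this:

import os
import tensorflow as tf

# Generic TensorFlow verbosity settings, listed only to clarify what kind of
# extra output I'm after beyond the standard logs.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "0"     # keep all C++ INFO/WARNING messages
tf.get_logger().setLevel("DEBUG")            # Python-side TF logger
tf.debugging.set_log_device_placement(True)  # log which device each op is placed on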
Output on the command line:
2021-04-12 21:44:02.417762: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-04-12 21:44:03.666561: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-04-12 21:44:03.667485: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-04-12 21:44:03.739716: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-04-12 21:44:03.739767: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (ip-10-2-249-213.awsinternal.audiomack.com): /proc/driver/nvidia/version does not exist
2021-04-12 21:44:03.740798: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-04-12 21:44:03.741015: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-04-12 21:44:03.741571: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-04-12 21:44:03.745589: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> 10.2.249.213:2121, 1 -> 10.2.252.56:2121, 2 -> 10.2.252.97:2121}
2021-04-12 21:44:03.745902: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:411] Started server with target: grpc://10.2.249.213:2121
>> 2021-04-12 16:44:07 : >> Running the prototype...
>> Initializing TfrsModelMaker...
>> items_path : s3://my-bucket/recsys-tf/temp-data/20210411173323/items
>> users_path : s3://my-bucket/recsys-tf/temp-data/20210411173323/users
>> events_path : s3://my-bucket/recsys-tf/temp-data/20210411173323/events
>> num_items : 100
>> num_users : 100
>> num_events : 100
2021-04-12 21:44:08.693713: W tensorflow_io/core/kernels/audio_video_mp3_kernels.cc:271] libmp3lame.so.0 or lame functions are not available
2021-04-12 21:44:08.693900: I tensorflow_io/core/kernels/cpu_check.cc:128] Your CPU supports instructions that this TensorFlow IO binary was not compiled to use: AVX2 AVX512F FMA
>> Strategy: <tensorflow.python.distribute.collective_all_reduce_strategy.CollectiveAllReduceStrategy object at 0x7f8ab9d67690>
>> 2021-04-12 16:44:09 : >> Training the model...
2021-04-12 21:44:09.449931: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-04-12 21:44:09.467194: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2499995000 Hz
0%| | 0/3 [00:00<?, ?epoch/s]
0.00batch [00:00, ?batch/s]