vanhuyz / CycleGAN-TensorFlow

An implementation of CycleGan using TensorFlow
MIT License

CPU 100%, but training never starts #109

Open AmitMY opened 5 years ago

AmitMY commented 5 years ago

I ran:

python3 train.py  \
    --X=data/tfrecords/hands_dirty.tfrecords \
    --Y=data/tfrecords/hands_clean.tfrecords \
    --image_size=368

This produced a lot of output and pushed CPU usage to 100%. I have been waiting for an hour to see training steps, but training does not seem to start.

Is there a reason for it to be frozen?

(Not only do I not see any steps in the terminal, but TensorBoard is empty as well.)

2019-07-25 17:44:59.211092: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-07-25 17:44:59.213841: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-07-25 17:44:59.216295: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-07-25 17:44:59.216873: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-07-25 17:44:59.220049: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-07-25 17:44:59.222446: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-07-25 17:44:59.269580: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-07-25 17:44:59.288028: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1, 2, 3
2019-07-25 17:44:59.288551: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-07-25 17:44:59.907510: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x555ff98ec9f0 executing computations on platform CUDA. Devices:
2019-07-25 17:44:59.907590: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2019-07-25 17:44:59.907617: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (1): GeForce GTX 1080 Ti, Compute Capability 6.1
2019-07-25 17:44:59.907637: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (2): GeForce GTX 1080 Ti, Compute Capability 6.1
2019-07-25 17:44:59.907657: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (3): GeForce GTX 1080 Ti, Compute Capability 6.1
2019-07-25 17:44:59.913152: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2200045000 Hz
2019-07-25 17:44:59.919999: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x555ff819b190 executing computations on platform Host. Devices:
2019-07-25 17:44:59.920055: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): ,
2019-07-25 17:44:59.926804: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62 pciBusID: 0000:02:00.0
2019-07-25 17:44:59.929084: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62 pciBusID: 0000:03:00.0
2019-07-25 17:44:59.931260: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 2 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62 pciBusID: 0000:81:00.0
2019-07-25 17:44:59.933309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 3 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62 pciBusID: 0000:82:00.0
2019-07-25 17:44:59.933387: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-07-25 17:44:59.933422: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-07-25 17:44:59.933451: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-07-25 17:44:59.933480: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-07-25 17:44:59.933509: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-07-25 17:44:59.933537: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-07-25 17:44:59.933567: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-07-25 17:44:59.949195: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1, 2, 3
2019-07-25 17:44:59.949264: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-07-25 17:44:59.957493: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-25 17:44:59.957530: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0 1 2 3
2019-07-25 17:44:59.957555: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N Y N N
2019-07-25 17:44:59.957596: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1: Y N N N
2019-07-25 17:44:59.957611: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 2: N N N Y
2019-07-25 17:44:59.957626: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 3: N N Y N
2019-07-25 17:44:59.966648: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10064 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:02:00.0, compute capability: 6.1)
2019-07-25 17:44:59.969143: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10481 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:03:00.0, compute capability: 6.1)
2019-07-25 17:44:59.971469: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10481 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:81:00.0, compute capability: 6.1)
2019-07-25 17:44:59.974262: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10481 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:82:00.0, compute capability: 6.1)
2019-07-25 17:45:02.640893: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
W0725 17:45:03.616461 139655837968192 deprecation.py:323] From train.py:83: start_queue_runners (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version. Instructions for updating: To construct input pipelines, use the tf.data module.
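
The last log line shows that train.py feeds its input through queue runners. One possibility worth ruling out (an assumption, not a confirmed diagnosis for this issue) is that one of the .tfrecords files is empty or truncated, which would leave the training loop waiting for examples that never arrive. The sketch below counts the records in a file without importing TensorFlow. It relies only on the TFRecord framing (an 8-byte little-endian length, a 4-byte length CRC, the payload, and a 4-byte payload CRC) and skips the CRCs rather than verifying them. The function name count_tfrecords is made up for this example.

```python
import struct


def count_tfrecords(path):
    """Count records in a TFRecord file. CRC fields are skipped, not verified."""
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(8)  # uint64 little-endian payload length
            if not header:
                return count  # clean end of file
            if len(header) < 8:
                raise ValueError("truncated length header after %d records" % count)
            (length,) = struct.unpack("<Q", header)
            length_crc = f.read(4)  # masked CRC32C of the length, ignored here
            payload = f.read(length)
            payload_crc = f.read(4)  # masked CRC32C of the payload, ignored here
            if len(length_crc) < 4 or len(payload) < length or len(payload_crc) < 4:
                raise ValueError("truncated record after %d records" % count)
            count += 1
```

Calling count_tfrecords("data/tfrecords/hands_dirty.tfrecords") and the same for hands_clean.tfrecords should return a positive number for each file. A result of 0 or a ValueError would point at the data conversion step rather than at training itself.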

philthestone commented 3 years ago

I ran into the same problem. Did you find a solution?

AmitMY commented 3 years ago

Not that I recall, sorry
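
For anyone hitting the same hang, a minimal sketch for finding out where the Python side of the process is stuck, using the standard library's faulthandler module. It prints the stack of every Python thread at a fixed interval, so a run that sits at 100% CPU without producing steps shows which line it is sitting on. The helper names and the 60-second default are made up for this example, and calling enable_hang_dumps() near the top of train.py is an assumed placement, not something the repository does.

```python
import faulthandler
import sys


def enable_hang_dumps(interval_s=60.0, file=sys.stderr):
    """Dump the stack of every Python thread every interval_s seconds."""
    # repeat=True keeps dumping until cancel_dump_traceback_later() is called.
    faulthandler.dump_traceback_later(interval_s, repeat=True, file=file)


def disable_hang_dumps():
    """Stop the periodic dumps, for example once training steps start appearing."""
    faulthandler.cancel_dump_traceback_later()
```

Only Python frames appear in the dump. If the main thread is blocked inside a TensorFlow session call, the output shows that call site but not what the native code is doing underneath it.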