isl-org / DeepLagrangianFluids

Code repository for "Lagrangian Fluid Simulation with Continuous Convolutions", ICLR 2020.

Error when running ./train_network_tf.py #17

GUT2060 closed this issue 3 years ago

GUT2060 commented 3 years ago

Here is the full output of ./train_network_tf.py. Any idea how to solve this?

2021-04-08 12:05:13.984267: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
['./../datasets/ours_default_data/valid/sim_0201_00.msgpack.zst', './../datasets/ours_default_data/valid/sim_0201_01.msgpack.zst', './../datasets/ours_default_data/valid/sim_0201_02.msgpack.zst', './../datasets/ours_default_data/valid/sim_0201_03.msgpack.zst', './../datasets/ours_default_data/valid/sim_0201_04.msgpack.zst', './../datasets/ours_default_data/valid/sim_0201_05.msgpack.zst', './../datasets/ours_default_data/valid/sim_0201_06.msgpack.zst', './../datasets/ours_default_data/valid/sim_0201_07.msgpack.zst', './../datasets/ours_default_data/valid/sim_0201_08.msgpack.zst', './../datasets/ours_default_data/valid/sim_0201_09.msgpack.zst', './../datasets/ours_default_data/valid/sim_0201_10.msgpack.zst', './../datasets/ours_default_data/valid/sim_0201_11.msgpack.zst', './../datasets/ours_default_data/valid/sim_0201_12.msgpack.zst', './../datasets/ours_default_data/valid/sim_0201_13.msgpack.zst', './../datasets/ours_default_data/valid/sim_0201_14.msgpack.zst', './../datasets/ours_default_data/valid/sim_0201_15.msgpack.zst', './../datasets/ours_default_data/valid/sim_0202_00.msgpack.zst', './../datasets/ours_default_data/valid/sim_0202_01.msgpack.zst', './../datasets/ours_default_data/valid/sim_0202_02.msgpack.zst', './../datasets/ours_default_data/valid/sim_0202_03.msgpack.zst'] ...
['./../datasets/ours_default_data/train/sim_0001_00.msgpack.zst', './../datasets/ours_default_data/train/sim_0001_01.msgpack.zst', './../datasets/ours_default_data/train/sim_0001_02.msgpack.zst', './../datasets/ours_default_data/train/sim_0001_03.msgpack.zst', './../datasets/ours_default_data/train/sim_0001_04.msgpack.zst', './../datasets/ours_default_data/train/sim_0001_05.msgpack.zst', './../datasets/ours_default_data/train/sim_0001_06.msgpack.zst', './../datasets/ours_default_data/train/sim_0001_07.msgpack.zst', './../datasets/ours_default_data/train/sim_0001_08.msgpack.zst', './../datasets/ours_default_data/train/sim_0001_09.msgpack.zst', './../datasets/ours_default_data/train/sim_0001_10.msgpack.zst', './../datasets/ours_default_data/train/sim_0001_11.msgpack.zst', './../datasets/ours_default_data/train/sim_0001_12.msgpack.zst', './../datasets/ours_default_data/train/sim_0001_13.msgpack.zst', './../datasets/ours_default_data/train/sim_0001_14.msgpack.zst', './../datasets/ours_default_data/train/sim_0001_15.msgpack.zst', './../datasets/ours_default_data/train/sim_0002_00.msgpack.zst', './../datasets/ours_default_data/train/sim_0002_01.msgpack.zst', './../datasets/ours_default_data/train/sim_0002_02.msgpack.zst', './../datasets/ours_default_data/train/sim_0002_03.msgpack.zst'] ...
[0408 12:05:28 @parallel.py:340] [MultiProcessRunnerZMQ] Will fork a dataflow more than one times. This assumes the datapoints are i.i.d.
2021-04-08 12:05:29.748225: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Process _Worker-1:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "****/DeepLagrangianFluids/env36/lib/python3.6/site-packages/dataflow/dataflow/parallel.py", line 313, in run
    dp = next(itr)
  File "/****/DeepLagrangianFluids/env36/lib/python3.6/site-packages/dataflow/dataflow/parallel.py", line 50, in _repeat_iter
    yield from get_itr()
  File "/****/DeepLagrangianFluids/env36/lib/python3.6/site-packages/dataflow/dataflow/common.py", line 657, in __iter__
    for dp in self._inf_iter:
  File "/****/DeepLagrangianFluids/env36/lib/python3.6/site-packages/dataflow/dataflow/common.py", line 389, in __iter__
    yield from self.ds
  File "/****/DeepLagrangianFluids/env36/lib/python3.6/site-packages/dataflow/dataflow/common.py", line 389, in __iter__
    yield from self.ds
  File "./../datasets/dataset_reader_physics.py", line 46, in __iter__
    box = data[0]['box']
IndexError: list index out of range
2021-04-08 12:05:32.033298: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
./../datasets/ours_default_data/train/sim_0118_08.msgpack.zst HERE !!
Process _Worker-2:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/****/DeepLagrangianFluids/env36/lib/python3.6/site-packages/dataflow/dataflow/parallel.py", line 313, in run
    dp = next(itr)
  File "/****/DeepLagrangianFluids/env36/lib/python3.6/site-packages/dataflow/dataflow/parallel.py", line 50, in _repeat_iter
    yield from get_itr()
  File "/****/DeepLagrangianFluids/env36/lib/python3.6/site-packages/dataflow/dataflow/common.py", line 657, in __iter__
    for dp in self._inf_iter:
  File "/****/DeepLagrangianFluids/env36/lib/python3.6/site-packages/dataflow/dataflow/common.py", line 389, in __iter__
    yield from self.ds
  File "/****/DeepLagrangianFluids/env36/lib/python3.6/site-packages/dataflow/dataflow/common.py", line 389, in __iter__
    yield from self.ds
  File "./../datasets/dataset_reader_physics.py", line 46, in __iter__
    box = data[0]['box']
IndexError: list index out of range
2021-04-08 12:05:44.522652: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-04-08 12:05:47.379310: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2021-04-08 12:05:47.437509: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-04-08 12:05:47.438251: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:03:00.0 name: GeForce 940MX computeCapability: 5.0
coreClock: 1.2415GHz coreCount: 3 deviceMemorySize: 3.95GiB deviceMemoryBandwidth: 13.41GiB/s
2021-04-08 12:05:47.438293: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-04-08 12:05:47.438434: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcublas.so.10'; dlerror: libcublas.so.10: cannot open shared object file: No such file or directory
2021-04-08 12:05:47.464373: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-04-08 12:05:47.494165: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-04-08 12:05:47.544487: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2021-04-08 12:05:47.565652: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2021-04-08 12:05:47.662424: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-04-08 12:05:47.662515: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1753] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-04-08 12:05:47.663295: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-04-08 12:05:47.793170: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2899885000 Hz
2021-04-08 12:05:47.794340: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5b046d0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-04-08 12:05:47.794459: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2021-04-08 12:05:47.823957: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-04-08 12:05:47.824033: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      
# 2021-04-08 12:05:48        0 n/a ips                 n/a rem | 
[0408 12:05:48 @parallel.py:351] ERR Exception '<class 'IndexError'>' in worker:
Traceback (most recent call last):
  File "./train_network_tf.py", line 165, in <module>
    sys.exit(main())
  File "./train_network_tf.py", line 134, in main
    batch = next(data_iter)
  File "/****/DeepLagrangianFluids/env36/lib/python3.6/site-packages/dataflow/dataflow/common.py", line 120, in __iter__
    for data in self.ds:
  File "/****/DeepLagrangianFluids/env36/lib/python3.6/site-packages/dataflow/dataflow/parallel.py", line 363, in __iter__
    yield self._recv()
  File "/****/DeepLagrangianFluids/env36/lib/python3.6/site-packages/dataflow/dataflow/parallel.py", line 352, in _recv
    raise exc.exc_type(exc.exc_msg)
IndexError: Traceback (most recent call last):
  File "/****/DeepLagrangianFluids/env36/lib/python3.6/site-packages/dataflow/dataflow/parallel.py", line 313, in run
    dp = next(itr)
  File "/****/DeepLagrangianFluids/env36/lib/python3.6/site-packages/dataflow/dataflow/parallel.py", line 50, in _repeat_iter
    yield from get_itr()
  File "/****/DeepLagrangianFluids/env36/lib/python3.6/site-packages/dataflow/dataflow/common.py", line 657, in __iter__
    for dp in self._inf_iter:
  File "/****/DeepLagrangianFluids/env36/lib/python3.6/site-packages/dataflow/dataflow/common.py", line 389, in __iter__
    yield from self.ds
  File "/****/DeepLagrangianFluids/env36/lib/python3.6/site-packages/dataflow/dataflow/common.py", line 389, in __iter__
    yield from self.ds
  File "./../datasets/dataset_reader_physics.py", line 46, in __iter__
    box = data[0]['box']
IndexError: list index out of range

MultiProcessRunnerZMQ successfully cleaned-up.
MultiProcessRunnerZMQ successfully cleaned-up.

benjaminum commented 3 years ago

Hi @GUT2060, can you check the file size of the generated dataset files?
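
One way to run that check is sketched below. It lists every file in a dataset directory that is smaller than a threshold. The function name `find_suspicious_files` and the 10 KiB threshold are assumptions made for illustration and are not part of the repository; the `*.msgpack.zst` pattern is taken from the file names in the log above.

```python
from pathlib import Path

# Assumption: 10 KiB is an arbitrary cut-off chosen for this sketch.
# Adjust it to whatever a correctly generated file looks like on your machine.
MIN_EXPECTED_BYTES = 10 * 1024


def find_suspicious_files(data_dir, pattern="*.msgpack.zst",
                          min_bytes=MIN_EXPECTED_BYTES):
    """Return (path, size_in_bytes) pairs for files smaller than min_bytes."""
    results = []
    for path in sorted(Path(data_dir).glob(pattern)):
        size = path.stat().st_size
        if size < min_bytes:
            results.append((path, size))
    return results
```

Calling `find_suspicious_files("./../datasets/ours_default_data/train")` from the scripts directory would return the undersized training files, if there are any.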

GUT2060 commented 3 years ago

Hi @benjaminum, your hint was helpful. The files were only 3 KB each, so the dataset generation had not completed properly. I'm not sure what caused this. After running the script on another PC (Ubuntu 20.04, Python 3.8), the problem is gone. Thanks!
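
The traceback above fails at `box = data[0]['box']` with `IndexError: list index out of range`, which means `data` was empty when it was indexed. A guard along the following lines would make this failure easier to diagnose. This is only a sketch: `first_frame_box` is a hypothetical helper and not code from `dataset_reader_physics.py`.

```python
def first_frame_box(data, path):
    """Return data[0]['box'], or raise a descriptive error if data is empty.

    Hypothetical helper: `data` stands for the decoded content of one
    dataset file and `path` for the file it was read from.
    """
    if not data:
        raise ValueError(
            f"{path} contains no entries; the file may not have been "
            "generated correctly"
        )
    return data[0]["box"]
```

With such a check, the error message names the offending file instead of reporting only an out-of-range index.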