qnl / qnl_nonmarkov_ml

Machine learning for non-Markovian trajectories

Model gets -1 when expecting float: mask value or missing labels #4

Open noahstevenson opened 3 years ago

noahstevenson commented 3 years ago

Running train.py to train the model fails with TypeError: Expected int64 passed to parameter 'y' of op 'NotEqual', got -1.0 of type 'float' instead. Error: Expected int64, got -1.0 of type 'float' instead. This makes me think that either the LSTM training step isn't recognizing the mask value of -1, or some labels are missing. I'm using the cr_trajectories_dev branch, which branched from master at commit c7c044426731ee97b4599c3dad831954dcc41d52. @gkoolstra, have you seen this error before?

The settings are:

last_timestep = 249
mask_value = -1.0  # This is the mask value for the data, not the missing labels
total_epochs = 50  # Number of epochs for the training
mini_batch_size = 1024  # Batch size
lstm_neurons = 32  # Depth of the LSTM layer
strong_ro_dt = 20e-9  # Time interval for strong readout in the dataset in seconds

I noticed that the assembled model has an incorrect number of time steps (120 instead of the declared 249); however, the error persists when I lower last_timestep. I've used an essentially identical file (the only changes being the measurement settings) on other data without any issue. The total number of voltage records also differs from what I expect by a factor of ~12, so I'm checking that now.
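One way to narrow down the time-step and record-count mismatch is to compare the array shapes against the declared settings before building the model. The following is a minimal sketch, not code from the repository: the helper name check_dataset_shapes is hypothetical, the (records, time steps, features) layout is assumed from the model summary, and whether last_timestep should equal the sequence length exactly is also an assumption.

```python
import numpy as np

def check_dataset_shapes(features, labels, last_timestep, mask_value=-1.0):
    """Summarize array shapes against the declared settings (hypothetical helper)."""
    # Assumed layout: (records, time steps, features), as in the model summary.
    n_records, n_steps, n_features = features.shape
    # A time step counts as unmasked if any feature differs from the mask value.
    valid = np.any(features != mask_value, axis=-1)
    lengths = valid.sum(axis=1)
    return {
        "n_records": n_records,
        "n_steps": n_steps,
        "n_features": n_features,
        "labels_shape": labels.shape,
        "labels_dtype": str(labels.dtype),
        "steps_match_setting": n_steps == last_timestep,
        "longest_unmasked_record": int(lengths.max()),
    }

# Toy arrays shaped like the model summary: 120 time steps, 2 features, 6 label columns.
features = np.zeros((8, 120, 2), dtype=np.float32)
features[:, 100:, :] = -1.0  # pad the tail of every record with the mask value
labels = np.zeros((8, 120, 6), dtype=np.int64)

report = check_dataset_shapes(features, labels, last_timestep=249)
print(report)
```

Running the same check on the real feature and label arrays would show whether the 120 time steps and the unexpected record count come from the data file itself or from the loading code.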

Full traceback:

(qutip-env) [qnl@kraken vanilla_lstm]$ python train.py
/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216, got 192
  return f(*args, **kwds)
2020-10-15 14:11:17.467923: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/qnl/miniconda3/envs/qutip-env/lib/libfabric:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/extras/CUPTI/lib64:
2020-10-15 14:11:17.468057: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/qnl/miniconda3/envs/qutip-env/lib/libfabric:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/extras/CUPTI/lib64:
2020-10-15 14:11:17.468080: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2020-10-15 14:11:19.118104: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-10-15 14:11:19.146598: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:03:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
coreClock: 1.683GHz coreCount: 28 deviceMemorySize: 10.91GiB deviceMemoryBandwidth: 451.17GiB/s
2020-10-15 14:11:19.146777: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/qnl/miniconda3/envs/qutip-env/lib/libfabric:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/extras/CUPTI/lib64:
2020-10-15 14:11:19.146884: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10'; dlerror: libcublas.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/qnl/miniconda3/envs/qutip-env/lib/libfabric:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/extras/CUPTI/lib64:
2020-10-15 14:11:19.146988: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/qnl/miniconda3/envs/qutip-env/lib/libfabric:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/extras/CUPTI/lib64:
2020-10-15 14:11:19.147090: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10'; dlerror: libcurand.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/qnl/miniconda3/envs/qutip-env/lib/libfabric:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/extras/CUPTI/lib64:
2020-10-15 14:11:19.147190: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/qnl/miniconda3/envs/qutip-env/lib/libfabric:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/extras/CUPTI/lib64:
2020-10-15 14:11:19.147287: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10'; dlerror: libcusparse.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/qnl/miniconda3/envs/qutip-env/lib/libfabric:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/extras/CUPTI/lib64:
2020-10-15 14:11:19.150879: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-10-15 14:11:19.150912: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1592] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
[]
True
Creating model...
2020-10-15 14:11:21.559737: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-10-15 14:11:21.568508: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2100025000 Hz
2020-10-15 14:11:21.569526: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7813990 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-10-15 14:11:21.569561: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-10-15 14:11:21.655420: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7871680 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-10-15 14:11:21.655467: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2020-10-15 14:11:21.655607: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-10-15 14:11:21.655628: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      
Building model...
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
masking (Masking)            (None, 120, 2)            0         
_________________________________________________________________
lstm (LSTM)                  (None, 120, 32)           4480      
_________________________________________________________________
time_distributed (TimeDistri (None, 120, 6)            198       
=================================================================
Total params: 4,678
Trainable params: 4,678
Non-trainable params: 0
_________________________________________________________________
Compiling model...
Expected accuracy should converge to 0.8517304056432792
Training started...
Train on 17949 samples, validate on 1995 samples
Setting up a new session...
Epoch 1/50
2020-10-15 14:11:25.803284: I tensorflow/core/profiler/lib/profiler_session.cc:225] Profiler session started.
2020-10-15 14:11:25.803361: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1259] Profiler found 1 GPUs
2020-10-15 14:11:25.803582: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcupti.so.10.1'; dlerror: libcupti.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/qnl/miniconda3/envs/qutip-env/lib/libfabric:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/extras/CUPTI/lib64:
2020-10-15 14:11:25.803621: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1307] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI could not be loaded or symbol could not be found.
2020-10-15 14:11:25.803642: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1346] function cupti_interface_->ActivityRegisterCallbacks( AllocCuptiActivityBuffer, FreeCuptiActivityBuffer)failed with error CUPTI could not be loaded or symbol could not be found.
 1024/17949 [>.............................] - ETA: 13sDropout scheduling failed.
2020-10-15 14:11:25.827748: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1329] function cupti_interface_->EnableCallback( 0 , subscriber_, CUPTI_CB_DOMAIN_DRIVER_API, cbid)failed with error CUPTI could not be loaded or symbol could not be found.
2020-10-15 14:11:25.827800: I tensorflow/core/profiler/internal/gpu/device_tracer.cc:88]  GpuTracer has collected 0 callback api events and 0 activity events.
Traceback (most recent call last):
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/framework/tensor_util.py", line 324, in _AssertCompatible
    fn(values)
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/framework/tensor_util.py", line 263, in inner
    _ = [_check_failed(v) for v in nest.flatten(values)
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/framework/tensor_util.py", line 264, in <listcomp>
    if not isinstance(v, expected_types)]
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/framework/tensor_util.py", line 248, in _check_failed
    raise ValueError(v)
ValueError: -1.0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 468, in _apply_op_helper
    preferred_dtype=default_dtype)
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1314, in convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/framework/tensor_conversion_registry.py", line 52, in _default_conversion_function
    return constant_op.constant(value, dtype, name=name)
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 258, in constant
    allow_broadcast=True)
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 296, in _constant_impl
    allow_broadcast=allow_broadcast))
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/framework/tensor_util.py", line 451, in make_tensor_proto
    _AssertCompatible(values, dtype)
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/framework/tensor_util.py", line 331, in _AssertCompatible
    (dtype.name, repr(mismatch), type(mismatch).__name__))
TypeError: Expected int64, got -1.0 of type 'float' instead.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 107, in <module>
    history = m.fit_model(total_epochs)
  File "/home/qnl/noah/projects/2020-NonMarkovTrajectories/code/qnl_nonmarkov_ml/vanilla_lstm/vanilla_lstm.py", line 159, in fit_model
    DropOutScheduler(self.dropout_schedule)])
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training.py", line 819, in fit
    use_multiprocessing=use_multiprocessing)
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 342, in fit
    total_epochs=epochs)
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 128, in run_one_epoch
    batch_outs = execution_function(iterator)
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 98, in execution_function
    distributed_function(input_fn))
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 568, in __call__
    result = self._call(*args, **kwds)
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 615, in _call
    self._initialize(args, kwds, add_initializers_to=initializers)
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 497, in _initialize
    *args, **kwds))
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 2389, in _get_concrete_function_internal_garbage_collected
    graph_function, _, _ = self._maybe_define_function(args, kwargs)
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 2703, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 2593, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/framework/func_graph.py", line 978, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 439, in wrapped_fn
    return weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 85, in distributed_function
    per_replica_function, args=args)
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 763, in experimental_run_v2
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1819, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py", line 2164, in _call_for_each_replica
    return fn(*args, **kwargs)
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 292, in wrapper
    return func(*args, **kwargs)
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 433, in train_on_batch
    output_loss_metrics=model._output_loss_metrics)
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_eager.py", line 312, in train_on_batch
    output_loss_metrics=output_loss_metrics))
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_eager.py", line 253, in _process_single_batch
    training=training))
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_eager.py", line 167, in _model_loss
    per_sample_losses = loss_fn.call(targets[i], outs[i])
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/keras/losses.py", line 221, in call
    return self.fn(y_true, y_pred, **self._fn_kwargs)
  File "/home/qnl/noah/projects/2020-NonMarkovTrajectories/code/qnl_nonmarkov_ml/vanilla_lstm/vanilla_lstm.py", line 188, in masked_loss_function
    mask = K.cast(K.not_equal(y_true, self.mask_value), K.floatx())
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/keras/backend.py", line 2331, in not_equal
    return math_ops.not_equal(x, y)
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/util/dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/ops/math_ops.py", line 1340, in not_equal
    return gen_math_ops.not_equal(x, y, name=name)
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_math_ops.py", line 6455, in not_equal
    name=name)
  File "/home/qnl/miniconda3/envs/qutip-env/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 477, in _apply_op_helper
    repr(values), type(values).__name__, err))
TypeError: Expected int64 passed to parameter 'y' of op 'NotEqual', got -1.0 of type 'float' instead. Error: Expected int64, got -1.0 of type 'float' instead.
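
The last frames of the traceback show K.not_equal(y_true, self.mask_value) receiving an int64 y_true together with the Python float -1.0, which TensorFlow will not convert to int64. One possible workaround, assuming the labels are stored as an integer array, is to cast them to float32 before they reach the model; casting y_true with K.cast(y_true, K.floatx()) inside masked_loss_function should have the same effect. The sketch below uses NumPy as a stand-in for the Keras backend so that it runs without TensorFlow. The helper names are hypothetical, and it is not confirmed that -1 is the intended mask for the labels (the settings comment says mask_value applies to the data).

```python
import numpy as np

MASK_VALUE = -1.0  # same value as mask_value in the settings above

def labels_as_float(labels):
    """Return labels as float32 so they share a dtype with the float mask value."""
    labels = np.asarray(labels)
    if np.issubdtype(labels.dtype, np.integer):
        return labels.astype(np.float32)
    return labels

def label_mask(y_true, mask_value=MASK_VALUE):
    """NumPy analogue of K.cast(K.not_equal(y_true, mask_value), K.floatx())."""
    return (y_true != mask_value).astype(np.float32)

# Toy integer labels where -1 marks a masked entry (assumed convention).
int_labels = np.array([[1, 0, -1], [0, 1, -1]], dtype=np.int64)
float_labels = labels_as_float(int_labels)
mask = label_mask(float_labels)
print(float_labels.dtype, mask.tolist())
```

If the labels turn out to be integers only for this dataset, that would also be consistent with the same training script working on other data.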