naszilla / tabzilla

Apache License 2.0

TabTransformer CUDA issue #81

Open duncanmcelfresh opened 1 year ago

duncanmcelfresh commented 1 year ago

occurs on datasets:

traceback:

Traceback (most recent call last):
  File "/home/shared/tabzilla/TabSurvey/tabzilla_experiment.py", line 137, in __call__
    result = cross_validation(model, self.dataset, self.time_limit)
  File "/home/shared/tabzilla/TabSurvey/tabzilla_utils.py", line 236, in cross_validation
    loss_history, val_loss_history = curr_model.fit(
  File "/home/shared/tabzilla/TabSurvey/models/tabtransformer.py", line 120, in fit
    loss.backward()
  File "/opt/conda/envs/torch/lib/python3.10/site-packages/torch/_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/envs/torch/lib/python3.10/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
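
The last line of the traceback suggests `CUDA_LAUNCH_BLOCKING=1`. A minimal sketch of turning that on from Python, assuming the variable is set before `torch` is imported so it is in place when CUDA initialises. The helper `cuda_debug_summary` is a hypothetical name, not part of tabzilla, and the `torch` import is optional so the snippet also runs on a machine without it:

```python
import os

# Set this before importing torch so it is in place when CUDA initialises.
# With blocking launches, the Python stack trace should point at the call
# that actually failed rather than at a later API call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

try:
    import torch
except ImportError:  # allow the sketch to run without PyTorch installed
    torch = None


def cuda_debug_summary():
    """Hypothetical helper: collect the settings worth pasting into a bug report."""
    info = {
        "launch_blocking": os.environ.get("CUDA_LAUNCH_BLOCKING"),
        "torch_version": None,
        "cuda_version": None,
        "cuda_available": False,
    }
    if torch is not None:
        info["torch_version"] = torch.__version__
        info["cuda_version"] = torch.version.cuda
        info["cuda_available"] = torch.cuda.is_available()
    return info


if __name__ == "__main__":
    print(cuda_debug_summary())
```

Setting the variable in the shell for a single run (prefixing the experiment command with `CUDA_LAUNCH_BLOCKING=1`) should have the same effect. It slows training down, so it is probably only worth enabling while reproducing the failure.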

duncanmcelfresh commented 1 year ago

update - this is a nasty bug. there are a handful of discussions on Stack Exchange and in other GitHub repos trying to diagnose this "CUDA error: invalid configuration argument" error.

this is also an intermittent bug: it occurs on the datasets listed in the original post, but not on many other datasets (e.g., "openml__credit-approval__29" is fine)
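
Since the failure depends on the dataset, one thing that might be worth comparing between failing and passing datasets is the shape of the tensors in each batch. The sketch below is only a guess at a diagnostic, not a confirmed cause: it flags shapes with an empty dimension or a dimension above 65535 (the CUDA grid limit in the y and z dimensions), two conditions that are often reported alongside this error. The function name and the tensor names `x_categ`, `x_cont` and `x_big` are hypothetical.

```python
from typing import Iterable, List, Tuple

# CUDA limits the grid size to 65535 blocks in the y and z dimensions.
GRID_YZ_LIMIT = 65535


def flag_suspicious_shapes(
    named_shapes: Iterable[Tuple[str, Tuple[int, ...]]],
    limit: int = GRID_YZ_LIMIT,
) -> List[Tuple[str, str]]:
    """Return (name, reason) pairs for shapes that may trigger a bad kernel launch."""
    flagged = []
    for name, shape in named_shapes:
        if any(dim == 0 for dim in shape):
            flagged.append((name, "empty dimension"))
        elif any(dim > limit for dim in shape):
            flagged.append((name, "dimension above grid y/z limit"))
    return flagged


if __name__ == "__main__":
    # Illustrative shapes only; in practice these would come from tensor.shape
    # for each batch fed to the model.
    example = [
        ("x_categ", (0, 9)),     # e.g. a dataset with no rows in a final batch
        ("x_cont", (128, 6)),    # an ordinary batch
        ("x_big", (70000, 4)),   # a dimension larger than the limit
    ]
    for name, reason in flag_suspicious_shapes(example):
        print(f"{name}: {reason}")
```

If nothing is flagged on the failing datasets, that would at least rule these two conditions out and point the search elsewhere.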