karpathy / nanoGPT

The simplest, fastest repository for training/finetuning medium-sized GPTs.
MIT License

"RuntimeError: Internal Triton PTX codegen error" is raised when I train shakespeare_char with a GPU #525

Open shenbb opened 5 months ago

shenbb commented 5 months ago

python3 train.py config/train_shakespeare_char.py

Overriding config with config/train_shakespeare_char.py:

# train a miniature character-level shakespeare model
# good for debugging and playing on macbooks and such

out_dir = 'out-shakespeare-char'
eval_interval = 250 # keep frequent because we'll overfit
eval_iters = 200
log_interval = 10 # don't print too too often

# we expect to overfit on this small dataset, so only save when val improves
always_save_checkpoint = False

wandb_log = False # override via command line if you like
wandb_project = 'shakespeare-char'
wandb_run_name = 'mini-gpt'

dataset = 'shakespeare_char'
gradient_accumulation_steps = 1
batch_size = 64
block_size = 256 # context of up to 256 previous characters

# baby GPT model :)
n_layer = 6
n_head = 6
n_embd = 384
dropout = 0.2

learning_rate = 1e-3 # with baby networks can afford to go a bit higher
max_iters = 5000
lr_decay_iters = 5000 # make equal to max_iters usually
min_lr = 1e-4 # learning_rate / 10 usually
beta2 = 0.99 # make a bit bigger because number of tokens per iter is small

warmup_iters = 100 # not super necessary potentially

# on macbook also add
# device = 'cpu' # run on cpu only
# compile = False # do not torch compile the model

tokens per iteration will be: 16,384
found vocab_size = 65 (inside data/shakespeare_char/meta.pkl)
Initializing a new model from scratch
number of parameters: 10.65M
num decayed parameter tensors: 26, with 10,740,096 parameters
num non-decayed parameter tensors: 13, with 4,992 parameters
using fused AdamW: True
compiling the model... (takes a ~minute)
Traceback (most recent call last):
  File "train.py", line 264, in <module>
    losses = estimate_loss()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "train.py", line 224, in estimate_loss
    logits, loss = model(X, Y)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/_dynamo/convert_frame.py", line 921, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state, skip=1)
  File "/usr/local/lib/python3.8/dist-packages/torch/_dynamo/convert_frame.py", line 786, in _convert_frame
    result = inner_convert(
  File "/usr/local/lib/python3.8/dist-packages/torch/_dynamo/convert_frame.py", line 400, in _convert_frame_assert
    return _compile(
  File "/usr/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/torch/_dynamo/convert_frame.py", line 676, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/usr/local/lib/python3.8/dist-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
    r = func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/_dynamo/convert_frame.py", line 535, in compile_inner
    out_code = transform_code_object(code, transform)
  File "/usr/local/lib/python3.8/dist-packages/torch/_dynamo/bytecode_transformation.py", line 1036, in transform_code_object
    transformations(instructions, code_options)
  File "/usr/local/lib/python3.8/dist-packages/torch/_dynamo/convert_frame.py", line 165, in _fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/_dynamo/convert_frame.py", line 500, in transform
    tracer.run()
  File "/usr/local/lib/python3.8/dist-packages/torch/_dynamo/symbolic_convert.py", line 2149, in run
    super().run()
  File "/usr/local/lib/python3.8/dist-packages/torch/_dynamo/symbolic_convert.py", line 810, in run
    and self.step()
  File "/usr/local/lib/python3.8/dist-packages/torch/_dynamo/symbolic_convert.py", line 773, in step
    getattr(self, inst.opname)(inst)
  File "/usr/local/lib/python3.8/dist-packages/torch/_dynamo/symbolic_convert.py", line 2268, in RETURN_VALUE
    self.output.compile_subgraph(
  File "/usr/local/lib/python3.8/dist-packages/torch/_dynamo/output_graph.py", line 1001, in compile_subgraph
    self.compile_and_call_fx_graph(tx, pass2.graph_output_vars(), root)
  File "/usr/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/torch/_dynamo/output_graph.py", line 1178, in compile_and_call_fx_graph
    compiled_fn = self.call_user_compiler(gm)
  File "/usr/local/lib/python3.8/dist-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
    r = func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/_dynamo/output_graph.py", line 1251, in call_user_compiler
    raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
  File "/usr/local/lib/python3.8/dist-packages/torch/_dynamo/output_graph.py", line 1232, in call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
  File "/usr/local/lib/python3.8/dist-packages/torch/_dynamo/repro/after_dynamo.py", line 117, in debug_wrapper
    compiled_gm = compiler_fn(gm, example_inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/__init__.py", line 1731, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
  File "/usr/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/torch/_inductor/compile_fx.py", line 1330, in compile_fx
    return aot_autograd(
  File "/usr/local/lib/python3.8/dist-packages/torch/_dynamo/backends/common.py", line 58, in compiler_fn
    cg = aot_module_simplified(gm, example_inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/_functorch/aot_autograd.py", line 903, in aot_module_simplified
    compiled_fn = create_aot_dispatcher_function(
  File "/usr/local/lib/python3.8/dist-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
    r = func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/_functorch/aot_autograd.py", line 628, in create_aot_dispatcher_function
    compiled_fn = compiler_fn(flat_fn, fake_flat_args, aot_config, fw_metadata=fw_metadata)
  File "/usr/local/lib/python3.8/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 443, in aot_wrapper_dedupe
    return compiler_fn(flat_fn, leaf_flat_args, aot_config, fw_metadata=fw_metadata)
  File "/usr/local/lib/python3.8/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 648, in aot_wrapper_synthetic_base
    return compiler_fn(flat_fn, flat_args, aot_config, fw_metadata=fw_metadata)
  File "/usr/local/lib/python3.8/dist-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 119, in aot_dispatch_base
    compiled_fw = compiler(fw_module, updated_flat_args)
  File "/usr/local/lib/python3.8/dist-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
    r = func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/_inductor/compile_fx.py", line 1257, in fw_compiler_base
    return inner_compile(
  File "/usr/local/lib/python3.8/dist-packages/torch/_dynamo/repro/after_aot.py", line 83, in debug_wrapper
    inner_compiled_fn = compiler_fn(gm, example_inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/_inductor/debug.py", line 304, in inner
    return fn(*args, **kwargs)
  File "/usr/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/usr/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
    r = func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/_inductor/compile_fx.py", line 438, in compile_fx_inner
    compiled_graph = fx_codegen_and_compile(
  File "/usr/local/lib/python3.8/dist-packages/torch/_inductor/compile_fx.py", line 714, in fx_codegen_and_compile
    compiled_fn = graph.compile_to_fn()
  File "/usr/local/lib/python3.8/dist-packages/torch/_inductor/graph.py", line 1307, in compile_to_fn
    return self.compile_to_module().call
  File "/usr/local/lib/python3.8/dist-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
    r = func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/_inductor/graph.py", line 1254, in compile_to_module
    mod = PyCodeCache.load_by_key_path(
  File "/usr/local/lib/python3.8/dist-packages/torch/_inductor/codecache.py", line 2160, in load_by_key_path
    exec(code, mod.__dict__, mod.__dict__)
  File "/tmp/torchinductor_libra/6z/c6zptqfvl4uwgoca6tk4qimwczeni4sq2plv5hxtx7vncbopqccc.py", line 1162, in <module>
    async_compile.wait(globals())
  File "/usr/local/lib/python3.8/dist-packages/torch/_inductor/codecache.py", line 2715, in wait
    scope[key] = result.result()
  File "/usr/local/lib/python3.8/dist-packages/torch/_inductor/codecache.py", line 2522, in result
    self.future.result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 444, in result
    return self.__get_result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
RuntimeError: Internal Triton PTX codegen error:
ptxas /tmp/compile-ptx-src-863569, line 636; error   : Feature '.bf16' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-863569, line 636; error   : Feature 'cvt.bf16.f32' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-863569, line 638; error   : Feature '.bf16' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-863569, line 638; error   : Feature 'cvt.bf16.f32' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-863569, line 640; error   : Feature '.bf16' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-863569, line 640; error   : Feature 'cvt.bf16.f32' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-863569, line 642; error   : Feature '.bf16' requires .target sm_80 or higher
ptxas /tmp/compile-ptx-src-863569, line 642; error   : Feature 'cvt.bf16.f32' requires .target sm_80 or higher
ptxas fatal   : Ptx assembly aborted due to errors

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

mishra011 commented 5 months ago

I'm facing the same issue when running python train.py config/train_shakespeare_char.py on a Colab GPU.

Elrashid commented 5 months ago

@mishra011 @shenbb

I tried running the following code:

!pip install tiktoken
!git clone https://github.com/karpathy/nanoGPT.git
%cd nanoGPT
!python train.py config/train_shakespeare_char.py
!python sample.py --out_dir=out-shakespeare-char

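I found a GPU compatibility issue:

If you're using the standard V100-SXM2-16GB GPU, you may hit this error. The ptxas messages in the traceback say the generated bf16 instructions require sm_80 or higher, and the V100 is an sm_70 (compute capability 7.0) device, so this looks like a hardware-capability limit rather than a memory limit.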

Recommendation (this worked for me; I didn't have time to dig further):

To avoid this error, upgrade to Colab Pro and ensure you select the A100-SXM4-40GB GPU in your runtime settings. This should resolve the issue and allow your model to train successfully.
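To see which side of the sm_80 line a given runtime falls on, the GPU's compute capability can be queried before training. This is a minimal sketch, assuming PyTorch is installed; `native_bf16` and `describe_gpu` are hypothetical helper names, not part of nanoGPT:

```python
def native_bf16(capability):
    """True if a (major, minor) compute capability is sm_80 or newer."""
    return tuple(capability) >= (8, 0)


def describe_gpu():
    """Return (name, capability, native_bf16) for the current CUDA device.

    Returns None when torch is not installed or no CUDA device is visible.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    cap = torch.cuda.get_device_capability()  # e.g. (7, 0) on a V100
    return torch.cuda.get_device_name(), cap, native_bf16(cap)


if __name__ == "__main__":
    print(describe_gpu())
```

A V100 reports (7, 0) and a T4 reports (7, 5), both below the sm_80 that ptxas asks for. An A100 reports (8, 0).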

lichengshen commented 5 months ago

The issue comes from torch.compile() so you can pass --compile=False as a workaround.
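The same override can be scripted. This is a minimal sketch, assuming train.py takes a config path followed by --key=value overrides as in the commands earlier in this thread; `train_command` is a hypothetical helper, not part of the repository:

```python
import subprocess
import sys


def train_command(config, **overrides):
    """Build the argv for train.py with --key=value overrides appended."""
    flags = [f"--{key}={value}" for key, value in overrides.items()]
    return [sys.executable, "train.py", config, *flags]


cmd = train_command("config/train_shakespeare_char.py", compile=False)
print(" ".join(cmd[1:]))  # train.py config/train_shakespeare_char.py --compile=False

# Uncomment inside a nanoGPT checkout to launch training without torch.compile:
# subprocess.run(cmd, check=True)
```

Passing dtype="float16" the same way might keep compilation enabled while avoiding the bf16 kernels, but nobody in this thread has confirmed that.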

xinge449 commented 4 months ago

The issue comes from torch.compile() so you can pass --compile=False as a workaround.

This worked on my end. Thanks a lot!

lise-brinck commented 3 months ago

I am facing the same issue on a T4 GPU:

E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] Error in subprocess
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] concurrent.futures.process._RemoteTraceback:
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] """
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] Traceback (most recent call last):
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205]   File "/home/ubuntu/nanoGPT/.venv/lib/python3.9/site-packages/triton/backends/nvidia/compiler.py", line 292, in make_cubin
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205]     subprocess.run(cmd, shell=True, check=True)
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205]   File "/home/ubuntu/.pyenv/versions/3.9.17/lib/python3.9/subprocess.py", line 528, in run
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205]     raise CalledProcessError(retcode, process.args,
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] subprocess.CalledProcessError: Command '/home/ubuntu/nanoGPT/.venv/lib/python3.9/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_75 /tmp/tmpjp5bgu11.ptx -o /tmp/tmpjp5bgu11.ptx.o 2> /tmp/tmp1ew0x1sq.log' returned non-zero exit status 255.
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205]
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] During handling of the above exception, another exception occurred:
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205]
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] Traceback (most recent call last):
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205]   File "/home/ubuntu/.pyenv/versions/3.9.17/lib/python3.9/concurrent/futures/process.py", line 246, in _process_worker
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205]     r = call_item.fn(*call_item.args, **call_item.kwargs)
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205]   File "/home/ubuntu/nanoGPT/.venv/lib/python3.9/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 218, in do_job
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205]     result = job()
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205]   File "/home/ubuntu/nanoGPT/.venv/lib/python3.9/site-packages/torch/_inductor/runtime/compile_tasks.py", line 69, in _worker_compile_triton
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205]     load_kernel().precompile(warm_cache_only=True)
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205]   File "/home/ubuntu/nanoGPT/.venv/lib/python3.9/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 232, in precompile
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205]     compiled_binary, launcher = self._precompile_config(
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205]   File "/home/ubuntu/nanoGPT/.venv/lib/python3.9/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 416, in _precompile_config
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205]     triton.compile(*compile_args, **compile_kwargs),
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205]   File "/home/ubuntu/nanoGPT/.venv/lib/python3.9/site-packages/triton/compiler/compiler.py", line 282, in compile
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205]     next_module = compile_ir(module, metadata)
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205]   File "/home/ubuntu/nanoGPT/.venv/lib/python3.9/site-packages/triton/backends/nvidia/compiler.py", line 320, in <lambda>
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205]     stages["cubin"] = lambda src, metadata: self.make_cubin(src, metadata, options, self.capability)
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205]   File "/home/ubuntu/nanoGPT/.venv/lib/python3.9/site-packages/triton/backends/nvidia/compiler.py", line 297, in make_cubin
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205]     raise RuntimeError(f'Internal Triton PTX codegen error: \n{log}')
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] RuntimeError: Internal Triton PTX codegen error:
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] ptxas /tmp/tmpjp5bgu11.ptx, line 66; error   : Feature '.bf16' requires .target sm_80 or higher
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] ptxas /tmp/tmpjp5bgu11.ptx, line 66; error   : Feature 'cvt.bf16.f32' requires .target sm_80 or higher
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] ptxas /tmp/tmpjp5bgu11.ptx, line 69; error   : Feature '.bf16' requires .target sm_80 or higher
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] ptxas /tmp/tmpjp5bgu11.ptx, line 69; error   : Feature 'cvt.bf16.f32' requires .target sm_80 or higher
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] ptxas /tmp/tmpjp5bgu11.ptx, line 72; error   : Feature '.bf16' requires .target sm_80 or higher
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] ptxas /tmp/tmpjp5bgu11.ptx, line 72; error   : Feature 'cvt.bf16.f32' requires .target sm_80 or higher
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] ptxas /tmp/tmpjp5bgu11.ptx, line 75; error   : Feature '.bf16' requires .target sm_80 or higher
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] ptxas /tmp/tmpjp5bgu11.ptx, line 75; error   : Feature 'cvt.bf16.f32' requires .target sm_80 or higher
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] ptxas /tmp/tmpjp5bgu11.ptx, line 78; error   : Feature '.bf16' requires .target sm_80 or higher
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] ptxas /tmp/tmpjp5bgu11.ptx, line 78; error   : Feature 'cvt.bf16.f32' requires .target sm_80 or higher
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] ptxas /tmp/tmpjp5bgu11.ptx, line 81; error   : Feature '.bf16' requires .target sm_80 or higher
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] ptxas /tmp/tmpjp5bgu11.ptx, line 81; error   : Feature 'cvt.bf16.f32' requires .target sm_80 or higher
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] ptxas /tmp/tmpjp5bgu11.ptx, line 84; error   : Feature '.bf16' requires .target sm_80 or higher
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] ptxas /tmp/tmpjp5bgu11.ptx, line 84; error   : Feature 'cvt.bf16.f32' requires .target sm_80 or higher
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] ptxas /tmp/tmpjp5bgu11.ptx, line 87; error   : Feature '.bf16' requires .target sm_80 or higher
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] ptxas /tmp/tmpjp5bgu11.ptx, line 87; error   : Feature 'cvt.bf16.f32' requires .target sm_80 or higher
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] ptxas fatal   : Ptx assembly aborted due to errors
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205]
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] """
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205]
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] The above exception was the direct cause of the following exception:
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205]
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] Traceback (most recent call last):
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205]   File "/home/ubuntu/nanoGPT/.venv/lib/python3.9/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 203, in callback
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205]     result = future.result()
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205]   File "/home/ubuntu/.pyenv/versions/3.9.17/lib/python3.9/concurrent/futures/_base.py", line 439, in result
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205]     return self.__get_result()
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205]   File "/home/ubuntu/.pyenv/versions/3.9.17/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205]     raise self._exception
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] RuntimeError: Internal Triton PTX codegen error:
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] ptxas /tmp/tmpjp5bgu11.ptx, line 66; error   : Feature '.bf16' requires .target sm_80 or higher
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] ptxas /tmp/tmpjp5bgu11.ptx, line 66; error   : Feature 'cvt.bf16.f32' requires .target sm_80 or higher
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] ptxas /tmp/tmpjp5bgu11.ptx, line 69; error   : Feature '.bf16' requires .target sm_80 or higher
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] ptxas /tmp/tmpjp5bgu11.ptx, line 69; error   : Feature 'cvt.bf16.f32' requires .target sm_80 or higher
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] ptxas /tmp/tmpjp5bgu11.ptx, line 72; error   : Feature '.bf16' requires .target sm_80 or higher
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] ptxas /tmp/tmpjp5bgu11.ptx, line 72; error   : Feature 'cvt.bf16.f32' requires .target sm_80 or higher
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] ptxas /tmp/tmpjp5bgu11.ptx, line 75; error   : Feature '.bf16' requires .target sm_80 or higher
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] ptxas /tmp/tmpjp5bgu11.ptx, line 75; error   : Feature 'cvt.bf16.f32' requires .target sm_80 or higher
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] ptxas /tmp/tmpjp5bgu11.ptx, line 78; error   : Feature '.bf16' requires .target sm_80 or higher
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] ptxas /tmp/tmpjp5bgu11.ptx, line 78; error   : Feature 'cvt.bf16.f32' requires .target sm_80 or higher
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] ptxas /tmp/tmpjp5bgu11.ptx, line 81; error   : Feature '.bf16' requires .target sm_80 or higher
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] ptxas /tmp/tmpjp5bgu11.ptx, line 81; error   : Feature 'cvt.bf16.f32' requires .target sm_80 or higher
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] ptxas /tmp/tmpjp5bgu11.ptx, line 84; error   : Feature '.bf16' requires .target sm_80 or higher
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] ptxas /tmp/tmpjp5bgu11.ptx, line 84; error   : Feature 'cvt.bf16.f32' requires .target sm_80 or higher
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] ptxas /tmp/tmpjp5bgu11.ptx, line 87; error   : Feature '.bf16' requires .target sm_80 or higher
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] ptxas /tmp/tmpjp5bgu11.ptx, line 87; error   : Feature 'cvt.bf16.f32' requires .target sm_80 or higher
E0813 06:09:49.845341 140336920782592 torch/_inductor/compile_worker/subproc_pool.py:205] ptxas fatal   : Ptx assembly aborted due to errors

PyTorch version: 2.4.0+cu118

nvcc version:

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

Output of nvidia-smi:

Tue Aug 13 06:11:55 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   30C    P8    14W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
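The log above already contains both halves of the mismatch: ptxas was invoked with --gpu-name=sm_75 (the T4), while the rejected instructions require sm_80. A small sketch that pulls those numbers out of such a log is shown here. `required_targets` and `requested_target` are hypothetical names, the regexes only cover the message shapes seen in this thread, and the sample string is abbreviated from the log:

```python
import re

REQUIRES = re.compile(r"Feature '([^']+)' requires \.target sm_(\d+) or higher")
GPU_NAME = re.compile(r"--gpu-name=sm_(\d+)")


def required_targets(log):
    """Map each rejected PTX feature to the minimum sm version ptxas names."""
    return {feature: int(sm) for feature, sm in REQUIRES.findall(log)}


def requested_target(log):
    """Return the sm version passed to ptxas via --gpu-name, or None."""
    match = GPU_NAME.search(log)
    return int(match.group(1)) if match else None


# Abbreviated sample; the ptxas path is shortened for illustration.
sample = """\
Command 'ptxas -lineinfo -v --gpu-name=sm_75 /tmp/tmpjp5bgu11.ptx' returned non-zero exit status 255.
ptxas /tmp/tmpjp5bgu11.ptx, line 66; error   : Feature '.bf16' requires .target sm_80 or higher
ptxas /tmp/tmpjp5bgu11.ptx, line 66; error   : Feature 'cvt.bf16.f32' requires .target sm_80 or higher
"""

print(requested_target(sample), required_targets(sample))
# 75 {'.bf16': 80, 'cvt.bf16.f32': 80}
```

When the requested target is lower than every required one, as here, the GPU itself cannot run the generated kernels, so the remaining options are the ones discussed above: disable compilation or move to an sm_80 device.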