pytorch / torchdynamo

A Python-level JIT compiler designed to make unmodified PyTorch programs faster.
BSD 3-Clause "New" or "Revised" License

benchmark training_loss.py with inductor got RuntimeError: Triton Error [CUDA]: invalid argument #1632

Closed SeaOfOcean closed 2 years ago

SeaOfOcean commented 2 years ago

I followed the main setup instructions and ran the following example:

python training_loss.py --epoch 1

My environment is as follows:

Driver Version: 470.82.01    CUDA Version: 11.7
Python 3.8.13
torch version '1.14.0.dev20221012+cu117'
pip install -U "git+https://github.com/openai/triton@af76c989eb4799b015f8b288ccd8421558772e56#subdirectory=python"
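For anyone reproducing this, a small script along the following lines could gather the same version information in one place. It is only a sketch using the standard library: the distribution names passed in (`torch`, `triton`, `torchdynamo`) are assumptions about how the packages are registered with pip, and the helper name `collect_versions` is made up for illustration.

```python
import importlib.metadata as md
import platform


def collect_versions(packages=("torch", "triton", "torchdynamo")):
    """Return a dict of the Python version plus installed package versions.

    The default package names are assumptions; adjust them to match
    what `pip list` shows in your environment.
    """
    info = {"python": platform.python_version()}
    for name in packages:
        try:
            info[name] = md.version(name)
        except md.PackageNotFoundError:
            # Report missing packages instead of failing the whole report.
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

The driver and CUDA versions are not covered by this sketch; those still come from `nvidia-smi`.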

But I got the following error:

(base) root@iZ6weiy0cic40635z4n81hZ:/workspace/torchdynamo/benchmarks# python training_loss.py --epoch 1
WARNING:datasets.builder:Found cached dataset yelp_review_full (/root/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 475.25it/s]
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /root/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf/cache-e7c96b5d8739ab92.arrow
WARNING:datasets.arrow_dataset:Loading cached processed dataset at /root/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf/cache-0f565804218510af.arrow
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/root/miniconda3/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: /root/miniconda3/lib/python3.8/site-packages/torchvision/image.so: undefined symbol: _ZN3c104cuda20CUDACachingAllocator12recordStreamERKNS_7DataPtrENS0_10CUDAStreamE
  warn(f"Failed to load image Python extension: {e}")
WARNING:torchinductor.lowering:make_fallback(aten.cumsum): a decomposition exists, we should switch to it
WARNING:torchinductor.lowering:make_fallback(aten.unfold): a decomposition exists, we should switch to it
WARNING:torchinductor.lowering:make_fallback(aten.unfold_backward): a decomposition exists, we should switch to it
[2022-10-13 09:48:21,971] torchdynamo.symbolic_convert: [WARNING] Graph break: Tensor.backward from user code at   File "training_loss.py", line 50, in training_iter_fn
    loss.backward()

[2022-10-13 09:48:28,268] torchinductor.lowering: [WARNING] using triton random, expect difference from eager

Traceback (most recent call last):
  File "training_loss.py", line 207, in <module>
    main()
  File "training_loss.py", line 175, in main
    res_loss, accuracy = model_training_evaluation(
  File "training_loss.py", line 72, in model_training_evaluation
    loss = opt_training_iter_fn(batch, model, optimizer)
  File "/root/miniconda3/lib/python3.8/site-packages/torchdynamo/eval_frame.py", line 175, in _fn
    return fn(*args, **kwargs)
  File "training_loss.py", line 47, in training_iter_fn
    def training_iter_fn(batch, model, optimizer):
  File "/root/miniconda3/lib/python3.8/site-packages/torchdynamo/eval_frame.py", line 175, in _fn
    return fn(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/functorch/_src/aot_autograd.py", line 837, in forward
    return compiled_f(
  File "/root/miniconda3/lib/python3.8/site-packages/functorch/_src/aot_autograd.py", line 828, in new_func
    return compiled_fn(args)
  File "/root/miniconda3/lib/python3.8/site-packages/functorch/_src/aot_autograd.py", line 230, in g
    return f(*args)
  File "/root/miniconda3/lib/python3.8/site-packages/functorch/_src/aot_autograd.py", line 474, in compiled_function
    return CompiledFunction.apply(*remove_dupe_args(args))
  File "/root/miniconda3/lib/python3.8/site-packages/torchdynamo/eval_frame.py", line 175, in _fn
    return fn(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/functorch/_src/aot_autograd.py", line 441, in forward
    fw_outs = call_func_with_args(
  File "/root/miniconda3/lib/python3.8/site-packages/functorch/_src/aot_autograd.py", line 255, in call_func_with_args
    out = normalize_as_list(f(args))
  File "/root/miniconda3/lib/python3.8/site-packages/torchinductor/compile_fx.py", line 170, in run
    return model(new_inputs_to_cuda)
  File "/root/miniconda3/lib/python3.8/site-packages/torchinductor/compile_fx.py", line 187, in run
    compiled_fn = cudagraphify_impl(model, new_inputs, static_input_idxs)
  File "/root/miniconda3/lib/python3.8/site-packages/torchinductor/compile_fx.py", line 245, in cudagraphify_impl
    model(list(static_inputs))
  File "/tmp/torchinductor_root/su/csu2hdc3qbadpnhctc4ottqyl4o3zdovvgtlvtqcqwd2mo5jjvyl.py", line 5024, in call
    kernel38.run(buf308, primals_206, seed_cuda_0, buf311, buf312, buf339, buf342, buf345, buf348, buf351, buf354, buf357, buf360, buf363, buf366, buf369, buf372, 49152, 512, grid=grid(49152), stream=stream0)
  File "/root/miniconda3/lib/python3.8/site-packages/torchinductor/triton_ops/autotune.py", line 160, in run
    result = launcher(
  File "<string>", line 4, in launcher
RuntimeError: Triton Error [CUDA]: invalid argument

soumith commented 2 years ago

Hi, can you paste the code that generated this error?

SeaOfOcean commented 2 years ago

I used the original torchdynamo benchmark example, unmodified: https://github.com/pytorch/torchdynamo/blob/main/benchmarks/training_loss.py

python benchmarks/training_loss.py --epoch 1

SeaOfOcean commented 2 years ago

Building torch from master resolved this issue.
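Since the fix came from moving off the nightly wheel, it can help to read the build date and CUDA tag out of a nightly version string like the one reported above. The snippet below is a rough sketch: the regular expression and the helper name `parse_nightly` are illustrative assumptions, not part of torch, and the thread does not say which build first contains the fix.

```python
import re
from datetime import date

# Matches nightly version strings shaped like "1.14.0.dev20221012+cu117".
# The pattern is an assumption based on the version quoted in this issue.
_NIGHTLY = re.compile(
    r"^(?P<base>\d+\.\d+\.\d+)\.dev(?P<date>\d{8})(?:\+(?P<local>\w+))?$"
)


def parse_nightly(version):
    """Split a nightly version string into base version, build date, and local tag.

    Returns None when the string does not look like a nightly build.
    """
    match = _NIGHTLY.match(version)
    if match is None:
        return None
    stamp = match.group("date")
    return {
        "base": match.group("base"),
        "date": date(int(stamp[:4]), int(stamp[4:6]), int(stamp[6:])),
        "local": match.group("local"),
    }


info = parse_nightly("1.14.0.dev20221012+cu117")
```

With `torch` installed, `parse_nightly(torch.__version__)` would give the same breakdown for the local build, which makes it easier to report exactly which nightly was in use.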