pytorch-labs / gpt-fast

Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.
BSD 3-Clause "New" or "Revised" License
5.34k stars 484 forks source link

CUDA error if enabling compile_prefill for quantization model (int8) #137

Open yanboliang opened 3 months ago

yanboliang commented 3 months ago

Repro command:

python generate.py --compile --compile_prefill --checkpoint_path checkpoints/$MODEL_REPO/model_int8.pth

Errors:

(pt) [ybliang@devgpu002.ash8 ~/local/gpt-fast (main)]$ python generate.py --compile --compile_prefill --checkpoint_path checkpoints/$MODEL_REPO/model_int8.pth
/home/ybliang/local/miniconda3/envs/pt/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
Using device=cuda
Loading model ...
Using int8 weight-only quantization!
Time to load model: 6.15 seconds
/home/ybliang/local/pytorch/torch/backends/cuda/__init__.py:342: FutureWarning: torch.backends.cuda.sdp_kernel() is deprecated. In the future, this context manager will be removed. Please see, torch.nn.attention.sdpa_kernel() for the new context manager, with updated signature.
  warnings.warn(
unknown:0: unknown: block: [0,0,0], thread: [128,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [129,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [130,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [131,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [132,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [133,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [134,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [135,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [136,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [137,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [138,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [139,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [140,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [141,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [142,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [143,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [144,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [145,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [146,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [147,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [148,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [149,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [150,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [151,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [152,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [153,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [154,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [155,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [156,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [157,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [158,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [159,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [192,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [193,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [194,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [195,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [196,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [197,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [198,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [199,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [200,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [201,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [202,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [203,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [204,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [205,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [206,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [207,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [208,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [209,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [210,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [211,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [212,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [213,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [214,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [215,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [216,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [217,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [218,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [219,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [220,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [221,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [222,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [223,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [160,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [161,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [162,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [163,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [164,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [165,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [166,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [167,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [168,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [169,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [170,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [171,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [172,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [173,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [174,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [175,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [176,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [177,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [178,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [179,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [180,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [181,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [182,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [183,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [184,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [185,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [186,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [187,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [188,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [189,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [190,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [191,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [64,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [65,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [66,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [67,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [68,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [69,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [70,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [71,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [72,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [73,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [74,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [75,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [76,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [77,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [78,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [79,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [80,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [81,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [82,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [83,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [84,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [85,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [86,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [87,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [88,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [89,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [90,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [91,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [92,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [93,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [94,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [95,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [224,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [225,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [226,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [227,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [228,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [229,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [230,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [231,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [232,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [233,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [234,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [235,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [236,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [237,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [238,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [239,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [240,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [241,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [242,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [243,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [244,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [245,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [246,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [247,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [248,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [249,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [250,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [251,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [252,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [253,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [254,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [255,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [32,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [33,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [34,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [35,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [36,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [37,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [38,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [39,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [40,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [41,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [42,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [43,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [44,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [45,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [46,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [47,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [48,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [49,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [50,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [51,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [52,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [53,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [54,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [55,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [56,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [57,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [58,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [59,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [60,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [61,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [62,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [63,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [0,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [1,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [2,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [3,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [4,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [5,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [6,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [7,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [8,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [9,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [10,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [11,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [12,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [13,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [14,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [15,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [16,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [17,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [18,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [19,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [20,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [21,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [22,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [23,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [24,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [25,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [26,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [27,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [28,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [29,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [30,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [31,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [96,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [97,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [98,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [99,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [100,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [101,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [102,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [103,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [104,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [105,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [106,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [107,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [108,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [109,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [110,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [111,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [112,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [113,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [114,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [115,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [116,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [117,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [118,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [119,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [120,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [121,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [122,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [123,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [124,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [125,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [126,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
unknown:0: unknown: block: [0,0,0], thread: [127,0,0] Assertion `index out of bounds: 0 <= tmp4 < 32000` failed.
Traceback (most recent call last):
  File "/data/users/ybliang/gpt-fast/generate.py", line 421, in <module>
    main(
  File "/data/users/ybliang/gpt-fast/generate.py", line 359, in main
    y, metrics = generate(
  File "/home/ybliang/local/pytorch/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/users/ybliang/gpt-fast/generate.py", line 202, in generate
    generated_tokens, _ = decode_n_tokens(model, next_token.view(1, -1), input_pos, max_new_tokens - 1, callback=callback, **sampling_kwargs)
  File "/data/users/ybliang/gpt-fast/generate.py", line 74, in decode_n_tokens
    next_token, next_prob = decode_one_token(
  File "/home/ybliang/local/pytorch/torch/_dynamo/eval_frame.py", line 450, in _fn
    return fn(*args, **kwargs)
  File "/data/users/ybliang/gpt-fast/generate.py", line 64, in decode_one_token
    def decode_one_token(model: Transformer, x: torch.Tensor, input_pos: torch.Tensor, **sampling_kwargs) -> Tuple[torch.Tensor, torch.Tensor]:
  File "/home/ybliang/local/pytorch/torch/_dynamo/eval_frame.py", line 450, in _fn
    return fn(*args, **kwargs)
  File "/home/ybliang/local/pytorch/torch/_dynamo/external_utils.py", line 36, in inner
    return fn(*args, **kwargs)
  File "/home/ybliang/local/pytorch/torch/_functorch/aot_autograd.py", line 917, in forward
    return compiled_fn(full_args)
  File "/home/ybliang/local/pytorch/torch/_functorch/_aot_autograd/utils.py", line 89, in g
    return f(*args)
  File "/home/ybliang/local/pytorch/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 106, in runtime_wrapper
    all_outs = call_func_at_runtime_with_args(
  File "/home/ybliang/local/pytorch/torch/_functorch/_aot_autograd/utils.py", line 113, in call_func_at_runtime_with_args
    out = normalize_as_list(f(args))
  File "/home/ybliang/local/pytorch/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 152, in rng_functionalization_wrapper
    return compiled_fw(args)
  File "/home/ybliang/local/pytorch/torch/_inductor/codecache.py", line 906, in __call__
    return self.get_current_callable()(inputs)
  File "/home/ybliang/local/pytorch/torch/_inductor/compile_fx.py", line 838, in run
    return compiled_fn(new_inputs)
  File "/home/ybliang/local/pytorch/torch/_inductor/cudagraph_trees.py", line 383, in deferred_cudagraphify
    fn, out = cudagraphify(model, inputs, new_static_input_idxs, *args, **kwargs)
  File "/home/ybliang/local/pytorch/torch/_inductor/cudagraph_trees.py", line 411, in cudagraphify
    return manager.add_function(
  File "/home/ybliang/local/pytorch/torch/_inductor/cudagraph_trees.py", line 1943, in add_function
    return fn, fn(inputs)
  File "/home/ybliang/local/pytorch/torch/_inductor/cudagraph_trees.py", line 1757, in run
    out = self._run(new_inputs, function_id)
  File "/home/ybliang/local/pytorch/torch/_inductor/cudagraph_trees.py", line 1798, in _run
    return self.run_eager(new_inputs, function_id)
  File "/home/ybliang/local/pytorch/torch/_inductor/cudagraph_trees.py", line 1913, in run_eager
    return node.run(new_inputs)
  File "/home/ybliang/local/pytorch/torch/_inductor/cudagraph_trees.py", line 616, in run
    out = self.wrapped_function.model(new_inputs)
  File "/home/ybliang/local/pytorch/torch/_inductor/codecache.py", line 934, in _run_from_cache
    return compiled_graph.compiled_artifact(inputs)
  File "/tmp/torchinductor_ybliang/mi/cmiek2ltsrliaqercc2b6xcfebjyeel2kxpgdgc65xbyxpekhh5j.py", line 2020, in call
    triton_red_fused_add_bmm_embedding_mm_mul_11.run(buf19, arg75_1, buf20, arg77_1, arg78_1, arg455_1, arg65_1, buf16, arg73_1, arg79_1, buf22, 4096, 11008, grid=grid(4096), stream=stream0)
  File "/home/ybliang/local/pytorch/torch/_inductor/triton_heuristics.py", line 635, in run
    self.autotune_to_one_config(*args, grid=grid, **kwargs)
  File "/home/ybliang/local/pytorch/torch/_inductor/triton_heuristics.py", line 531, in autotune_to_one_config
    timings = self.benchmark_all_configs(*args, **kwargs)
  File "/home/ybliang/local/pytorch/torch/_dynamo/utils.py", line 262, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/ybliang/local/pytorch/torch/_inductor/triton_heuristics.py", line 507, in benchmark_all_configs
    timings = {
  File "/home/ybliang/local/pytorch/torch/_inductor/triton_heuristics.py", line 508, in <dictcomp>
    launcher: self.bench(launcher, *args, **kwargs)
  File "/home/ybliang/local/pytorch/torch/_inductor/triton_heuristics.py", line 479, in bench
    return do_bench(kernel_call, rep=40, fast_flush=True)
  File "/home/ybliang/local/pytorch/torch/_inductor/utils.py", line 170, in do_bench
    return triton_do_bench(*args, **kwargs)[0]
  File "/data/users/ybliang/triton/python/triton/testing.py", line 101, in do_bench
    torch.cuda.synchronize()
  File "/home/ybliang/local/pytorch/torch/cuda/__init__.py", line 792, in synchronize
    return torch._C._cuda_synchronize()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

generated kernel file: https://gist.github.com/yanboliang/6f5c1171e63909b995b5372dc7c88ab7

jerrymannil commented 3 months ago

Seeing similar issues with AMD gpus as well. With AMD GPUs, we are seeing a memory fault rather that device assertions. Looks like kernels generated for AMD doesn't have these device asserts.

Memory access fault by GPU node-3 (Agent handle: 0x80e6680) on address 0x7eff45229000. Reason: Unknown.
Aborted (core dumped)
jerrymannil commented 3 months ago

Observations:

  1. Running with "--compile_prefill" alone without "--compile" can run fine (i.e I had to move prefill compile outside of if compile check
  2. The error happens during the first kernel run for decode
  3. The generated wrapper code can run by itself without this error.

So it seems to me the error is related to some interactions b/w the compiled prefill and decode kernels

jerrymannil commented 3 months ago

Looks like prefill compile can work, if I change next_token.view(1, -1) to next_token.clone().view(1, -1) here

griff4692 commented 2 weeks ago

Is there a resolution to this problem for --compile only? I am still getting it

pytorch-triton==3.0.0+45fff310c8
torch==2.4.0.dev20240527+cu121
torchaudio==2.2.0.dev20240528+cu121
torchvision==0.19.0.dev20240528+cu121
yanboliang commented 2 weeks ago

@griff4692 Does https://github.com/pytorch-labs/gpt-fast/issues/137#issuecomment-2025959457 work?

griff4692 commented 2 weeks ago

@griff4692 Does #137 (comment) work?

Nope unfortunately -- it looks like in current code the next_token is already cloned anyway

https://github.com/pytorch-labs/gpt-fast/issues/137#issuecomment-2025959457

yanboliang commented 2 weeks ago

@griff4692 It seems you hit a different issue other than this one, I tried your command and it works well at gpt-fast. So I suspect it's some change on the context-compression side triggered a cudagraph error. I'm looking at which triggers it now.