pytorch / torchchat

Run PyTorch LLMs locally on servers, desktop and mobile
BSD 3-Clause "New" or "Revised" License

AOTI generated .so models output incorrect values #294

Closed: HDCharles closed 5 months ago

HDCharles commented 5 months ago

Running eval/generate after exporting to .so vs. not gives very different results (and, in the .so case, unintelligible ones).

Repro

export MODEL_REPO=meta-llama/Llama-2-7b-hf

python export.py --checkpoint-path checkpoints/$MODEL_REPO/model.pth \
    --dtype bfloat16 --device cuda \
    --output-dso-path checkpoints/$MODEL_REPO/model.so

python generate.py --checkpoint-path checkpoints/$MODEL_REPO/model.pth \
    --dtype bfloat16 --device cuda \
    --dso-path checkpoints/$MODEL_REPO/model.so

python generate.py --checkpoint-path checkpoints/$MODEL_REPO/model.pth \
    --dtype bfloat16 --device cuda
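For a quantitative check (rather than eyeballing generations), one can compare logits between the eager model and the loaded .so on identical inputs. A minimal sketch, assuming a nightly where torch._export.aot_load is available and a torchchat-style forward that takes (tokens, input_pos); the helper below is illustrative, not torchchat code:

# Hedged sanity check: compare eager vs. AOTI .so logits on the same inputs.
# torch._export.aot_load and the (tokens, input_pos) calling convention are
# assumptions about the installed nightly / exported model.
import torch

def max_logit_diff(eager_model, so_path, tokens, input_pos):
    with torch.no_grad():
        eager_logits = eager_model(tokens, input_pos)
    aoti_model = torch._export.aot_load(so_path, "cuda")
    aoti_logits = aoti_model(tokens, input_pos)
    # KV caches make subsequent calls stateful, so compare the first call only.
    return (eager_logits - aoti_logits).abs().max().item()

A difference on the order of bfloat16 rounding error would indicate the export is numerically faithful; the garbled generations below suggest something much larger.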

Result

python export.py --checkpoint-path checkpoints/$MODEL_REPO/model.pth \
    --dtype bfloat16 --device cuda \
    --output-dso-path checkpoints/$MODEL_REPO/model.so

Warning: command export does not support option pte-path
Warning: command export does not support option dso-path
Namespace(seed=1234, prompt='Hello, my name is', tiktoken=False, chat=False, is_chat_model=False, gui=False, num_samples=1, max_new_tokens=200, top_k=200, temperature=0.8, compile=False, compile_prefill=False, profile=None, speculate_k=5, draft_checkpoint_path=None, checkpoint_path=PosixPath('checkpoints/meta-llama/Llama-2-7b-hf/model.pth'), checkpoint_dir=None, params_path=None, gguf_path=None, tokenizer_path=None, output_pte_path=None, output_dso_path='checkpoints/meta-llama/Llama-2-7b-hf/model.so', dso_path=None, pte_path=None, dtype='bfloat16', verbose=False, quantize='{ }', params_table=None, device='cuda', tasks=['wikitext'], limit=None, max_seq_length=None)
Using device=cuda
Loading model ...
name Llama-2-7b-hf
Time to load model: 11.23 seconds
Time to quantize model: 0.00 seconds
Exporting model using AOT Inductor to /home/cdhernandez/local/torchchat/checkpoints/meta-llama/Llama-2-7b-hf/model.so
len(input)=2
The generated DSO model can be found at: /home/cdhernandez/local/torchchat/checkpoints/meta-llama/Llama-2-7b-hf/model.so

python generate.py --checkpoint-path checkpoints/$MODEL_REPO/model.pth \
    --dtype bfloat16 --device cuda \
    --dso-path checkpoints/$MODEL_REPO/model.so

Warning: command generate does not support option output-pte-path
Warning: command generate does not support option output-dso-path
Warning: checkpoint path ignored because an exported DSO or PTE path specified
Warning: checkpoint path ignored because an exported DSO or PTE path specified
Loading model ...
name Llama-2-7b-hf
Time to load model: 11.80 seconds
/home/cdhernandez/local/pytorch/torch/backends/cuda/__init__.py:342: FutureWarning: torch.backends.cuda.sdp_kernel() is deprecated. In the future, this context manager will be removed. Please see torch.nn.attention.sdpa_kernel() for the new context manager, with updated signature.
  warnings.warn(
Hello, my name is an an an, we have have is keeping a aka is the people, 14T 3 may have the same, perhaps the every risk taking I and the 2 certain he Daniel T: it =L0man an long T: Trans 1 when'good became, so as Take Care Principal mayer’looking in this’Future 4. n...g thefian (the is a. l8, we cour (O looked en. (U 91 99f784ne Re, maybe we any FY CH3, no200da–i/l3s (blue in 6-Work, sometimes run” to 200-turn. M in An/l3 and 4 will (asenthal.patchd their committee of the cannab, u22000da - a smell-R (R, told. TheWell?meb

python generate.py --checkpoint-path checkpoints/$MODEL_REPO/model.pth \
    --dtype bfloat16 --device cuda

Warning: command generate does not support option output-pte-path
Warning: command generate does not support option output-dso-path
Loading model ...
name Llama-2-7b-hf
Time to load model: 10.79 seconds
Time to quantize model: 0.00 seconds
/home/cdhernandez/local/pytorch/torch/backends/cuda/__init__.py:342: FutureWarning: torch.backends.cuda.sdp_kernel() is deprecated. In the future, this context manager will be removed. Please see torch.nn.attention.sdpa_kernel() for the new context manager, with updated signature.
  warnings.warn(
Hello, my name is TJ. I am a stay at home dad with two kids under the age of 3. I am just starting to dabble in the world of 3D printing and I am hoping to learn as much as possible from the community. I have designed and printed a few things so far and I am hoping to share some of my designs in the near future. Welcome to the forum. If you're in the UK we should get along swimmingly. What area are you in? Thanks for the warm welcome! I'm in the Finger Lakes, NY about 2 hours from Rochester. I'm in Middlesbrough, near the North Sea in the UK. MATCO Electronics has developed the next generation of the MATCO Automated Test Stand. Designed with the latest technology, it includes a powerful Intel Core i3 CPU and 500GB Solid State Drive.

guangy10 commented 5 months ago

What is the version of torch? On an A100 or something else?
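For reference, the relevant environment details can be captured with standard torch introspection calls:

# Print torch build, CUDA toolkit version, and GPU name (standard torch APIs).
import torch

print("torch:", torch.__version__)
print("cuda:", torch.version.cuda)
if torch.cuda.is_available():
    print("gpu:", torch.cuda.get_device_name(0))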

guangy10 commented 5 months ago

I cannot reproduce it. With the latest torch nightly, a float16 7b model using cuda w/ aoti works just fine.

bash ./scripts/workflow.sh "cuda" "openlm-research/open_llama_7b" "aoti"

###############################################################
############## Start LLama-fast Model Validation ##############
###############################################################

############### Validating open_llama_7b ###############
Checkpoint directory for openlm-research/open_llama_7b is not empty. Skipping download.
Converted checkpoint already exists. Skipping conversion for openlm-research/open_llama_7b.
############### Run inference with AOT Inductor for dtype bfloat16 ###############

******************************************
************** non-quantized *************
******************************************
Warning: command export does not support option pte-path
Warning: command export does not support option dso-path
Using device=cuda
Loading model ...
name open_llama_7b
Time to load model: 3.55 seconds
Time to quantize model: 0.00 seconds
Exporting model using AOT Inductor to /data/users/guangyang/torchat/checkpoints/openlm-research/open_llama_7b/model.so
len(input)=2
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379] Triton compilation failed: triton_poi_fused__scaled_dot_product_efficient_attention_index_put_2
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379] def triton_poi_fused__scaled_dot_product_efficient_attention_index_put_2(in_ptr0, in_ptr1, in_ptr2, in_ptr3, in_ptr4, out_ptr0, out_ptr1, out_ptr2, xnumel, XBLOCK : tl.constexpr):
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     xoffset = tl.program_id(0) * XBLOCK
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     xindex = xoffset + tl.arange(0, XBLOCK)[:]
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     xmask = xindex < xnumel
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     x2 = (xindex // 4096)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     x0 = xindex % 128
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     x3 = (xindex // 128)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     x1 = (xindex // 128) % 32
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     x4 = xindex
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp0 = tl.load(in_ptr0 + (x2), None, eviction_policy='evict_last')
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp48 = tl.load(in_ptr3 + (x4), None).to(tl.float32)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp1 = tmp0 + 352
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp2 = tmp0 < 0
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp3 = tl.where(tmp2, tmp1, tmp0)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tl.device_assert((0 <= tmp3) & (tmp3 < 352), "index out of bounds: 0 <= tmp3 < 352")
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp4 = x0 % 2
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp5 = tl.full([1], 0, tl.int64)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp6 = tmp4 >= tmp5
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp7 = tl.full([1], 1, tl.int64)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp8 = tmp4 < tmp7
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp9 = tl.load(in_ptr1 + ((2*(x0 // 2)) + (128*x3)), tmp8, eviction_policy='evict_last', other=0.0).to(tl.float32)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp10 = tmp9.to(tl.float32)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp11 = tl.load(in_ptr0 + (x2), tmp8, eviction_policy='evict_last', other=0.0)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp12 = tmp11 + 4096
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp13 = tmp11 < 0
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp14 = tl.where(tmp13, tmp12, tmp11)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tl.device_assert(((0 <= tmp14) & (tmp14 < 4096)) | ~tmp8, "index out of bounds: 0 <= tmp14 < 4096")
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp15 = tl.load(in_ptr2 + ((2*(x0 // 2)) + (128*tmp14)), tmp8, eviction_policy='evict_last', other=0.0).to(tl.float32)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp16 = tmp15.to(tl.float32)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp17 = tmp10 * tmp16
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp18 = tl.load(in_ptr1 + (1 + (2*(x0 // 2)) + (128*x3)), tmp8, eviction_policy='evict_last', other=0.0).to(tl.float32)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp19 = tmp18.to(tl.float32)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp20 = tl.load(in_ptr2 + (1 + (2*(x0 // 2)) + (128*tmp14)), tmp8, eviction_policy='evict_last', other=0.0).to(tl.float32)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp21 = tmp20.to(tl.float32)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp22 = tmp19 * tmp21
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp23 = tmp17 - tmp22
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp24 = tl.full(tmp23.shape, 0.0, tmp23.dtype)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp25 = tl.where(tmp8, tmp23, tmp24)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp26 = tmp4 >= tmp7
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp27 = tl.full([1], 2, tl.int64)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp28 = tmp4 < tmp27
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp29 = tl.load(in_ptr1 + (1 + (2*(x0 // 2)) + (128*x3)), tmp26, eviction_policy='evict_last', other=0.0).to(tl.float32)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp30 = tmp29.to(tl.float32)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp31 = tl.load(in_ptr0 + (x2), tmp26, eviction_policy='evict_last', other=0.0)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp32 = tmp31 + 4096
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp33 = tmp31 < 0
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp34 = tl.where(tmp33, tmp32, tmp31)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tl.device_assert(((0 <= tmp34) & (tmp34 < 4096)) | ~tmp26, "index out of bounds: 0 <= tmp34 < 4096")
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp35 = tl.load(in_ptr2 + ((2*(x0 // 2)) + (128*tmp34)), tmp26, eviction_policy='evict_last', other=0.0).to(tl.float32)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp36 = tmp35.to(tl.float32)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp37 = tmp30 * tmp36
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp38 = tl.load(in_ptr1 + ((2*(x0 // 2)) + (128*x3)), tmp26, eviction_policy='evict_last', other=0.0).to(tl.float32)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp39 = tmp38.to(tl.float32)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp40 = tl.load(in_ptr2 + (1 + (2*(x0 // 2)) + (128*tmp34)), tmp26, eviction_policy='evict_last', other=0.0).to(tl.float32)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp41 = tmp40.to(tl.float32)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp42 = tmp39 * tmp41
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp43 = tmp37 + tmp42
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp44 = tl.full(tmp43.shape, 0.0, tmp43.dtype)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp45 = tl.where(tmp26, tmp43, tmp44)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp46 = tl.where(tmp8, tmp25, tmp45)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp47 = tmp46.to(tl.float32)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp49 = tl.load(in_ptr4 + ((2*(x0 // 2)) + (128*x3)), tmp8, eviction_policy='evict_last', other=0.0).to(tl.float32)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp50 = tmp49.to(tl.float32)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp51 = tmp50 * tmp16
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp52 = tl.load(in_ptr4 + (1 + (2*(x0 // 2)) + (128*x3)), tmp8, eviction_policy='evict_last', other=0.0).to(tl.float32)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp53 = tmp52.to(tl.float32)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp54 = tmp53 * tmp21
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp55 = tmp51 - tmp54
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp56 = tl.full(tmp55.shape, 0.0, tmp55.dtype)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp57 = tl.where(tmp8, tmp55, tmp56)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp58 = tl.load(in_ptr4 + (1 + (2*(x0 // 2)) + (128*x3)), tmp26, eviction_policy='evict_last', other=0.0).to(tl.float32)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp59 = tmp58.to(tl.float32)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp60 = tmp59 * tmp36
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp61 = tl.load(in_ptr4 + ((2*(x0 // 2)) + (128*x3)), tmp26, eviction_policy='evict_last', other=0.0).to(tl.float32)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp62 = tmp61.to(tl.float32)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp63 = tmp62 * tmp41
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp64 = tmp60 + tmp63
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp65 = tl.full(tmp64.shape, 0.0, tmp64.dtype)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp66 = tl.where(tmp26, tmp64, tmp65)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp67 = tl.where(tmp8, tmp57, tmp66)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tmp68 = tmp67.to(tl.float32)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tl.store(out_ptr0 + (x0 + (128*tmp3) + (45056*x1)), tmp47, None)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tl.store(out_ptr1 + (x0 + (128*tmp3) + (45056*x1)), tmp48, None)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     tl.store(out_ptr2 + (x4), tmp68, None)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379] metadata: {'signature': {0: '*i32', 1: '*bf16', 2: '*bf16', 3: '*bf16', 4: '*bf16', 5: '*bf16', 6: '*bf16', 7: '*bf16', 8: 'i32'}, 'device': 0, 'device_type': 'cuda', 'constants': {9: 512}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2, 3, 4, 5, 6, 7, 8), equal_to_1=())], 'num_warps': 4, 'num_stages': 1, 'debug': True, 'cc': 90}
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379] Traceback (most recent call last):
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]   File "/home/guangyang/.conda/envs/torchat/lib/python3.11/site-packages/triton/backends/nvidia/compiler.py", line 239, in make_cubin
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     subprocess.run(cmd, shell=True, check=True)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]   File "/home/guangyang/.conda/envs/torchat/lib/python3.11/subprocess.py", line 569, in run
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     raise CalledProcessError(retcode, process.args,
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379] subprocess.CalledProcessError: Command '/home/guangyang/.conda/envs/torchat/lib/python3.11/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_90a /tmp/tmp2oh7ifxt.ptx -o /tmp/tmp2oh7ifxt.ptx.o 2> /tmp/tmpdegvg5hs.log' returned non-zero exit status 255.
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379] During handling of the above exception, another exception occurred:
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379] Traceback (most recent call last):
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]   File "/home/guangyang/.conda/envs/torchat/lib/python3.11/site-packages/torch/_inductor/triton_heuristics.py", line 377, in _precompile_config
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     binary = triton.compile(*compile_args, **compile_kwargs)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]   File "/home/guangyang/.conda/envs/torchat/lib/python3.11/site-packages/triton/compiler/compiler.py", line 268, in compile
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     next_module = compile_ir(module, metadata)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]   File "/home/guangyang/.conda/envs/torchat/lib/python3.11/site-packages/triton/backends/nvidia/compiler.py", line 267, in <lambda>
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     stages["cubin"] = lambda src, metadata: self.make_cubin(src, metadata, options, self.capability)
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]   File "/home/guangyang/.conda/envs/torchat/lib/python3.11/site-packages/triton/backends/nvidia/compiler.py", line 244, in make_cubin
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]     raise RuntimeError(f'Internal Triton PTX codegen error: \n{log}')
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379] RuntimeError: Internal Triton PTX codegen error:
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379] ptxas /tmp/tmp2oh7ifxt.ptx, line 200; error   : Unexpected instruction types specified for 'sub'
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379] ptxas fatal   : Ptx assembly aborted due to errors
E0419 10:49:42.985000 140704952719168 torch/_inductor/triton_heuristics.py:379]
E0419 10:49:43.311000 140704952719168 torch/_inductor/triton_heuristics.py:379] Triton compilation failed: triton_poi_fused__scaled_dot_product_efficient_attention_index_put_2
E0419 10:49:43.311000 140704952719168 torch/_inductor/triton_heuristics.py:379] metadata: {'signature': {0: '*i32', 1: '*bf16', 2: '*bf16', 3: '*bf16', 4: '*bf16', 5: '*bf16', 6: '*bf16', 7: '*bf16', 8: 'i32'}, 'device': 0, 'device_type': 'cuda', 'constants': {9: 256}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2, 3, 4, 5, 6, 7, 8), equal_to_1=())], 'num_warps': 2, 'num_stages': 1, 'debug': True, 'cc': 90}
E0419 10:49:43.311000 140704952719168 torch/_inductor/triton_heuristics.py:379] Traceback (most recent call last):
E0419 10:49:43.311000 140704952719168 torch/_inductor/triton_heuristics.py:379]   File "/home/guangyang/.conda/envs/torchat/lib/python3.11/site-packages/triton/backends/nvidia/compiler.py", line 239, in make_cubin
E0419 10:49:43.311000 140704952719168 torch/_inductor/triton_heuristics.py:379]     subprocess.run(cmd, shell=True, check=True)
E0419 10:49:43.311000 140704952719168 torch/_inductor/triton_heuristics.py:379]   File "/home/guangyang/.conda/envs/torchat/lib/python3.11/subprocess.py", line 569, in run
E0419 10:49:43.311000 140704952719168 torch/_inductor/triton_heuristics.py:379]     raise CalledProcessError(retcode, process.args,
E0419 10:49:43.311000 140704952719168 torch/_inductor/triton_heuristics.py:379] subprocess.CalledProcessError: Command '/home/guangyang/.conda/envs/torchat/lib/python3.11/site-packages/triton/backends/nvidia/bin/ptxas -lineinfo -v --gpu-name=sm_90a /tmp/tmpxkr9afvx.ptx -o /tmp/tmpxkr9afvx.ptx.o 2> /tmp/tmp2vx8bmik.log' returned non-zero exit status 255.
E0419 10:49:43.311000 140704952719168 torch/_inductor/triton_heuristics.py:379]
E0419 10:49:43.311000 140704952719168 torch/_inductor/triton_heuristics.py:379] During handling of the above exception, another exception occurred:
E0419 10:49:43.311000 140704952719168 torch/_inductor/triton_heuristics.py:379]
E0419 10:49:43.311000 140704952719168 torch/_inductor/triton_heuristics.py:379] Traceback (most recent call last):
E0419 10:49:43.311000 140704952719168 torch/_inductor/triton_heuristics.py:379]   File "/home/guangyang/.conda/envs/torchat/lib/python3.11/site-packages/torch/_inductor/triton_heuristics.py", line 377, in _precompile_config
E0419 10:49:43.311000 140704952719168 torch/_inductor/triton_heuristics.py:379]     binary = triton.compile(*compile_args, **compile_kwargs)
E0419 10:49:43.311000 140704952719168 torch/_inductor/triton_heuristics.py:379]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E0419 10:49:43.311000 140704952719168 torch/_inductor/triton_heuristics.py:379]   File "/home/guangyang/.conda/envs/torchat/lib/python3.11/site-packages/triton/compiler/compiler.py", line 268, in compile
E0419 10:49:43.311000 140704952719168 torch/_inductor/triton_heuristics.py:379]     next_module = compile_ir(module, metadata)
E0419 10:49:43.311000 140704952719168 torch/_inductor/triton_heuristics.py:379]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E0419 10:49:43.311000 140704952719168 torch/_inductor/triton_heuristics.py:379]   File "/home/guangyang/.conda/envs/torchat/lib/python3.11/site-packages/triton/backends/nvidia/compiler.py", line 267, in <lambda>
E0419 10:49:43.311000 140704952719168 torch/_inductor/triton_heuristics.py:379]     stages["cubin"] = lambda src, metadata: self.make_cubin(src, metadata, options, self.capability)
E0419 10:49:43.311000 140704952719168 torch/_inductor/triton_heuristics.py:379]                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E0419 10:49:43.311000 140704952719168 torch/_inductor/triton_heuristics.py:379]   File "/home/guangyang/.conda/envs/torchat/lib/python3.11/site-packages/triton/backends/nvidia/compiler.py", line 244, in make_cubin
E0419 10:49:43.311000 140704952719168 torch/_inductor/triton_heuristics.py:379]     raise RuntimeError(f'Internal Triton PTX codegen error: \n{log}')
E0419 10:49:43.311000 140704952719168 torch/_inductor/triton_heuristics.py:379] RuntimeError: Internal Triton PTX codegen error:
E0419 10:49:43.311000 140704952719168 torch/_inductor/triton_heuristics.py:379] ptxas /tmp/tmpxkr9afvx.ptx, line 200; error   : Unexpected instruction types specified for 'sub'
E0419 10:49:43.311000 140704952719168 torch/_inductor/triton_heuristics.py:379] ptxas fatal   : Ptx assembly aborted due to errors
E0419 10:49:43.311000 140704952719168 torch/_inductor/triton_heuristics.py:379]
The generated DSO model can be found at: /data/users/guangyang/torchat/checkpoints/openlm-research/open_llama_7b/model.so
Warning: command generate does not support option output-pte-path
Warning: command generate does not support option output-dso-path
Warning: checkpoint path ignored because an exported DSO or PTE path specified
Warning: checkpoint path ignored because an exported DSO or PTE path specified
Loading model ...
name open_llama_7b
Time to load model: 3.68 seconds
Hello, my name is Iris.
My name is Iris. I am 14 years old. I want to paint. I paint from my heart and I enjoy to make art. You can find me in my school time. I am going to high school now. I like it very much! I want to learn to draw. I am good at drawing and I really enjoy to make art and also like to paint. I also love to dance. I like to go to parties. I have a lot of friends. My friends love me and I like my friends. I am very happy with my friends. I am very happy to have a lot of friends. I like to play with my friends. I like to spend time with my friends. I love to take a lot of photos of my friends. I love to take beautiful pictures and I also use my camera to take pictures. I like to make lots of photos. I am very happy with my friends. I am very happy for having a
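Note that the PTX codegen failures above come from the ptxas bundled with Triton (targeting sm_90a, i.e. H100), yet the export still produced a working .so, presumably because Inductor selected a different kernel config. A quick way to check which ptxas Triton invoked and its version, reusing the path from the log (environment-specific, adjust for your setup):

# Query the ptxas binary named in the log above; the path is an example
# from this particular conda environment, not a fixed location.
import subprocess

PTXAS = "/home/guangyang/.conda/envs/torchat/lib/python3.11/site-packages/triton/backends/nvidia/bin/ptxas"
print(subprocess.run([PTXAS, "--version"], capture_output=True, text=True).stdout)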

desertfire commented 5 months ago

What is the difference between "meta-llama/Llama-2-7b-hf" and "openlm-research/open_llama_7b"?

mikekgfb commented 5 months ago

What is the difference between "meta-llama/Llama-2-7b-hf" and "openlm-research/open_llama_7b"?

open_llama_7b is a non-Meta llama lookalike trained by the OSS community. Conveniently, it does not require a token to access, which makes it an attractive stand-in for use in CI.

HDCharles commented 5 months ago

I nuked my env and it looks like it's working now; going to close.