Combined transformer block task graph crash at 2nd run

otabuzzman commented 3 hours ago

Running the transformer block on the OpenCL device and the output layer on the PTX device:

python %TORNADO_SDK%\bin\tornado ^
--jvm="-Dtb.device=1:0 -Dol.device=2:0 -DUseVectorAPI=true -Dtornado.device.memory=2GB" ^
--classpath bin com.otabuzzman.llmj.TestGpt2

Output:

WARNING: Using incubator modules: jdk.incubator.vector
[GPT-2]
max_seq_len: 1024
vocab_size: 50257
padded_vocab_size: 50304
num_layers: 12
num_heads:12
channels: 768
num_parameters: 124475904
[State]
batch_size: 4
seq_len: 64
num_activations: 73347840
final matmul forward took 630 ms
forward pass took 2390 ms
initial matmul_backward took 16388 ms
backward pass took 65557 ms
-43,431618, -43,431763
-39,836346, -39,836475
-43,065910, -43,066040
-42,828045, -42,828159
-43,529541, -43,529667
-44,318398, -44,318520
-41,227425, -41,227551
-41,270760, -41,270901
-42,541393, -42,541542
-42,394997, -42,395138
OK (LOGITS), max_diff = 1,907349e-03
LOSS OK: 5,269892 5,270009
dwte
OK -0,002320 -0,002320
OK 0,002072 0,002072
OK 0,003716 0,003717
OK 0,001307 0,001307
OK 0,000631 0,000632
TENSOR OK, maxdiff = 1,362801e-03
dwpe
OK -0,005118 -0,005110
OK -0,000001 -0,000012
OK -0,003267 -0,003262
OK 0,009909 0,009909
OK 0,002155 0,002145
TENSOR OK, maxdiff = 5,424581e-05
dln1w
OK -0,007520 -0,007523
OK 0,008624 0,008643
OK 0,005004 0,005029
OK -0,011098 -0,011095
OK -0,001667 -0,001664
TENSOR OK, maxdiff = 3,607247e-03
dln1b
OK -0,038494 -0,038458
OK -0,030547 -0,030600
OK 0,010189 0,010223
OK 0,080134 0,080176
OK -0,060990 -0,060901
TENSOR OK, maxdiff = 1,532211e-03
dqkvw
OK -0,000031 -0,000031
OK -0,000026 -0,000025
OK -0,000064 -0,000064
OK 0,000074 0,000074
OK 0,000020 0,000020
TENSOR OK, maxdiff = 5,578995e-04
dqkvb
OK -0,000414 -0,000411
OK -0,000410 -0,000412
OK 0,000113 0,000113
OK -0,000564 -0,000565
OK 0,000574 0,000570
TENSOR OK, maxdiff = 3,139526e-04
dattprojw
OK 0,000081 0,000080
OK -0,000005 -0,000005
OK -0,000019 -0,000019
OK 0,000005 0,000004
OK 0,000031 0,000031
TENSOR OK, maxdiff = 2,254825e-04
dattprojb
OK 0,000456 0,000470
OK -0,009969 -0,009979
OK -0,001794 -0,001804
OK 0,037638 0,037584
OK -0,031287 -0,031239
TENSOR OK, maxdiff = 2,023876e-04
dln2w
OK -0,018372 -0,018312
OK 0,004811 0,004813
OK 0,008084 0,008091
OK -0,001465 -0,001470
OK -0,002740 -0,002737
TENSOR OK, maxdiff = 1,153964e-02
dln2b
OK -0,026405 -0,026368
OK -0,016712 -0,016695
OK 0,001067 0,001074
OK 0,034754 0,034711
OK -0,028630 -0,028584
TENSOR OK, maxdiff = 9,744540e-04
dfcw
OK 0,000438 0,000440
OK -0,000000 -0,000000
OK -0,000153 -0,000154
OK -0,000165 -0,000165
OK 0,000404 0,000405
TENSOR OK, maxdiff = 9,584501e-04
dfcb
OK 0,003282 0,003293
OK 0,002038 0,002043
OK -0,001386 -0,001386
OK 0,000381 0,000386
OK 0,001602 0,001604
TENSOR OK, maxdiff = 2,334719e-04
dfcprojw
OK 0,000678 0,000681
OK 0,000073 0,000073
OK -0,000415 -0,000416
OK -0,000059 -0,000061
OK -0,000603 -0,000604
TENSOR OK, maxdiff = 4,583277e-04
dfcprojb
OK 0,003573 0,003584
OK -0,007148 -0,007158
OK -0,001955 -0,001964
OK 0,001466 0,001462
OK 0,001219 0,001217
TENSOR OK, maxdiff = 1,408812e-04
dlnfw
OK -0,000022 -0,000022
OK 0,000811 0,000811
OK 0,001161 0,001161
OK -0,002956 -0,002957
OK 0,001146 0,001145
TENSOR OK, maxdiff = 3,452301e-04
dlnfb
OK -0,011101 -0,011101
OK 0,008007 0,008007
OK -0,004763 -0,004769
OK -0,002110 -0,002113
OK -0,005903 -0,005905
TENSOR OK, maxdiff = 6,377231e-05
step 0: loss 5,269892 (took 71425 ms) OK = true
final matmul forward took 633 ms
forward pass took 3267 ms
initial matmul_backward took 17774 ms
backward pass took 64116 ms
step 1: loss 4,059391 (took 67682 ms) OK = true
        [TornadoVM-PTX-JNI] ERROR : cuMemAlloc -> Returned: 2
        [TornadoVM-PTX-JNI] ERROR : cuMemcpyHtoDAsyncMemSeg -> Returned: 1
        [TornadoVM-PTX-JNI] ERROR : cuModuleLoadData -> Returned: 700
PTX to cubin JIT compilation failed! (700)
[Bailout] Running the sequential implementation. Enable --debug to see the reason.
        [TornadoVM-PTX-JNI] ERROR : cuStreamSynchronize -> Returned: 700
final matmul forward took 21431 ms
        [TornadoVM-PTX-JNI] ERROR : cuEventDestroy -> Returned: 700
        [TornadoVM-PTX-JNI] ERROR : cuEventDestroy -> Returned: 700
        [TornadoVM-PTX-JNI] ERROR : cuEventDestroy -> Returned: 700
        [TornadoVM-PTX-JNI] ERROR : cuEventDestroy -> Returned: 700
        [TornadoVM-PTX-JNI] ERROR : cuEventDestroy -> Returned: 700
        [TornadoVM-PTX-JNI] ERROR : cuEventDestroy -> Returned: 700
        [TornadoVM-PTX-JNI] ERROR : cuEventDestroy -> Returned: 700
        [TornadoVM-PTX-JNI] ERROR : cuEventDestroy -> Returned: 700
        [TornadoVM-PTX-JNI] ERROR : cuStreamDestroy -> Returned: 700
        [JNI] C:\Users\iuerg\lab\TornadoVM\tornado-drivers\ptx-jni\target\windows-amd64-release\sources\source\PTXStream.cpp:181 in function: free_staging_block result = 700
        [TornadoVM-PTX-JNI] ERROR : cuMemFree -> Returned: 700
        [TornadoVM-PTX-JNI] ERROR : cuModuleUnload -> Returned: 700
forward pass took 23907 ms
initial matmul_backward took 12558 ms
backward pass took 45374 ms
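
For reading the PTX driver return codes in the log above, here is a minimal sketch that maps them to their CUresult names from the CUDA driver API. The class name CudaErrorNames and the method name() are made up for illustration and are not part of TornadoVM or llm.java; only the three codes that appear in the log (plus success) are included.

```java
import java.util.Map;

// Hypothetical helper, not part of TornadoVM or llm.java.
public class CudaErrorNames {
    // Subset of CUresult values as defined in the CUDA driver API header (cuda.h).
    private static final Map<Integer, String> NAMES = Map.of(
        0, "CUDA_SUCCESS",
        1, "CUDA_ERROR_INVALID_VALUE",
        2, "CUDA_ERROR_OUT_OF_MEMORY",
        700, "CUDA_ERROR_ILLEGAL_ADDRESS");

    // Returns the symbolic name, or a placeholder for codes not in the table.
    static String name(int code) {
        return NAMES.getOrDefault(code, "UNKNOWN(" + code + ")");
    }

    public static void main(String[] args) {
        // Codes in the order they first appear in the log.
        for (int code : new int[] {2, 1, 700}) {
            System.out.println(code + " -> " + name(code));
        }
    }
}
```

Read this way, the first failure is cuMemAlloc reporting out of memory on the PTX device at the start of the third step, followed by an invalid-value error on the host-to-device copy. The repeated 700 on the later calls may be a follow-on effect rather than separate faults, since an illegal-address error typically leaves the CUDA context unusable.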

No errors occur when both the transformer block and the output layer run on the PTX device. Both runs require the if clause for OpenCL and PTX in attention_forward.

otabuzzman commented 3 hours ago

Running the transformer block on the SPIR-V device or on the PTX device yields no error. The output layer ran on PTX in both cases. Both runs require the if clause for SPIR-V and PTX in attention_forward.

otabuzzman commented 2 hours ago

The same behaviour occurs with the split transformer block task graph.
