taichi-dev / taichi

Productive, portable, and performant GPU programming in Python.
https://taichi-lang.org
Apache License 2.0

Advanced optimization #656

Closed xumingkuan closed 2 years ago

xumingkuan commented 4 years ago

Concisely describe the proposed feature: With the new extensions introduced by #581, there is a lot of room to optimize the IR. I also found some feasible optimizations that are not directly related to the new extensions. For example, in this fragment of IR,

...
<f32 x1> $5 = alloca
if $26 {
  ...
} else {
  ...
}
if $26 {
  ...
} else {
  ...
}
<f32 x1> $83 = local load [ [$5[0]]] (the only statement about $5)
...

we could merge the two ifs into one (both test the same condition $26), replace $83 with const [0] because nothing ever stores to $5, and then delete $5.
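To make the idea concrete, here is a minimal sketch of the two rewrites on a toy IR written in Python. It is not Taichi's actual pass infrastructure: the classes `Alloca`, `Store`, `Load`, `Const`, `Op`, `If` and the functions `merge_adjacent_ifs`, `fold_unstored_allocas`, and `optimize` are hypothetical names made up for illustration. The sketch assumes SSA-style values (so a condition such as $26 cannot change between the two ifs), that every load or store pointer names an alloca, and that an alloca with no stores reads as zero, as the example above implies.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical toy IR, loosely modelled on the fragment in the issue.
@dataclass
class Alloca:
    name: str


@dataclass
class Store:
    ptr: str
    value: str


@dataclass
class Load:
    name: str
    ptr: str


@dataclass
class Const:
    name: str
    value: float


@dataclass
class Op:
    text: str  # opaque statement, assumed not to touch any alloca


@dataclass
class If:
    cond: str
    then: List[object] = field(default_factory=list)
    orelse: List[object] = field(default_factory=list)


def walk(block):
    """Yield every statement in a block, including nested branches."""
    for stmt in block:
        yield stmt
        if isinstance(stmt, If):
            yield from walk(stmt.then)
            yield from walk(stmt.orelse)


def merge_adjacent_ifs(block):
    """Fuse consecutive ifs that test the same (immutable) condition."""
    out = []
    for stmt in block:
        if isinstance(stmt, If):
            stmt = If(stmt.cond,
                      merge_adjacent_ifs(stmt.then),
                      merge_adjacent_ifs(stmt.orelse))
            prev = out[-1] if out else None
            if isinstance(prev, If) and prev.cond == stmt.cond:
                # Concatenating branches may expose new adjacent ifs.
                out[-1] = If(prev.cond,
                             merge_adjacent_ifs(prev.then + stmt.then),
                             merge_adjacent_ifs(prev.orelse + stmt.orelse))
                continue
        out.append(stmt)
    return out


def fold_unstored_allocas(block):
    """Turn loads from never-stored allocas into const 0 and drop the alloca."""
    stored = {s.ptr for s in walk(block) if isinstance(s, Store)}
    allocas = {s.name for s in walk(block) if isinstance(s, Alloca)}
    dead = allocas - stored

    def rewrite(stmts):
        out = []
        for s in stmts:
            if isinstance(s, If):
                out.append(If(s.cond, rewrite(s.then), rewrite(s.orelse)))
            elif isinstance(s, Load) and s.ptr in dead:
                out.append(Const(s.name, 0.0))  # assumed zero-initialized
            elif isinstance(s, Alloca) and s.name in dead:
                continue  # all of its loads were folded above
            else:
                out.append(s)
        return out

    return rewrite(block)


def optimize(block):
    return fold_unstored_allocas(merge_adjacent_ifs(block))


fragment = [
    Alloca("$5"),
    If("$26", [Op("a")], [Op("b")]),
    If("$26", [Op("c")], [Op("d")]),
    Load("$83", "$5"),
]
for stmt in optimize(fragment):
    print(stmt)
```

Running the two rewrites in this order leaves one if statement and one constant; a real pass would also need to check that nothing between the two ifs has side effects that the merge would reorder.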

A list of optimizations I have done and am going to do:

Additional comments: For benchmarking, we may want to introduce a temporary boolean flag that switches the new optimizations on and off.
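As a rough illustration of such a switch, the sketch below gates a toy pass behind a boolean and compares statement counts with the flag off and on. Everything here is hypothetical: `CompileConfig`, the field name `advanced_optimization`, `drop_nops`, `compile_kernel`, and `statement_counts` are placeholder names for this example, not existing Taichi code, and the "IR" is just a list of strings.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CompileConfig:
    # Hypothetical flag; the proposal only asks for "a temporary boolean".
    advanced_optimization: bool = True


def drop_nops(stmts: List[str]) -> List[str]:
    """Toy optimization pass: remove statements that do nothing."""
    return [s for s in stmts if s != "nop"]


def compile_kernel(stmts: List[str], config: CompileConfig) -> List[str]:
    """Run the optional passes only when the switch is on."""
    passes = [drop_nops] if config.advanced_optimization else []
    for run_pass in passes:
        stmts = run_pass(stmts)
    return stmts


def statement_counts(stmts: List[str]) -> Dict[bool, int]:
    """Statement count after compilation, keyed by the flag value."""
    return {
        flag: len(compile_kernel(stmts, CompileConfig(advanced_optimization=flag)))
        for flag in (False, True)
    }


print(statement_counts(["load", "nop", "add", "nop", "store"]))
```

Keeping the switch in one config object means a benchmark can compile the same kernel twice and compare the results without touching the passes themselves.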

Some nice slides: https://courses.cs.washington.edu/courses/cse401/08wi/lecture/opt-mark.v2.pdf

xumingkuan commented 4 years ago

I just found that test_tensor_reflection__test_POT has no kernels in it. Figuring out what's wrong...

Maybe we shouldn't print stats in the destructor of Program. We compile some kernels after Program finalizes, so the statistics end up attributed to the wrong test names.
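One possible alternative, sketched here under assumptions rather than taken from the Taichi code base, is to attribute each counter to the running test at the moment it is recorded instead of when Program is destroyed. pytest exposes the running test through the `PYTEST_CURRENT_TEST` environment variable, in the same `file::test (stage)` form as the `current_test:` lines in the log below. The `StatRecorder` class and its methods are hypothetical names for illustration.

```python
import os
from collections import defaultdict


class StatRecorder:
    """Hypothetical recorder that files each counter under the current test."""

    def __init__(self):
        self._stats = defaultdict(lambda: defaultdict(float))

    def add(self, key: str, value: float = 1.0) -> None:
        # Look up the test name now, not later in a destructor.
        test = os.environ.get("PYTEST_CURRENT_TEST", "<outside any test>")
        self._stats[test][key] += value

    def report(self) -> dict:
        """Return plain dicts: {test name: {counter name: total}}."""
        return {test: dict(counters) for test, counters in self._stats.items()}


recorder = StatRecorder()
recorder.add("codegen_statements", 58)
for test_name, counters in recorder.report().items():
    print(test_name, counters)
```

With this shape, a kernel that compiles after one Program finalizes is still counted against whichever test triggered the compilation.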

Look at this log (the Windows fatal exception always appears at the beginning, and it doesn't seem to matter):

C:\Users\xmk\Desktop\taichi\tests\python>pytest -s test_tensor_dimensionality.py
============================================================================== test session starts ===============================================================================
platform win32 -- Python 3.7.4, pytest-5.0.1, py-1.8.0, pluggy-0.12.0
rootdir: C:\Users\xmk\Desktop\taichi\tests\python
plugins: arraydiff-0.3, doctestplus-0.3.0, openfiles-0.3.2, remotedata-0.3.2
collecting ... [Taichi] mode=development
[Taichi] <dev mode>, supported archs: [cpu only], commit d7610347, python 3.7.4
Windows fatal exception: code 0xc0000138

Current thread 0x00002798 (most recent call first):
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 1043 in create_module
  File "<frozen importlib._bootstrap>", line 583 in module_from_spec
  File "<frozen importlib._bootstrap>", line 670 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 967 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 983 in _find_and_load
  File "D:\Anaconda3\lib\site-packages\torch\__init__.py", line 81 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 728 in exec_module
  File "<frozen importlib._bootstrap>", line 677 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 967 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 983 in _find_and_load
  File "C:\Users\xmk\Desktop\taichi\python\taichi\lang\util.py", line 7 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 728 in exec_module
  File "<frozen importlib._bootstrap>", line 677 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 967 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 983 in _find_and_load
  File "C:\Users\xmk\Desktop\taichi\python\taichi\lang\expr.py", line 2 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 728 in exec_module
  File "<frozen importlib._bootstrap>", line 677 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 967 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 983 in _find_and_load
  File "C:\Users\xmk\Desktop\taichi\python\taichi\lang\impl.py", line 3 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 728 in exec_module
  File "<frozen importlib._bootstrap>", line 677 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 967 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 983 in _find_and_load
  File "C:\Users\xmk\Desktop\taichi\python\taichi\lang\__init__.py", line 1 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 728 in exec_module
  File "<frozen importlib._bootstrap>", line 677 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 967 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 983 in _find_and_load
  File "C:\Users\xmk\Desktop\taichi\python\taichi\__init__.py", line 17 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 728 in exec_module
  File "<frozen importlib._bootstrap>", line 677 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 967 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 983 in _find_and_load
  File "C:\Users\xmk\Desktop\taichi\tests\python\test_tensor_dimensionality.py", line 1 in <module>
  File "D:\Anaconda3\lib\site-packages\_pytest\assertion\rewrite.py", line 149 in exec_module
  File "<frozen importlib._bootstrap>", line 677 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 967 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 983 in _find_and_load
  File "D:\Anaconda3\lib\site-packages\py\_path\local.py", line 701 in pyimport
  File "D:\Anaconda3\lib\site-packages\_pytest\python.py", line 498 in _importtestmodule
  File "D:\Anaconda3\lib\site-packages\_pytest\python.py", line 431 in _getobj
  File "D:\Anaconda3\lib\site-packages\_pytest\python.py", line 250 in obj
  File "D:\Anaconda3\lib\site-packages\_pytest\python.py", line 446 in _inject_setup_module_fixture
  File "D:\Anaconda3\lib\site-packages\_pytest\python.py", line 434 in collect
  File "D:\Anaconda3\lib\site-packages\_pytest\runner.py", line 247 in <lambda>
  File "D:\Anaconda3\lib\site-packages\_pytest\runner.py", line 220 in from_call
  File "D:\Anaconda3\lib\site-packages\_pytest\runner.py", line 247 in pytest_make_collect_report
  File "D:\Anaconda3\lib\site-packages\pluggy\callers.py", line 187 in _multicall
  File "D:\Anaconda3\lib\site-packages\pluggy\manager.py", line 81 in <lambda>
  File "D:\Anaconda3\lib\site-packages\pluggy\manager.py", line 87 in _hookexec
  File "D:\Anaconda3\lib\site-packages\pluggy\hooks.py", line 289 in __call__
  File "D:\Anaconda3\lib\site-packages\_pytest\runner.py", line 368 in collect_one_node
  File "D:\Anaconda3\lib\site-packages\_pytest\main.py", line 717 in genitems
  File "D:\Anaconda3\lib\site-packages\_pytest\main.py", line 498 in _perform_collect
  File "D:\Anaconda3\lib\site-packages\_pytest\main.py", line 459 in perform_collect
  File "D:\Anaconda3\lib\site-packages\_pytest\main.py", line 266 in pytest_collection
  File "D:\Anaconda3\lib\site-packages\pluggy\callers.py", line 187 in _multicall
  File "D:\Anaconda3\lib\site-packages\pluggy\manager.py", line 81 in <lambda>
  File "D:\Anaconda3\lib\site-packages\pluggy\manager.py", line 87 in _hookexec
  File "D:\Anaconda3\lib\site-packages\pluggy\hooks.py", line 289 in __call__
  File "D:\Anaconda3\lib\site-packages\_pytest\main.py", line 256 in _main
  File "D:\Anaconda3\lib\site-packages\_pytest\main.py", line 213 in wrap_session
  File "D:\Anaconda3\lib\site-packages\_pytest\main.py", line 250 in pytest_cmdline_main
  File "D:\Anaconda3\lib\site-packages\pluggy\callers.py", line 187 in _multicall
  File "D:\Anaconda3\lib\site-packages\pluggy\manager.py", line 81 in <lambda>
  File "D:\Anaconda3\lib\site-packages\pluggy\manager.py", line 87 in _hookexec
  File "D:\Anaconda3\lib\site-packages\pluggy\hooks.py", line 289 in __call__
  File "D:\Anaconda3\lib\site-packages\_pytest\config\__init__.py", line 74 in main
  File "D:\Anaconda3\Scripts\pytest-script.py", line 10 in <module>
collected 1 item

test_tensor_dimensionality.py Running test on arch=Arch.x64
[T 04/22/20 19:03:48.499] [program.cpp:taichi::lang::Program::Program@47] Program initializing...
[T 04/22/20 19:03:48.500] [memory_pool.cpp:taichi::lang::MemoryPool::MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 19:03:48.501] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 19:03:48.502] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::get_this_thread_data@615] Creating thread local data for thread 10136
[T 04/22/20 19:03:48.503] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 19:03:48.503] [program.cpp:taichi::lang::Program::Program@133] Program (0x2185fb6ad10) arch=x64 initialized.
[T 04/22/20 19:03:48.511] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 19:03:49.220] [llvm_context.cpp:taichi::lang::compile_runtime_bitcode@137] Compiling runtime module bitcode...
[T 04/22/20 19:03:49.617] [llvm_context.cpp:taichi::lang::compile_runtime_bitcode@152] runtime module bitcode compiled.
[T 04/22/20 19:03:49.892] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 19:03:49.893] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@62] Memory allocated. Allocation time = 0.000742 s
[T 04/22/20 19:03:49.894] [program.cpp:taichi::lang::Program::initialize_runtime_system@199] Allocating data structure of size 32 B
[T 04/22/20 19:03:49.916] [program.cpp:taichi::lang::Program::initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 19:03:49.916] [program.cpp:taichi::lang::Program::initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 19:03:49.947] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 0
[D 04/22/20 19:03:49.948] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:49.951] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003a9000
[D 04/22/20 19:03:49.963] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 1
[D 04/22/20 19:03:49.964] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:49.965] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003ac000
[D 04/22/20 19:03:49.978] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 2
[D 04/22/20 19:03:49.980] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:49.981] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003af000
[D 04/22/20 19:03:49.987] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 3
[D 04/22/20 19:03:49.987] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:49.988] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003b2000
[D 04/22/20 19:03:49.990] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 4
[D 04/22/20 19:03:49.990] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 3145728 B (alignment 4096B)
[D 04/22/20 19:03:49.991] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003b5000
[T 04/22/20 19:03:50.003] [program.cpp:taichi::lang::Program::materialize_layout@273] materialize_layout called
[T 04/22/20 19:03:50.034] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\kernel.py:__call__@414] Compiling kernel fill_c4_0...
Running test on arch=Arch.x64
[T 04/22/20 19:03:50.118] [program.cpp:taichi::lang::Program::finalize@481] Program finalizing...
current_test: test_tensor_dimensionality.py::test_dimensionality (call)
codegen_offloaded_tasks: 3.00
codegen_statements  : 58.00
[T 04/22/20 19:03:50.127] [program.cpp:taichi::lang::Program::finalize@514] Program (0x2185fb6ad10) finalized.
[T 04/22/20 19:03:50.132] [program.cpp:taichi::lang::Program::Program@47] Program initializing...
[T 04/22/20 19:03:50.133] [memory_pool.cpp:taichi::lang::MemoryPool::MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 19:03:50.134] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 19:03:50.134] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::get_this_thread_data@615] Creating thread local data for thread 10136
[T 04/22/20 19:03:50.135] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 19:03:50.135] [program.cpp:taichi::lang::Program::Program@133] Program (0x2185fb66010) arch=x64 initialized.
[T 04/22/20 19:03:50.141] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 19:03:50.397] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 19:03:50.398] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@62] Memory allocated. Allocation time = 0.000711 s
[T 04/22/20 19:03:50.398] [program.cpp:taichi::lang::Program::initialize_runtime_system@199] Allocating data structure of size 64 B
[T 04/22/20 19:03:50.423] [program.cpp:taichi::lang::Program::initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 19:03:50.423] [program.cpp:taichi::lang::Program::initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 19:03:50.443] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 0
[D 04/22/20 19:03:50.444] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:50.444] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003a9000
[D 04/22/20 19:03:50.451] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 1
[D 04/22/20 19:03:50.451] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:50.454] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003ac000
[D 04/22/20 19:03:50.456] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 2
[D 04/22/20 19:03:50.458] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:50.459] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003af000
[D 04/22/20 19:03:50.461] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 3
[D 04/22/20 19:03:50.461] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:50.462] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003b2000
[D 04/22/20 19:03:50.466] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 4
[D 04/22/20 19:03:50.466] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 3145728 B (alignment 4096B)
[D 04/22/20 19:03:50.469] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003b5000
[T 04/22/20 19:03:50.483] [program.cpp:taichi::lang::Program::materialize_layout@273] materialize_layout called
[T 04/22/20 19:03:50.525] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\kernel.py:__call__@414] Compiling kernel fill_c6_0...
Running test on arch=Arch.x64
[T 04/22/20 19:03:50.610] [program.cpp:taichi::lang::Program::finalize@481] Program finalizing...
current_test: test_tensor_dimensionality.py::test_dimensionality (call)
codegen_offloaded_tasks: 3.00
codegen_statements  : 74.00
[T 04/22/20 19:03:50.617] [program.cpp:taichi::lang::Program::finalize@514] Program (0x2185fb66010) finalized.
[T 04/22/20 19:03:50.622] [program.cpp:taichi::lang::Program::Program@47] Program initializing...
[T 04/22/20 19:03:50.622] [memory_pool.cpp:taichi::lang::MemoryPool::MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 19:03:50.623] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 19:03:50.627] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::get_this_thread_data@615] Creating thread local data for thread 10136
[T 04/22/20 19:03:50.628] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 19:03:50.628] [program.cpp:taichi::lang::Program::Program@133] Program (0x2185fb68b60) arch=x64 initialized.
[T 04/22/20 19:03:50.635] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 19:03:50.880] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 19:03:50.881] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@62] Memory allocated. Allocation time = 0.000716 s
[T 04/22/20 19:03:50.883] [program.cpp:taichi::lang::Program::initialize_runtime_system@199] Allocating data structure of size 128 B
[T 04/22/20 19:03:50.905] [program.cpp:taichi::lang::Program::initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 19:03:50.905] [program.cpp:taichi::lang::Program::initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 19:03:50.926] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 0
[D 04/22/20 19:03:50.927] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:50.928] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003a9000
[D 04/22/20 19:03:50.934] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 1
[D 04/22/20 19:03:50.935] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:50.937] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003ac000
[D 04/22/20 19:03:50.938] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 2
[D 04/22/20 19:03:50.942] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:50.943] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003af000
[D 04/22/20 19:03:50.944] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 3
[D 04/22/20 19:03:50.945] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:50.947] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003b2000
[D 04/22/20 19:03:50.949] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 4
[D 04/22/20 19:03:50.952] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 3145728 B (alignment 4096B)
[D 04/22/20 19:03:50.953] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003b5000
[T 04/22/20 19:03:50.963] [program.cpp:taichi::lang::Program::materialize_layout@273] materialize_layout called
[T 04/22/20 19:03:50.995] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\kernel.py:__call__@414] Compiling kernel fill_c8_0...
Running test on arch=Arch.x64
[T 04/22/20 19:03:51.075] [program.cpp:taichi::lang::Program::finalize@481] Program finalizing...
current_test: test_tensor_dimensionality.py::test_dimensionality (call)
codegen_offloaded_tasks: 3.00
codegen_statements  : 90.00
[T 04/22/20 19:03:51.083] [program.cpp:taichi::lang::Program::finalize@514] Program (0x2185fb68b60) finalized.
[T 04/22/20 19:03:51.089] [program.cpp:taichi::lang::Program::Program@47] Program initializing...
[T 04/22/20 19:03:51.089] [memory_pool.cpp:taichi::lang::MemoryPool::MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 19:03:51.090] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 19:03:51.096] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::get_this_thread_data@615] Creating thread local data for thread 10136
[T 04/22/20 19:03:51.097] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 19:03:51.101] [program.cpp:taichi::lang::Program::Program@133] Program (0x2185fb6b1e0) arch=x64 initialized.
[T 04/22/20 19:03:51.106] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 19:03:51.341] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 19:03:51.341] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@62] Memory allocated. Allocation time = 0.000685 s
[T 04/22/20 19:03:51.342] [program.cpp:taichi::lang::Program::initialize_runtime_system@199] Allocating data structure of size 256 B
[T 04/22/20 19:03:51.360] [program.cpp:taichi::lang::Program::initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 19:03:51.361] [program.cpp:taichi::lang::Program::initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 19:03:51.387] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 0
[D 04/22/20 19:03:51.388] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:51.389] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003a9000
[D 04/22/20 19:03:51.391] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 1
[D 04/22/20 19:03:51.392] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:51.392] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003ac000
[D 04/22/20 19:03:51.394] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 2
[D 04/22/20 19:03:51.395] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:51.396] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003af000
[D 04/22/20 19:03:51.397] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 3
[D 04/22/20 19:03:51.398] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:51.399] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003b2000
[D 04/22/20 19:03:51.404] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 4
[D 04/22/20 19:03:51.405] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 3145728 B (alignment 4096B)
[D 04/22/20 19:03:51.405] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003b5000
[T 04/22/20 19:03:51.416] [program.cpp:taichi::lang::Program::materialize_layout@273] materialize_layout called
[T 04/22/20 19:03:51.448] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\kernel.py:__call__@414] Compiling kernel fill_c10_0...
Running test on arch=Arch.x64
[T 04/22/20 19:03:51.541] [program.cpp:taichi::lang::Program::finalize@481] Program finalizing...
current_test: test_tensor_dimensionality.py::test_dimensionality (call)
codegen_offloaded_tasks: 3.00
codegen_statements  : 106.00
[T 04/22/20 19:03:51.550] [program.cpp:taichi::lang::Program::finalize@514] Program (0x2185fb6b1e0) finalized.
[T 04/22/20 19:03:51.556] [program.cpp:taichi::lang::Program::Program@47] Program initializing...
[T 04/22/20 19:03:51.557] [memory_pool.cpp:taichi::lang::MemoryPool::MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 19:03:51.558] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 19:03:51.558] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::get_this_thread_data@615] Creating thread local data for thread 10136
[T 04/22/20 19:03:51.559] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 19:03:51.559] [program.cpp:taichi::lang::Program::Program@133] Program (0x2185fb68b60) arch=x64 initialized.
[T 04/22/20 19:03:51.565] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 19:03:51.808] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 19:03:51.808] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@62] Memory allocated. Allocation time = 0.000695 s
[T 04/22/20 19:03:51.809] [program.cpp:taichi::lang::Program::initialize_runtime_system@199] Allocating data structure of size 512 B
[T 04/22/20 19:03:51.825] [program.cpp:taichi::lang::Program::initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 19:03:51.826] [program.cpp:taichi::lang::Program::initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 19:03:51.843] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 0
[D 04/22/20 19:03:51.844] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:51.849] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003a9000
[D 04/22/20 19:03:51.861] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 1
[D 04/22/20 19:03:51.863] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:51.864] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003ac000
[D 04/22/20 19:03:51.877] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 2
[D 04/22/20 19:03:51.879] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:51.880] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003af000
[D 04/22/20 19:03:51.885] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 3
[D 04/22/20 19:03:51.885] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:51.886] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003b2000
[D 04/22/20 19:03:51.888] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 4
[D 04/22/20 19:03:51.889] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 3145728 B (alignment 4096B)
[D 04/22/20 19:03:51.890] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003b5000
[T 04/22/20 19:03:51.907] [program.cpp:taichi::lang::Program::materialize_layout@273] materialize_layout called
[T 04/22/20 19:03:51.944] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\kernel.py:__call__@414] Compiling kernel fill_c12_0...
Running test on arch=Arch.x64
[T 04/22/20 19:03:52.027] [program.cpp:taichi::lang::Program::finalize@481] Program finalizing...
current_test: test_tensor_dimensionality.py::test_dimensionality (call)
codegen_offloaded_tasks: 3.00
codegen_statements  : 122.00
[T 04/22/20 19:03:52.035] [program.cpp:taichi::lang::Program::finalize@514] Program (0x2185fb68b60) finalized.
[T 04/22/20 19:03:52.040] [program.cpp:taichi::lang::Program::Program@47] Program initializing...
[T 04/22/20 19:03:52.041] [memory_pool.cpp:taichi::lang::MemoryPool::MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 19:03:52.042] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 19:03:52.048] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::get_this_thread_data@615] Creating thread local data for thread 10136
[T 04/22/20 19:03:52.049] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 19:03:52.053] [program.cpp:taichi::lang::Program::Program@133] Program (0x2185fb6a370) arch=x64 initialized.
[T 04/22/20 19:03:52.059] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 19:03:52.300] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 19:03:52.301] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@62] Memory allocated. Allocation time = 0.000793 s
[T 04/22/20 19:03:52.301] [program.cpp:taichi::lang::Program::initialize_runtime_system@199] Allocating data structure of size 1024 B
[T 04/22/20 19:03:52.319] [program.cpp:taichi::lang::Program::initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 19:03:52.320] [program.cpp:taichi::lang::Program::initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 19:03:52.338] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 0
[D 04/22/20 19:03:52.339] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:52.340] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003a9000
[D 04/22/20 19:03:52.346] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 1
[D 04/22/20 19:03:52.347] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:52.349] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003ac000
[D 04/22/20 19:03:52.351] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 2
[D 04/22/20 19:03:52.355] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:52.356] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003af000
[D 04/22/20 19:03:52.358] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 3
[D 04/22/20 19:03:52.359] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:52.360] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003b2000
[D 04/22/20 19:03:52.366] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 4
[D 04/22/20 19:03:52.367] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 3145728 B (alignment 4096B)
[D 04/22/20 19:03:52.367] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003b5000
[T 04/22/20 19:03:52.381] [program.cpp:taichi::lang::Program::materialize_layout@273] materialize_layout called
[T 04/22/20 19:03:52.421] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\kernel.py:__call__@414] Compiling kernel fill_c14_0...
Running test on arch=Arch.x64
[T 04/22/20 19:03:52.508] [program.cpp:taichi::lang::Program::finalize@481] Program finalizing...
current_test: test_tensor_dimensionality.py::test_dimensionality (call)
codegen_offloaded_tasks: 3.00
codegen_statements  : 138.00
[T 04/22/20 19:03:52.516] [program.cpp:taichi::lang::Program::finalize@514] Program (0x2185fb6a370) finalized.
[T 04/22/20 19:03:52.521] [program.cpp:taichi::lang::Program::Program@47] Program initializing...
[T 04/22/20 19:03:52.521] [memory_pool.cpp:taichi::lang::MemoryPool::MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 19:03:52.523] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 19:03:52.527] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::get_this_thread_data@615] Creating thread local data for thread 10136
[T 04/22/20 19:03:52.530] [llvm_context.cpp:taichi::lang::TaichiLLVMContext::TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 19:03:52.530] [program.cpp:taichi::lang::Program::Program@133] Program (0x2185fb68b60) arch=x64 initialized.
[T 04/22/20 19:03:52.540] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 19:03:52.779] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 19:03:52.780] [unified_allocator.cpp:taichi::lang::UnifiedAllocator::UnifiedAllocator@62] Memory allocated. Allocation time = 0.000759 s
[T 04/22/20 19:03:52.781] [program.cpp:taichi::lang::Program::initialize_runtime_system@199] Allocating data structure of size 2048 B
[T 04/22/20 19:03:52.797] [program.cpp:taichi::lang::Program::initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 19:03:52.797] [program.cpp:taichi::lang::Program::initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 19:03:52.818] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 0
[D 04/22/20 19:03:52.820] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:52.820] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003a9000
[D 04/22/20 19:03:52.834] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 1
[D 04/22/20 19:03:52.835] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:52.836] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003ac000
[D 04/22/20 19:03:52.849] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 2
[D 04/22/20 19:03:52.851] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:52.852] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003af000
[D 04/22/20 19:03:52.865] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 3
[D 04/22/20 19:03:52.866] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 8232 B (alignment 4096B)
[D 04/22/20 19:03:52.868] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003b2000
[D 04/22/20 19:03:52.874] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@104] Processing memory alloc request 4
[D 04/22/20 19:03:52.875] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@111]   Allocating memory 3145728 B (alignment 4096B)
[D 04/22/20 19:03:52.876] [memory_pool.cpp:taichi::lang::MemoryPool::daemon@113]   Allocated. Ptr = 0x218003b5000
[T 04/22/20 19:03:52.891] [program.cpp:taichi::lang::Program::materialize_layout@273] materialize_layout called
[T 04/22/20 19:03:52.934] [C:\Users\xmk\Desktop\taichi\python\taichi\lang\kernel.py:__call__@414] Compiling kernel fill_c16_0...
.

============================================================================ 1 passed in 5.47 seconds ============================================================================

The statistics of the final kernel, fill_c16_0, are mistakenly written to test_tensor_reflection__test_POT when running ti test.
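A rough way to see this dislocation in a trace log is to group each compiled kernel with the next Program finalize that follows it. This is only a sketch: it assumes the log contains the substrings "Compiling kernel " and "Program finalizing..." as in the output above, and the function name kernels_per_finalize is made up for illustration.

```python
def kernels_per_finalize(log_text):
    """Group 'Compiling kernel' lines by the next 'Program finalizing...' line.

    Returns (groups, leftover): groups[i] holds the kernels compiled before
    the i-th finalize; leftover holds kernels with no finalize after them.
    """
    groups = []   # kernels covered by each finalize, in log order
    pending = []  # kernels compiled since the last finalize
    for line in log_text.splitlines():
        if "Compiling kernel " in line:
            # "Compiling kernel fill_c16_0..." -> "fill_c16_0"
            name = line.split("Compiling kernel ", 1)[1].strip().rstrip(".")
            pending.append(name)
        elif "Program finalizing..." in line:
            groups.append(pending)
            pending = []
    return groups, pending
```

Any kernel that ends up in leftover was compiled by a Program that never printed its statistics within that log, so its numbers can only show up later, under whatever test is current at that point.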

xumingkuan commented 4 years ago

@yuanming-hu do you have any ideas on how to fix this? Having tests and statistics mismatched looks bad.

yuanming-hu commented 4 years ago

We compile some kernels after Program finalizes, so we've got some dislocations on statistics and test names.

Interesting - I think the real issue here is that kernels should always be compiled while a program exists. I'm looking into test_tensor_dimensionality - do you have any idea how a kernel could be compiled without a program?

xumingkuan commented 4 years ago

Oh, the kernel is compiled with a program, but it seems that the destructor of the last Program is never called.

yuanming-hu commented 4 years ago

Maybe that is because Windows fatal exception: code 0xc0000138 happens before the program's dtor? Note that the outputs are not guaranteed to be sorted in wall-clock time when you have multiple output streams...
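To illustrate the point about ordering, here is a toy model (not how Taichi or pytest actually route their output): one stream holds its writes until it is flushed, the other writes through immediately, and both end up in the same "terminal". The class names are hypothetical.

```python
class BufferedStream:
    """Toy stream that holds writes until flush(), like a block-buffered pipe."""

    def __init__(self, sink):
        self.sink = sink
        self.pending = []

    def write(self, text):
        self.pending.append(text)

    def flush(self):
        self.sink.extend(self.pending)
        self.pending.clear()


class UnbufferedStream:
    """Toy stream that forwards every write immediately."""

    def __init__(self, sink):
        self.sink = sink

    def write(self, text):
        self.sink.append(text)


terminal = []                      # what the user would see, in display order
out = BufferedStream(terminal)     # stands in for a buffered stream
err = UnbufferedStream(terminal)   # stands in for an unbuffered stream

out.write("Program finalizing...")    # happens first in wall-clock time
err.write("Windows fatal exception")  # happens second
out.flush()                           # buffered text only appears now
```

In this model the exception message is displayed before the finalize message even though it was written later, so the position of a line in a merged log does not by itself prove when it happened.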

xumingkuan commented 4 years ago

If I use ti test -v rather than pytest -s, the Windows fatal exception doesn't happen. Would you please print something in Program's constructor and destructor to double-check whether the last Program is destructed?

yuanming-hu commented 4 years ago

Setting the environment variable TI_LOG_LEVEL=trace will do the job.
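For example, a small wrapper can set the variable for a child process only. This is a sketch: TI_LOG_LEVEL=trace comes from the comment above, while the helper names trace_env and run_with_trace_logging are hypothetical.

```python
import os
import subprocess


def trace_env(base=None):
    """Return a copy of the environment with TI_LOG_LEVEL set to 'trace'."""
    env = dict(os.environ if base is None else base)
    env["TI_LOG_LEVEL"] = "trace"
    return env


def run_with_trace_logging(args):
    """Run a command (e.g. pytest) with trace-level Taichi logging enabled."""
    return subprocess.run(args, env=trace_env(), capture_output=True, text=True)
```

Calling run_with_trace_logging([sys.executable, "-m", "pytest", "-s", "test_tensor_dimensionality.py"]) (after import sys) would then capture the trace output without changing the environment of the parent shell.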

xumingkuan commented 4 years ago

With TI_LOG_LEVEL=trace, it looks like each test ends with Compiling kernel ... rather than Program finalizing...

xumingkuan commented 4 years ago

BTW, the number of Program constructions and destructions can be greater than the number of tests in files like test_tensor_dimensionality.py. However, we only record the statistics of one of those Programs.
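A quick way to check this on a trace log is to count the relevant lines. The sketch below assumes the log contains the substrings "Running test on arch=", "Program initializing..." and "Program finalizing..." as in the trace output in this thread; count_program_lifecycle is a made-up name.

```python
def count_program_lifecycle(log_text):
    """Count per-arch test runs and Program constructions/destructions in a log."""
    counts = {"arch_runs": 0, "initialized": 0, "finalized": 0}
    for line in log_text.splitlines():
        if "Running test on arch=" in line:
            counts["arch_runs"] += 1
        elif "Program initializing..." in line:
            counts["initialized"] += 1
        elif "Program finalizing..." in line:
            counts["finalized"] += 1
    return counts
```

If initialized is larger than the number of tests pytest reports, several Programs share one test name, and if finalized is smaller than initialized, at least one Program never printed its statistics in that log.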

yuanming-hu commented 4 years ago

On Linux it looks fine to me:

pytest -s test_tensor_dimensionality.py
=================================================== test session starts ====================================================
platform linux -- Python 3.6.9, pytest-5.4.1, py-1.8.0, pluggy-0.13.0
rootdir: /home/yuanming/repos/taichi/tests/python
plugins: arraydiff-0.3, forked-1.1.3, remotedata-0.3.2, xdist-1.31.0, doctestplus-0.4.0, openfiles-0.4.0
collecting ... [Taichi] mode=development
[Taichi] preparing sandbox at /tmp/taichi-le6az2sv
[Taichi] sandbox prepared
[T 04/22/20 20:17:34.188] [cuda_driver.cpp:CUDADriver@44] CUDA driver API (v10.2) loaded.
[Taichi] <dev mode>, supported archs: [cpu, cuda, opengl], commit 01391957, python 3.6.9
collected 1 item                                                                                                           

test_tensor_dimensionality.py Running test on arch=Arch.x64
[T 04/22/20 20:17:34.399] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:34.399] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:34.399] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:34.399] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:34.400] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:34.400] [program.cpp:Program@133] Program (0x56273a7f02d0) arch=x64 initialized.
[T 04/22/20 20:17:34.401] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:34.406] [llvm_context.cpp:compile_runtime_bitcode@137] Compiling runtime module bitcode...
[T 04/22/20 20:17:34.622] [llvm_context.cpp:compile_runtime_bitcode@152] runtime module bitcode compiled.
[T 04/22/20 20:17:34.731] [unified_allocator.cpp:UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 20:17:34.731] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 3.29e-05 s
[T 04/22/20 20:17:34.731] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 32 B
[T 04/22/20 20:17:34.739] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:34.739] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:34.749] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:34.749] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:34.749] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f4170974000
[D 04/22/20 20:17:34.750] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:34.750] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:34.750] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f4170977000
[D 04/22/20 20:17:34.751] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:34.751] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:34.751] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f417097a000
[D 04/22/20 20:17:34.752] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:34.752] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:34.752] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f417097d000
[D 04/22/20 20:17:34.753] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:34.753] [memory_pool.cpp:daemon@111]   Allocating memory 3145728 B (alignment 4096B) 
[D 04/22/20 20:17:34.753] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f4170980000
[T 04/22/20 20:17:34.758] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:34.772] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c4_0...
Running test on arch=Arch.cuda
[T 04/22/20 20:17:34.814] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:34.815] [program.cpp:finalize@512] Program (0x56273a7f02d0) finalized.
[T 04/22/20 20:17:34.817] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:34.829] [cuda_context.cpp:CUDAContext@25] Using CUDA device [id=0]: GeForce GTX 1080 Ti
[T 04/22/20 20:17:34.829] [cuda_context.cpp:CUDAContext@33] CUDA Device Compute Capability: 6.1
[T 04/22/20 20:17:34.875] [cuda_context.cpp:CUDAContext@38] Total memory 10.91 GB; free memory 9.74 GB
[T 04/22/20 20:17:34.875] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:34.876] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:34.876] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:34.876] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:34.876] [program.cpp:Program@133] Program (0x56273abbbac0) arch=cuda initialized.
[T 04/22/20 20:17:34.877] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:34.986] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:34.986] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: cuda
[T 04/22/20 20:17:34.986] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:34.986] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:34.995] [llvm_context.cpp:compile_runtime_bitcode@137] Compiling runtime module bitcode...
[T 04/22/20 20:17:35.218] [llvm_context.cpp:compile_runtime_bitcode@152] runtime module bitcode compiled.
[T 04/22/20 20:17:35.414] [jit_cuda.cpp:add_module@90] PTX size: 82.11KB
[T 04/22/20 20:17:35.414] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:35.419] [jit_cuda.cpp:add_module@97] CUDA module load time : 4.472970962524414ms
[T 04/22/20 20:17:35.419] [unified_allocator.cpp:UnifiedAllocator@26] Allocating unified (CPU+GPU) address space of size 1024 MB
[T 04/22/20 20:17:35.419] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 6.01e-05 s
[T 04/22/20 20:17:35.419] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 32 B
[T 04/22/20 20:17:35.428] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:35.428] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:35.437] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:35.437] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:35.437] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143a9000
[D 04/22/20 20:17:35.439] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:35.439] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:35.439] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143ac000
[D 04/22/20 20:17:35.440] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:35.440] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:35.440] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143af000
[D 04/22/20 20:17:35.441] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:35.441] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:35.441] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143b2000
[D 04/22/20 20:17:35.442] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:35.442] [memory_pool.cpp:daemon@111]   Allocating memory 3145728 B (alignment 4096B) 
[D 04/22/20 20:17:35.442] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143b5000
[T 04/22/20 20:17:35.460] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c6_0...
[T 04/22/20 20:17:35.646] [jit_cuda.cpp:add_module@90] PTX size: 83.62KB
[T 04/22/20 20:17:35.646] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:35.647] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.4448890686035156ms
[D 04/22/20 20:17:35.647] [codegen_cuda.cpp:operator()@84] Launching kernel fill_c6_0_kernel_4_range_for<<<896, 64>>>
[T 04/22/20 20:17:35.647] [jit_cuda.cpp:lookup_function@47] Kernel fill_c6_0_kernel_4_range_for compilation time: 0.0011920928955078125ms
Running test on arch=Arch.opengl
[T 04/22/20 20:17:35.661] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:35.661] [program.cpp:finalize@512] Program (0x56273abbbac0) finalized.
[T 04/22/20 20:17:35.670] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:35.670] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:35.670] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:35.670] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:35.670] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:35.670] [program.cpp:Program@133] Program (0x56273abbbac0) arch=opengl initialized.
[T 04/22/20 20:17:35.671] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:35.780] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:35.780] [program.cpp:materialize_layout@298] OpenGL root buffer size: 32 B
[T 04/22/20 20:17:35.927] [opengl_extension.inc.h:initialize_opengl@3] [glsl] Found GL_ARB_compute_shader
[T 04/22/20 20:17:35.927] [opengl_extension.inc.h:initialize_opengl@4] [glsl] Found GL_ARB_gpu_shader_int64
[T 04/22/20 20:17:35.927] [opengl_extension.inc.h:initialize_opengl@5] [glsl] Found GL_NV_shader_atomic_float
[T 04/22/20 20:17:35.927] [opengl_extension.inc.h:initialize_opengl@6] [glsl] Found GL_NV_shader_atomic_float64
[T 04/22/20 20:17:35.927] [opengl_extension.inc.h:initialize_opengl@7] [glsl] Found GL_NV_shader_atomic_int64
[T 04/22/20 20:17:35.928] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:35.928] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:35.928] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:35.928] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:35.929] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c8_0...
[D 04/22/20 20:17:35.931] [opengl_api.cpp:display_kernel_info@290] source of kernel [fill_c8_00] * 1:
#version 430 core
#extension GL_ARB_compute_shader: enable
#extension GL_ARB_gpu_shader_int64: enable
#extension GL_NV_shader_atomic_float: enable
#extension GL_NV_shader_atomic_float64: enable
#extension GL_NV_shader_atomic_int64: enable
precision highp float;
layout(packed, binding = 0) buffer data_i32 { int _states_[2]; int _data_i32_[]; };
layout(packed, binding = 0) buffer data_f32 { int _unused1_[2]; float _data_f32_[]; };
layout(packed, binding = 0) buffer data_f64 { int _unused2_[2]; double _data_f64_[]; };

void fill_c8_00()
{ // range for
  // range known at compile time
  int _tid = int(gl_GlobalInvocationID.x);
  if (_tid >= 4) return;
  int _itv = 0 + _tid * 1;
    int G = _itv;
    int H = (((0 + G) >> 1) & ((1 << 1) - 1));
    int L = (((0 + G) >> 0) & ((1 << 1) - 1));
    int O = H + L;
    int S = 0;
    int Gm = 0;
    int U = S + 32 * Gm; // S0
    int V = U + 0; // S1
    int W = (((0 + H) >> 0) & ((1 << 1) - 1));
    int X = (((0 + L) >> 0) & ((1 << 1) - 1));
    int Go = 1;
    int Gp = X * Go;
    int Gq = Gm + Gp;
    int Gr = 2;
    int Gs = W * Gr;
    int Gt = Gq + Gs;
    int Z = V + 8 * Gt; // S1
    int Aq = Z + 0; // S2
    int As = atomicAdd(_data_i32_[Aq >> 2], O);
    int AE = Z + 4; // S3
    int AG = atomicAdd(_data_i32_[AE >> 2], H);
}

void main()
{
  fill_c8_00();
}
layout(local_size_x = 4 /* 1, 4 */, local_size_y = 1, local_size_z = 1) in;

[T 04/22/20 20:17:35.931] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:35.931] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:35.931] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:35.931] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:35.932] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:35.932] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:35.932] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:35.932] [opengl_api.cpp:link@129] glLinkProgram OUT
Running test on arch=Arch.x64
[T 04/22/20 20:17:35.932] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:35.933] [program.cpp:finalize@512] Program (0x56273abbbac0) finalized.
[T 04/22/20 20:17:35.936] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:35.936] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:35.936] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:35.936] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:35.936] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:35.936] [program.cpp:Program@133] Program (0x56273b70b7e0) arch=x64 initialized.
[T 04/22/20 20:17:35.937] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:36.047] [unified_allocator.cpp:UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 20:17:36.047] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 2.22e-05 s
[T 04/22/20 20:17:36.047] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 64 B
[T 04/22/20 20:17:36.055] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:36.055] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:36.064] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:36.064] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:36.064] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143a9000
[D 04/22/20 20:17:36.065] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:36.065] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:36.065] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143ac000
[D 04/22/20 20:17:36.067] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:36.067] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:36.067] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143af000
[D 04/22/20 20:17:36.068] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:36.068] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:36.068] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143b2000
[D 04/22/20 20:17:36.069] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:36.069] [memory_pool.cpp:daemon@111]   Allocating memory 3145728 B (alignment 4096B) 
[D 04/22/20 20:17:36.069] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143b5000
[T 04/22/20 20:17:36.074] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:36.088] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c10_0...
Running test on arch=Arch.cuda
[T 04/22/20 20:17:36.126] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:36.127] [program.cpp:finalize@512] Program (0x56273b70b7e0) finalized.
[T 04/22/20 20:17:36.129] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:36.129] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:36.129] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:36.129] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:36.129] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:36.129] [program.cpp:Program@133] Program (0x56273abbbac0) arch=cuda initialized.
[T 04/22/20 20:17:36.130] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:36.241] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:36.241] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: cuda
[T 04/22/20 20:17:36.241] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:36.241] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:36.438] [jit_cuda.cpp:add_module@90] PTX size: 82.16KB
[T 04/22/20 20:17:36.438] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:36.438] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.4589557647705078ms
[T 04/22/20 20:17:36.439] [unified_allocator.cpp:UnifiedAllocator@26] Allocating unified (CPU+GPU) address space of size 1024 MB
[T 04/22/20 20:17:36.439] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 6.41e-05 s
[T 04/22/20 20:17:36.439] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 64 B
[T 04/22/20 20:17:36.447] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:36.447] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:36.457] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:36.457] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:36.457] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43a9000
[D 04/22/20 20:17:36.458] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:36.458] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:36.458] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43ac000
[D 04/22/20 20:17:36.459] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:36.459] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:36.459] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43af000
[D 04/22/20 20:17:36.460] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:36.460] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:36.460] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43b2000
[D 04/22/20 20:17:36.461] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:36.461] [memory_pool.cpp:daemon@111]   Allocating memory 3145728 B (alignment 4096B) 
[D 04/22/20 20:17:36.461] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43b5000
[T 04/22/20 20:17:36.479] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c12_0...
[T 04/22/20 20:17:36.665] [jit_cuda.cpp:add_module@90] PTX size: 83.76KB
[T 04/22/20 20:17:36.665] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:36.665] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.4951953887939453ms
[D 04/22/20 20:17:36.665] [codegen_cuda.cpp:operator()@84] Launching kernel fill_c12_0_kernel_10_range_for<<<896, 64>>>
[T 04/22/20 20:17:36.665] [jit_cuda.cpp:lookup_function@47] Kernel fill_c12_0_kernel_10_range_for compilation time: 0.00095367431640625ms
Running test on arch=Arch.opengl
[T 04/22/20 20:17:36.680] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:36.680] [program.cpp:finalize@512] Program (0x56273abbbac0) finalized.
[T 04/22/20 20:17:36.689] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:36.689] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:36.689] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:36.689] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:36.689] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:36.689] [program.cpp:Program@133] Program (0x56273b70b7e0) arch=opengl initialized.
[T 04/22/20 20:17:36.690] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:36.802] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:36.802] [program.cpp:materialize_layout@298] OpenGL root buffer size: 64 B
[T 04/22/20 20:17:36.803] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:36.803] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:36.803] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:36.803] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:36.806] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c14_0...
[D 04/22/20 20:17:36.808] [opengl_api.cpp:display_kernel_info@290] source of kernel [fill_c14_00] * 1:
#version 430 core
#extension GL_ARB_compute_shader: enable
#extension GL_ARB_gpu_shader_int64: enable
#extension GL_NV_shader_atomic_float: enable
#extension GL_NV_shader_atomic_float64: enable
#extension GL_NV_shader_atomic_int64: enable
precision highp float;
layout(packed, binding = 0) buffer data_i32 { int _states_[2]; int _data_i32_[]; };
layout(packed, binding = 0) buffer data_f32 { int _unused1_[2]; float _data_f32_[]; };
layout(packed, binding = 0) buffer data_f64 { int _unused2_[2]; double _data_f64_[]; };

void fill_c14_00()
{ // range for
  // range known at compile time
  int _tid = int(gl_GlobalInvocationID.x);
  if (_tid >= 8) return;
  int _itv = 0 + _tid * 1;
    int G = _itv;
    int H = (((0 + G) >> 2) & ((1 << 1) - 1));
    int L = (((0 + G) >> 1) & ((1 << 1) - 1));
    int O = (((0 + G) >> 0) & ((1 << 1) - 1));
    int R = H + L;
    int S = R + O;
    int X = 0;
    int OS = 0;
    int Z = X + 64 * OS; // S0
    int Aq = Z + 0; // S1
    int Ar = (((0 + H) >> 0) & ((1 << 1) - 1));
    int As = (((0 + L) >> 0) & ((1 << 1) - 1));
    int At = (((0 + O) >> 0) & ((1 << 1) - 1));
    int OU = 1;
    int OV = At * OU;
    int OW = OS + OV;
    int OX = 2;
    int OY = As * OX;
    int OZ = OW + OY;
    int P0 = 4;
    int P1 = Ar * P0;
    int P2 = OZ + P1;
    int Av = Aq + 8 * P2; // S1
    int Aw = Av + 0; // S2
    int Ay = atomicAdd(_data_i32_[Aw >> 2], S);
    int AM = Av + 4; // S3
    int AO = atomicAdd(_data_i32_[AM >> 2], H);
}

void main()
{
  fill_c14_00();
}
layout(local_size_x = 8 /* 1, 8 */, local_size_y = 1, local_size_z = 1) in;

[T 04/22/20 20:17:36.808] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:36.808] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:36.808] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:36.808] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:36.808] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:36.808] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:36.808] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:36.808] [opengl_api.cpp:link@129] glLinkProgram OUT
Running test on arch=Arch.x64
[T 04/22/20 20:17:36.809] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:36.809] [program.cpp:finalize@512] Program (0x56273b70b7e0) finalized.
[T 04/22/20 20:17:36.812] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:36.812] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:36.812] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:36.812] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:36.812] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:36.812] [program.cpp:Program@133] Program (0x56273c08bd40) arch=x64 initialized.
[T 04/22/20 20:17:36.813] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:36.925] [unified_allocator.cpp:UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 20:17:36.925] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 1.91e-05 s
[T 04/22/20 20:17:36.925] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 128 B
[T 04/22/20 20:17:36.933] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:36.933] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:36.943] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:36.943] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:36.943] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143a9000
[D 04/22/20 20:17:36.944] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:36.944] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:36.944] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143ac000
[D 04/22/20 20:17:36.945] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:36.945] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:36.945] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143af000
[D 04/22/20 20:17:36.946] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:36.946] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:36.946] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143b2000
[D 04/22/20 20:17:36.947] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:36.947] [memory_pool.cpp:daemon@111]   Allocating memory 3145728 B (alignment 4096B) 
[D 04/22/20 20:17:36.947] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143b5000
[T 04/22/20 20:17:36.952] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:36.967] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c16_0...
Running test on arch=Arch.cuda
[T 04/22/20 20:17:37.007] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:37.008] [program.cpp:finalize@512] Program (0x56273c08bd40) finalized.
[T 04/22/20 20:17:37.010] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:37.010] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:37.010] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:37.010] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:37.010] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:37.010] [program.cpp:Program@133] Program (0x56273c2e8cc0) arch=cuda initialized.
[T 04/22/20 20:17:37.011] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:37.123] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:37.123] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: cuda
[T 04/22/20 20:17:37.123] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:37.123] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:37.316] [jit_cuda.cpp:add_module@90] PTX size: 82.22KB
[T 04/22/20 20:17:37.316] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:37.317] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.4658699035644531ms
[T 04/22/20 20:17:37.317] [unified_allocator.cpp:UnifiedAllocator@26] Allocating unified (CPU+GPU) address space of size 1024 MB
[T 04/22/20 20:17:37.317] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 7.20e-05 s
[T 04/22/20 20:17:37.317] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 128 B
[T 04/22/20 20:17:37.326] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:37.326] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:37.336] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:37.336] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:37.336] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43a9000
[D 04/22/20 20:17:37.337] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:37.337] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:37.337] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43ac000
[D 04/22/20 20:17:37.338] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:37.338] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:37.339] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43af000
[D 04/22/20 20:17:37.340] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:37.340] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:37.340] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43b2000
[D 04/22/20 20:17:37.341] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:37.341] [memory_pool.cpp:daemon@111]   Allocating memory 3145728 B (alignment 4096B) 
[D 04/22/20 20:17:37.341] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43b5000
[T 04/22/20 20:17:37.359] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c18_0...
[T 04/22/20 20:17:37.548] [jit_cuda.cpp:add_module@90] PTX size: 83.82KB
[T 04/22/20 20:17:37.548] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:37.549] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.476837158203125ms
[D 04/22/20 20:17:37.549] [codegen_cuda.cpp:operator()@84] Launching kernel fill_c18_0_kernel_16_range_for<<<896, 64>>>
[T 04/22/20 20:17:37.549] [jit_cuda.cpp:lookup_function@47] Kernel fill_c18_0_kernel_16_range_for compilation time: 0.00095367431640625ms
Running test on arch=Arch.opengl
[T 04/22/20 20:17:37.564] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:37.565] [program.cpp:finalize@512] Program (0x56273c2e8cc0) finalized.
[T 04/22/20 20:17:37.573] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:37.573] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:37.573] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:37.573] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:37.573] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:37.573] [program.cpp:Program@133] Program (0x56273abbbac0) arch=opengl initialized.
[T 04/22/20 20:17:37.574] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:37.687] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:37.687] [program.cpp:materialize_layout@298] OpenGL root buffer size: 128 B
[T 04/22/20 20:17:37.688] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:37.688] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:37.688] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:37.688] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:37.690] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c20_0...
[D 04/22/20 20:17:37.692] [opengl_api.cpp:display_kernel_info@290] source of kernel [fill_c20_00] * 1:
#version 430 core
#extension GL_ARB_compute_shader: enable
#extension GL_ARB_gpu_shader_int64: enable
#extension GL_NV_shader_atomic_float: enable
#extension GL_NV_shader_atomic_float64: enable
#extension GL_NV_shader_atomic_int64: enable
precision highp float;
layout(packed, binding = 0) buffer data_i32 { int _states_[2]; int _data_i32_[]; };
layout(packed, binding = 0) buffer data_f32 { int _unused1_[2]; float _data_f32_[]; };
layout(packed, binding = 0) buffer data_f64 { int _unused2_[2]; double _data_f64_[]; };

void fill_c20_00()
{ // range for
  // range known at compile time
  int _tid = int(gl_GlobalInvocationID.x);
  if (_tid >= 16) return;
  int _itv = 0 + _tid * 1;
    int G = _itv;
    int H = (((0 + G) >> 3) & ((1 << 1) - 1));
    int L = (((0 + G) >> 2) & ((1 << 1) - 1));
    int O = (((0 + G) >> 1) & ((1 << 1) - 1));
    int R = (((0 + G) >> 0) & ((1 << 1) - 1));
    int U = H + L;
    int V = U + O;
    int W = V + R;
    int As = 0;
    int Z2 = 0;
    int Au = As + 128 * Z2; // S0
    int Av = Au + 0; // S1
    int Aw = (((0 + H) >> 0) & ((1 << 1) - 1));
    int Ax = (((0 + L) >> 0) & ((1 << 1) - 1));
    int Ay = (((0 + O) >> 0) & ((1 << 1) - 1));
    int Az = (((0 + R) >> 0) & ((1 << 1) - 1));
    int Z4 = 1;
    int Z5 = Az * Z4;
    int Z6 = Z2 + Z5;
    int Z7 = 2;
    int Z8 = Ay * Z7;
    int Z9 = Z6 + Z8;
    int Za = 4;
    int Zb = Ax * Za;
    int Zc = Z9 + Zb;
    int Zd = 8;
    int Ze = Aw * Zd;
    int Zf = Zc + Ze;
    int AB = Av + 8 * Zf; // S1
    int AC = AB + 0; // S2
    int AE = atomicAdd(_data_i32_[AC >> 2], W);
    int AU = AB + 4; // S3
    int AW = atomicAdd(_data_i32_[AU >> 2], H);
}

void main()
{
  fill_c20_00();
}
layout(local_size_x = 16 /* 1, 16 */, local_size_y = 1, local_size_z = 1) in;

[T 04/22/20 20:17:37.692] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:37.692] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:37.692] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:37.692] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:37.693] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:37.693] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:37.693] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:37.693] [opengl_api.cpp:link@129] glLinkProgram OUT
Running test on arch=Arch.x64
[T 04/22/20 20:17:37.693] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:37.694] [program.cpp:finalize@512] Program (0x56273abbbac0) finalized.
[T 04/22/20 20:17:37.696] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:37.696] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:37.696] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:37.696] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:37.697] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:37.697] [program.cpp:Program@133] Program (0x56273c08bd40) arch=x64 initialized.
[T 04/22/20 20:17:37.698] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:37.810] [unified_allocator.cpp:UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 20:17:37.810] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 3.31e-05 s
[T 04/22/20 20:17:37.810] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 256 B
[T 04/22/20 20:17:37.818] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:37.818] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:37.827] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:37.827] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:37.827] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143a9000
[D 04/22/20 20:17:37.829] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:37.829] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:37.829] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143ac000
[D 04/22/20 20:17:37.830] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:37.830] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:37.830] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143af000
[D 04/22/20 20:17:37.831] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:37.831] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:37.831] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143b2000
[D 04/22/20 20:17:37.832] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:37.832] [memory_pool.cpp:daemon@111]   Allocating memory 3145728 B (alignment 4096B) 
[D 04/22/20 20:17:37.832] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143b5000
[T 04/22/20 20:17:37.837] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:37.852] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c22_0...
Running test on arch=Arch.cuda
[T 04/22/20 20:17:37.894] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:37.894] [program.cpp:finalize@512] Program (0x56273c08bd40) finalized.
[T 04/22/20 20:17:37.896] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:37.896] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:37.896] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:37.897] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:37.897] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:37.897] [program.cpp:Program@133] Program (0x56273b13cd50) arch=cuda initialized.
[T 04/22/20 20:17:37.898] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:38.009] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:38.009] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: cuda
[T 04/22/20 20:17:38.009] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:38.009] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:38.206] [jit_cuda.cpp:add_module@90] PTX size: 82.22KB
[T 04/22/20 20:17:38.206] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:38.207] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.43392181396484375ms
[T 04/22/20 20:17:38.207] [unified_allocator.cpp:UnifiedAllocator@26] Allocating unified (CPU+GPU) address space of size 1024 MB
[T 04/22/20 20:17:38.207] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 6.70e-05 s
[T 04/22/20 20:17:38.207] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 256 B
[T 04/22/20 20:17:38.215] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:38.215] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:38.225] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:38.225] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:38.225] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43a9000
[D 04/22/20 20:17:38.226] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:38.226] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:38.226] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43ac000
[D 04/22/20 20:17:38.227] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:38.227] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:38.227] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43af000
[D 04/22/20 20:17:38.228] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:38.228] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:38.228] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43b2000
[D 04/22/20 20:17:38.229] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:38.229] [memory_pool.cpp:daemon@111]   Allocating memory 3145728 B (alignment 4096B) 
[D 04/22/20 20:17:38.229] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43b5000
[T 04/22/20 20:17:38.247] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c24_0...
[T 04/22/20 20:17:38.432] [jit_cuda.cpp:add_module@90] PTX size: 83.88KB
[T 04/22/20 20:17:38.433] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:38.433] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.4620552062988281ms
[D 04/22/20 20:17:38.433] [codegen_cuda.cpp:operator()@84] Launching kernel fill_c24_0_kernel_22_range_for<<<896, 64>>>
[T 04/22/20 20:17:38.433] [jit_cuda.cpp:lookup_function@47] Kernel fill_c24_0_kernel_22_range_for compilation time: 0.00095367431640625ms
Running test on arch=Arch.opengl
[T 04/22/20 20:17:38.448] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:38.448] [program.cpp:finalize@512] Program (0x56273b13cd50) finalized.
[T 04/22/20 20:17:38.457] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:38.457] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:38.457] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:38.457] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:38.457] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:38.457] [program.cpp:Program@133] Program (0x56273b8e1b00) arch=opengl initialized.
[T 04/22/20 20:17:38.458] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:38.570] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:38.570] [program.cpp:materialize_layout@298] OpenGL root buffer size: 256 B
[T 04/22/20 20:17:38.571] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:38.571] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:38.571] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:38.571] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:38.574] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c26_0...
[D 04/22/20 20:17:38.576] [opengl_api.cpp:display_kernel_info@290] source of kernel [fill_c26_00] * 1:
#version 430 core
#extension GL_ARB_compute_shader: enable
#extension GL_ARB_gpu_shader_int64: enable
#extension GL_NV_shader_atomic_float: enable
#extension GL_NV_shader_atomic_float64: enable
#extension GL_NV_shader_atomic_int64: enable
precision highp float;
layout(packed, binding = 0) buffer data_i32 { int _states_[2]; int _data_i32_[]; };
layout(packed, binding = 0) buffer data_f32 { int _unused1_[2]; float _data_f32_[]; };
layout(packed, binding = 0) buffer data_f64 { int _unused2_[2]; double _data_f64_[]; };

void fill_c26_00()
{ // range for
  // range known at compile time
  int _tid = int(gl_GlobalInvocationID.x);
  if (_tid >= 32) return;
  int _itv = 0 + _tid * 1;
    int G = _itv;
    int H = (((0 + G) >> 4) & ((1 << 1) - 1));
    int L = (((0 + G) >> 3) & ((1 << 1) - 1));
    int O = (((0 + G) >> 2) & ((1 << 1) - 1));
    int R = (((0 + G) >> 1) & ((1 << 1) - 1));
    int U = (((0 + G) >> 0) & ((1 << 1) - 1));
    int X = H + L;
    int Y = X + O;
    int Z = Y + R;
    int Aq = Z + U;
    int Ax = 0;
    int AAQ = 0;
    int Az = Ax + 256 * AAQ; // S0
    int AA = Az + 0; // S1
    int AB = (((0 + H) >> 0) & ((1 << 1) - 1));
    int AC = (((0 + L) >> 0) & ((1 << 1) - 1));
    int AD = (((0 + O) >> 0) & ((1 << 1) - 1));
    int AE = (((0 + R) >> 0) & ((1 << 1) - 1));
    int AF = (((0 + U) >> 0) & ((1 << 1) - 1));
    int AAS = 1;
    int AAT = AF * AAS;
    int AAU = AAQ + AAT;
    int AAV = 2;
    int AAW = AE * AAV;
    int AAX = AAU + AAW;
    int AAY = 4;
    int AAZ = AD * AAY;
    int AB0 = AAX + AAZ;
    int AB1 = 8;
    int AB2 = AC * AB1;
    int AB3 = AB0 + AB2;
    int AB4 = 16;
    int AB5 = AB * AB4;
    int AB6 = AB3 + AB5;
    int AH = AA + 8 * AB6; // S1
    int AI = AH + 0; // S2
    int AK = atomicAdd(_data_i32_[AI >> 2], Aq);
    int B2 = AH + 4; // S3
    int B4 = atomicAdd(_data_i32_[B2 >> 2], H);
}

void main()
{
  fill_c26_00();
}
layout(local_size_x = 32 /* 1, 32 */, local_size_y = 1, local_size_z = 1) in;

[T 04/22/20 20:17:38.576] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:38.576] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:38.576] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:38.576] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:38.577] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:38.577] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:38.577] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:38.577] [opengl_api.cpp:link@129] glLinkProgram OUT
Running test on arch=Arch.x64
[T 04/22/20 20:17:38.578] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:38.579] [program.cpp:finalize@512] Program (0x56273b8e1b00) finalized.
[T 04/22/20 20:17:38.581] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:38.581] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:38.581] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:38.581] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:38.581] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:38.581] [program.cpp:Program@133] Program (0x56273b5dbb30) arch=x64 initialized.
[T 04/22/20 20:17:38.582] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:38.692] [unified_allocator.cpp:UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 20:17:38.692] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 3.29e-05 s
[T 04/22/20 20:17:38.692] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 512 B
[T 04/22/20 20:17:38.701] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:38.701] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:38.710] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:38.710] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:38.710] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143a9000
[D 04/22/20 20:17:38.711] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:38.711] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:38.711] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143ac000
[D 04/22/20 20:17:38.712] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:38.712] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:38.712] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143af000
[D 04/22/20 20:17:38.713] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:38.713] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:38.713] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143b2000
[D 04/22/20 20:17:38.714] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:38.714] [memory_pool.cpp:daemon@111]   Allocating memory 3145728 B (alignment 4096B) 
[D 04/22/20 20:17:38.714] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143b5000
[T 04/22/20 20:17:38.720] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:38.735] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c28_0...
Running test on arch=Arch.cuda
[T 04/22/20 20:17:38.776] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:38.777] [program.cpp:finalize@512] Program (0x56273b5dbb30) finalized.
[T 04/22/20 20:17:38.779] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:38.779] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:38.779] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:38.779] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:38.779] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:38.779] [program.cpp:Program@133] Program (0x56273b2d9800) arch=cuda initialized.
[T 04/22/20 20:17:38.780] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:38.892] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:38.892] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: cuda
[T 04/22/20 20:17:38.892] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:38.892] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:39.085] [jit_cuda.cpp:add_module@90] PTX size: 82.22KB
[T 04/22/20 20:17:39.085] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:39.085] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.46515464782714844ms
[T 04/22/20 20:17:39.086] [unified_allocator.cpp:UnifiedAllocator@26] Allocating unified (CPU+GPU) address space of size 1024 MB
[T 04/22/20 20:17:39.086] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 8.51e-05 s
[T 04/22/20 20:17:39.086] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 512 B
[T 04/22/20 20:17:39.094] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:39.094] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:39.104] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:39.104] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:39.104] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43a9000
[D 04/22/20 20:17:39.105] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:39.105] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:39.105] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43ac000
[D 04/22/20 20:17:39.106] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:39.106] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:39.106] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43af000
[D 04/22/20 20:17:39.107] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:39.107] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:39.107] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43b2000
[D 04/22/20 20:17:39.108] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:39.108] [memory_pool.cpp:daemon@111]   Allocating memory 3145728 B (alignment 4096B) 
[D 04/22/20 20:17:39.108] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43b5000
[T 04/22/20 20:17:39.127] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c30_0...
[T 04/22/20 20:17:39.317] [jit_cuda.cpp:add_module@90] PTX size: 83.93KB
[T 04/22/20 20:17:39.317] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:39.317] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.47397613525390625ms
[D 04/22/20 20:17:39.317] [codegen_cuda.cpp:operator()@84] Launching kernel fill_c30_0_kernel_28_range_for<<<896, 64>>>
[T 04/22/20 20:17:39.317] [jit_cuda.cpp:lookup_function@47] Kernel fill_c30_0_kernel_28_range_for compilation time: 0.00095367431640625ms
Running test on arch=Arch.opengl
[T 04/22/20 20:17:39.334] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:39.335] [program.cpp:finalize@512] Program (0x56273b2d9800) finalized.
[T 04/22/20 20:17:39.355] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:39.355] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:39.355] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:39.355] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:39.355] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:39.355] [program.cpp:Program@133] Program (0x56273b541240) arch=opengl initialized.
[T 04/22/20 20:17:39.357] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:39.473] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:39.473] [program.cpp:materialize_layout@298] OpenGL root buffer size: 512 B
[T 04/22/20 20:17:39.474] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:39.474] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:39.474] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:39.474] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:39.477] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c32_0...
[D 04/22/20 20:17:39.480] [opengl_api.cpp:display_kernel_info@290] source of kernel [fill_c32_00] * 1:
#version 430 core
#extension GL_ARB_compute_shader: enable
#extension GL_ARB_gpu_shader_int64: enable
#extension GL_NV_shader_atomic_float: enable
#extension GL_NV_shader_atomic_float64: enable
#extension GL_NV_shader_atomic_int64: enable
precision highp float;
layout(packed, binding = 0) buffer data_i32 { int _states_[2]; int _data_i32_[]; };
layout(packed, binding = 0) buffer data_f32 { int _unused1_[2]; float _data_f32_[]; };
layout(packed, binding = 0) buffer data_f64 { int _unused2_[2]; double _data_f64_[]; };

void fill_c32_00()
{ // range for
  // range known at compile time
  int _tid = int(gl_GlobalInvocationID.x);
  if (_tid >= 64) return;
  int _itv = 0 + _tid * 1;
    int G = _itv;
    int H = (((0 + G) >> 5) & ((1 << 1) - 1));
    int L = (((0 + G) >> 4) & ((1 << 1) - 1));
    int O = (((0 + G) >> 3) & ((1 << 1) - 1));
    int R = (((0 + G) >> 2) & ((1 << 1) - 1));
    int U = (((0 + G) >> 1) & ((1 << 1) - 1));
    int X = (((0 + G) >> 0) & ((1 << 1) - 1));
    int Aq = H + L;
    int Ar = Aq + O;
    int As = Ar + R;
    int At = As + U;
    int Au = At + X;
    int AC = 0;
    int AOi = 0;
    int AE = AC + 512 * AOi; // S0
    int AF = AE + 0; // S1
    int AG = (((0 + H) >> 0) & ((1 << 1) - 1));
    int AH = (((0 + L) >> 0) & ((1 << 1) - 1));
    int AI = (((0 + O) >> 0) & ((1 << 1) - 1));
    int AJ = (((0 + R) >> 0) & ((1 << 1) - 1));
    int AK = (((0 + U) >> 0) & ((1 << 1) - 1));
    int AL = (((0 + X) >> 0) & ((1 << 1) - 1));
    int AOk = 1;
    int AOl = AL * AOk;
    int AOm = AOi + AOl;
    int AOn = 2;
    int AOo = AK * AOn;
    int AOp = AOm + AOo;
    int AOq = 4;
    int AOr = AJ * AOq;
    int AOs = AOp + AOr;
    int AOt = 8;
    int AOu = AI * AOt;
    int AOv = AOs + AOu;
    int AOw = 16;
    int AOx = AH * AOw;
    int AOy = AOv + AOx;
    int AOz = 32;
    int AOA = AG * AOz;
    int AOB = AOy + AOA;
    int AN = AF + 8 * AOB; // S1
    int AO = AN + 0; // S2
    int AQ = atomicAdd(_data_i32_[AO >> 2], Au);
    int Ba = AN + 4; // S3
    int Bc = atomicAdd(_data_i32_[Ba >> 2], H);
}

void main()
{
  fill_c32_00();
}
layout(local_size_x = 64 /* 1, 64 */, local_size_y = 1, local_size_z = 1) in;

[T 04/22/20 20:17:39.480] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:39.480] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:39.480] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:39.480] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:39.481] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:39.481] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:39.481] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:39.481] [opengl_api.cpp:link@129] glLinkProgram OUT
Running test on arch=Arch.x64
[T 04/22/20 20:17:39.484] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:39.484] [program.cpp:finalize@512] Program (0x56273b541240) finalized.
[T 04/22/20 20:17:39.487] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:39.487] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:39.487] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:39.487] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:39.487] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:39.487] [program.cpp:Program@133] Program (0x56273b007f00) arch=x64 initialized.
[T 04/22/20 20:17:39.488] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:39.601] [unified_allocator.cpp:UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 20:17:39.601] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 3.19e-05 s
[T 04/22/20 20:17:39.601] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 1024 B
[T 04/22/20 20:17:39.609] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:39.609] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:39.619] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:39.619] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:39.619] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143a9000
[D 04/22/20 20:17:39.621] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:39.621] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:39.621] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143ac000
[D 04/22/20 20:17:39.622] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:39.622] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:39.622] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143af000
[D 04/22/20 20:17:39.623] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:39.623] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:39.623] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143b2000
[D 04/22/20 20:17:39.624] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:39.624] [memory_pool.cpp:daemon@111]   Allocating memory 3145728 B (alignment 4096B) 
[D 04/22/20 20:17:39.624] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143b5000
[T 04/22/20 20:17:39.629] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:39.646] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c34_0...
Running test on arch=Arch.cuda
[T 04/22/20 20:17:39.689] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:39.689] [program.cpp:finalize@512] Program (0x56273b007f00) finalized.
[T 04/22/20 20:17:39.692] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:39.692] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:39.692] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:39.692] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:39.693] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:39.693] [program.cpp:Program@133] Program (0x56273c07b960) arch=cuda initialized.
[T 04/22/20 20:17:39.694] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:39.807] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:39.807] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: cuda
[T 04/22/20 20:17:39.807] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:39.807] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:40.002] [jit_cuda.cpp:add_module@90] PTX size: 82.22KB
[T 04/22/20 20:17:40.002] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:40.002] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.5109310150146484ms
[T 04/22/20 20:17:40.003] [unified_allocator.cpp:UnifiedAllocator@26] Allocating unified (CPU+GPU) address space of size 1024 MB
[T 04/22/20 20:17:40.003] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 8.61e-05 s
[T 04/22/20 20:17:40.003] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 1024 B
[T 04/22/20 20:17:40.011] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:40.011] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:40.021] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:40.021] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:40.021] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43a9000
[D 04/22/20 20:17:40.022] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:40.022] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:40.022] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43ac000
[D 04/22/20 20:17:40.023] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:40.023] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:40.023] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43af000
[D 04/22/20 20:17:40.024] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:40.024] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:40.024] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43b2000
[D 04/22/20 20:17:40.025] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:40.025] [memory_pool.cpp:daemon@111]   Allocating memory 3145728 B (alignment 4096B) 
[D 04/22/20 20:17:40.025] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43b5000
[T 04/22/20 20:17:40.046] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c36_0...
[T 04/22/20 20:17:40.233] [jit_cuda.cpp:add_module@90] PTX size: 83.99KB
[T 04/22/20 20:17:40.233] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:40.234] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.5099773406982422ms
[D 04/22/20 20:17:40.234] [codegen_cuda.cpp:operator()@84] Launching kernel fill_c36_0_kernel_34_range_for<<<896, 64>>>
[T 04/22/20 20:17:40.234] [jit_cuda.cpp:lookup_function@47] Kernel fill_c36_0_kernel_34_range_for compilation time: 0.00095367431640625ms
Running test on arch=Arch.opengl
[T 04/22/20 20:17:40.251] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:40.252] [program.cpp:finalize@512] Program (0x56273c07b960) finalized.
[T 04/22/20 20:17:40.261] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:40.261] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:40.262] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:40.262] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:40.262] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:40.262] [program.cpp:Program@133] Program (0x56273abbbac0) arch=opengl initialized.
[T 04/22/20 20:17:40.263] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:40.377] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:40.377] [program.cpp:materialize_layout@298] OpenGL root buffer size: 1024 B
[T 04/22/20 20:17:40.378] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:40.378] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:40.378] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:40.378] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:40.384] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c38_0...
[D 04/22/20 20:17:40.388] [opengl_api.cpp:display_kernel_info@290] source of kernel [fill_c38_00] * 1:
#version 430 core
#extension GL_ARB_compute_shader: enable
#extension GL_ARB_gpu_shader_int64: enable
#extension GL_NV_shader_atomic_float: enable
#extension GL_NV_shader_atomic_float64: enable
#extension GL_NV_shader_atomic_int64: enable
precision highp float;
layout(packed, binding = 0) buffer data_i32 { int _states_[2]; int _data_i32_[]; };
layout(packed, binding = 0) buffer data_f32 { int _unused1_[2]; float _data_f32_[]; };
layout(packed, binding = 0) buffer data_f64 { int _unused2_[2]; double _data_f64_[]; };

void fill_c38_00()
{ // range for
  // range known at compile time
  int _tid = int(gl_GlobalInvocationID.x);
  if (_tid >= 128) return;
  int _itv = 0 + _tid * 1;
    int G = _itv;
    int H = (((0 + G) >> 6) & ((1 << 1) - 1));
    int L = (((0 + G) >> 5) & ((1 << 1) - 1));
    int O = (((0 + G) >> 4) & ((1 << 1) - 1));
    int R = (((0 + G) >> 3) & ((1 << 1) - 1));
    int U = (((0 + G) >> 2) & ((1 << 1) - 1));
    int X = (((0 + G) >> 1) & ((1 << 1) - 1));
    int Aq = (((0 + G) >> 0) & ((1 << 1) - 1));
    int At = H + L;
    int Au = At + O;
    int Av = Au + R;
    int Aw = Av + U;
    int Ax = Aw + X;
    int Ay = Ax + Aq;
    int AH = 0;
    int B3o = 0;
    int AJ = AH + 1024 * B3o; // S0
    int AK = AJ + 0; // S1
    int AL = (((0 + H) >> 0) & ((1 << 1) - 1));
    int AM = (((0 + L) >> 0) & ((1 << 1) - 1));
    int AN = (((0 + O) >> 0) & ((1 << 1) - 1));
    int AO = (((0 + R) >> 0) & ((1 << 1) - 1));
    int AP = (((0 + U) >> 0) & ((1 << 1) - 1));
    int AQ = (((0 + X) >> 0) & ((1 << 1) - 1));
    int AR = (((0 + Aq) >> 0) & ((1 << 1) - 1));
    int B3q = 1;
    int B3r = AR * B3q;
    int B3s = B3o + B3r;
    int B3t = 2;
    int B3u = AQ * B3t;
    int B3v = B3s + B3u;
    int B3w = 4;
    int B3x = AP * B3w;
    int B3y = B3v + B3x;
    int B3z = 8;
    int B3A = AO * B3z;
    int B3B = B3y + B3A;
    int B3C = 16;
    int B3D = AN * B3C;
    int B3E = B3B + B3D;
    int B3F = 32;
    int B3G = AM * B3F;
    int B3H = B3E + B3G;
    int B3I = 64;
    int B3J = AL * B3I;
    int B3K = B3H + B3J;
    int AT = AK + 8 * B3K; // S1
    int AU = AT + 0; // S2
    int AW = atomicAdd(_data_i32_[AU >> 2], Ay);
    int Bi = AT + 4; // S3
    int Bk = atomicAdd(_data_i32_[Bi >> 2], H);
}

void main()
{
  fill_c38_00();
}
layout(local_size_x = 128 /* 1, 128 */, local_size_y = 1, local_size_z = 1) in;

[T 04/22/20 20:17:40.388] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:40.388] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:40.388] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:40.388] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:40.389] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:40.389] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:40.389] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:40.389] [opengl_api.cpp:link@129] glLinkProgram OUT
Running test on arch=Arch.x64
[T 04/22/20 20:17:40.397] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:40.398] [program.cpp:finalize@512] Program (0x56273abbbac0) finalized.
[T 04/22/20 20:17:40.401] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:40.401] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:40.401] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:40.401] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:40.401] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:40.401] [program.cpp:Program@133] Program (0x56273b13f180) arch=x64 initialized.
[T 04/22/20 20:17:40.402] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:40.516] [unified_allocator.cpp:UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 04/22/20 20:17:40.516] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 2.10e-05 s
[T 04/22/20 20:17:40.516] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 2048 B
[T 04/22/20 20:17:40.524] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:40.524] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:40.534] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:40.534] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:40.534] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143a9000
[D 04/22/20 20:17:40.535] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:40.535] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:40.535] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143ac000
[D 04/22/20 20:17:40.536] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:40.536] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:40.536] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143af000
[D 04/22/20 20:17:40.537] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:40.537] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:40.537] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143b2000
[D 04/22/20 20:17:40.538] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:40.538] [memory_pool.cpp:daemon@111]   Allocating memory 3145728 B (alignment 4096B) 
[D 04/22/20 20:17:40.538] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f41143b5000
[T 04/22/20 20:17:40.544] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:40.562] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c40_0...
Running test on arch=Arch.cuda
[T 04/22/20 20:17:40.607] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:40.608] [program.cpp:finalize@512] Program (0x56273b13f180) finalized.
[T 04/22/20 20:17:40.610] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:40.610] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:40.610] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:40.610] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:40.610] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:40.610] [program.cpp:Program@133] Program (0x56273c6e92b0) arch=cuda initialized.
[T 04/22/20 20:17:40.611] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:40.723] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:40.723] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: cuda
[T 04/22/20 20:17:40.723] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:40.723] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:40.922] [jit_cuda.cpp:add_module@90] PTX size: 82.22KB
[T 04/22/20 20:17:40.922] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:40.923] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.4589557647705078ms
[T 04/22/20 20:17:40.923] [unified_allocator.cpp:UnifiedAllocator@26] Allocating unified (CPU+GPU) address space of size 1024 MB
[T 04/22/20 20:17:40.923] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 7.30e-05 s
[T 04/22/20 20:17:40.923] [program.cpp:initialize_runtime_system@199] Allocating data structure of size 2048 B
[T 04/22/20 20:17:40.931] [program.cpp:initialize_runtime_system@208] LLVMRuntime initialized
[T 04/22/20 20:17:40.931] [program.cpp:initialize_runtime_system@210] LLVMRuntime pointer fetched
[D 04/22/20 20:17:40.941] [memory_pool.cpp:daemon@104] Processing memory alloc request 0
[D 04/22/20 20:17:40.941] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:40.941] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43a9000
[D 04/22/20 20:17:40.942] [memory_pool.cpp:daemon@104] Processing memory alloc request 1
[D 04/22/20 20:17:40.942] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:40.942] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43ac000
[D 04/22/20 20:17:40.943] [memory_pool.cpp:daemon@104] Processing memory alloc request 2
[D 04/22/20 20:17:40.943] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:40.943] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43af000
[D 04/22/20 20:17:40.944] [memory_pool.cpp:daemon@104] Processing memory alloc request 3
[D 04/22/20 20:17:40.944] [memory_pool.cpp:daemon@111]   Allocating memory 8232 B (alignment 4096B) 
[D 04/22/20 20:17:40.944] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43b2000
[D 04/22/20 20:17:40.945] [memory_pool.cpp:daemon@104] Processing memory alloc request 4
[D 04/22/20 20:17:40.945] [memory_pool.cpp:daemon@111]   Allocating memory 3145728 B (alignment 4096B) 
[D 04/22/20 20:17:40.945] [memory_pool.cpp:daemon@113]   Allocated. Ptr = 0x7f40c43b5000
[T 04/22/20 20:17:40.966] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c42_0...
[T 04/22/20 20:17:41.153] [jit_cuda.cpp:add_module@90] PTX size: 84.04KB
[T 04/22/20 20:17:41.153] [jit_cuda.cpp:add_module@92] Loading module...
[T 04/22/20 20:17:41.153] [jit_cuda.cpp:add_module@97] CUDA module load time : 0.45800209045410156ms
[D 04/22/20 20:17:41.154] [codegen_cuda.cpp:operator()@84] Launching kernel fill_c42_0_kernel_40_range_for<<<896, 64>>>
[T 04/22/20 20:17:41.154] [jit_cuda.cpp:lookup_function@47] Kernel fill_c42_0_kernel_40_range_for compilation time: 0.014781951904296875ms
Running test on arch=Arch.opengl
[T 04/22/20 20:17:41.171] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:41.172] [program.cpp:finalize@512] Program (0x56273c6e92b0) finalized.
[T 04/22/20 20:17:41.181] [program.cpp:Program@47] Program initializing...
[T 04/22/20 20:17:41.181] [memory_pool.cpp:MemoryPool@9] Memory pool created. Default buffer size per allocator = 1024 MB
[T 04/22/20 20:17:41.181] [llvm_context.cpp:TaichiLLVMContext@45] Creating Taichi llvm context for arch: x64
[T 04/22/20 20:17:41.181] [llvm_context.cpp:get_this_thread_data@615] Creating thread local data for thread 139921867896640
[T 04/22/20 20:17:41.181] [llvm_context.cpp:TaichiLLVMContext@70] Taichi llvm context created.
[T 04/22/20 20:17:41.181] [program.cpp:Program@133] Program (0x56273c85ee90) arch=opengl initialized.
[T 04/22/20 20:17:41.182] [/home/yuanming/repos/taichi/python/taichi/lang/expr.py:__setitem__@175] Materializing layout...
[T 04/22/20 20:17:41.298] [program.cpp:materialize_layout@273] materialize_layout called
[T 04/22/20 20:17:41.298] [program.cpp:materialize_layout@298] OpenGL root buffer size: 2048 B
[T 04/22/20 20:17:41.299] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:41.299] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:41.299] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:41.299] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:41.314] [/home/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@414] Compiling kernel fill_c44_0...
[D 04/22/20 20:17:41.318] [opengl_api.cpp:display_kernel_info@290] source of kernel [fill_c44_00] * 1:
#version 430 core
#extension GL_ARB_compute_shader: enable
#extension GL_ARB_gpu_shader_int64: enable
#extension GL_NV_shader_atomic_float: enable
#extension GL_NV_shader_atomic_float64: enable
#extension GL_NV_shader_atomic_int64: enable
precision highp float;
layout(packed, binding = 0) buffer data_i32 { int _states_[2]; int _data_i32_[]; };
layout(packed, binding = 0) buffer data_f32 { int _unused1_[2]; float _data_f32_[]; };
layout(packed, binding = 0) buffer data_f64 { int _unused2_[2]; double _data_f64_[]; };

void fill_c44_00()
{ // range for
  // range known at compile time
  int _tid = int(gl_GlobalInvocationID.x);
  if (_tid >= 256) return;
  int _itv = 0 + _tid * 1;
    int G = _itv;
    int H = (((0 + G) >> 7) & ((1 << 1) - 1));
    int L = (((0 + G) >> 6) & ((1 << 1) - 1));
    int O = (((0 + G) >> 5) & ((1 << 1) - 1));
    int R = (((0 + G) >> 4) & ((1 << 1) - 1));
    int U = (((0 + G) >> 3) & ((1 << 1) - 1));
    int X = (((0 + G) >> 2) & ((1 << 1) - 1));
    int Aq = (((0 + G) >> 1) & ((1 << 1) - 1));
    int At = (((0 + G) >> 0) & ((1 << 1) - 1));
    int Aw = H + L;
    int Ax = Aw + O;
    int Ay = Ax + R;
    int Az = Ay + U;
    int AA = Az + X;
    int AB = AA + Aq;
    int AC = AB + At;
    int AM = 0;
    int Bk8 = 0;
    int AO = AM + 2048 * Bk8; // S0
    int AP = AO + 0; // S1
    int AQ = (((0 + H) >> 0) & ((1 << 1) - 1));
    int AR = (((0 + L) >> 0) & ((1 << 1) - 1));
    int AS = (((0 + O) >> 0) & ((1 << 1) - 1));
    int AT = (((0 + R) >> 0) & ((1 << 1) - 1));
    int AU = (((0 + U) >> 0) & ((1 << 1) - 1));
    int AV = (((0 + X) >> 0) & ((1 << 1) - 1));
    int AW = (((0 + Aq) >> 0) & ((1 << 1) - 1));
    int AX = (((0 + At) >> 0) & ((1 << 1) - 1));
    int Bka = 1;
    int Bkb = AX * Bka;
    int Bkc = Bk8 + Bkb;
    int Bkd = 2;
    int Bke = AW * Bkd;
    int Bkf = Bkc + Bke;
    int Bkg = 4;
    int Bkh = AV * Bkg;
    int Bki = Bkf + Bkh;
    int Bkj = 8;
    int Bkk = AU * Bkj;
    int Bkl = Bki + Bkk;
    int Bkm = 16;
    int Bkn = AT * Bkm;
    int Bko = Bkl + Bkn;
    int Bkp = 32;
    int Bkq = AS * Bkp;
    int Bkr = Bko + Bkq;
    int Bks = 64;
    int Bkt = AR * Bks;
    int Bku = Bkr + Bkt;
    int Bkv = 128;
    int Bkw = AQ * Bkv;
    int Bkx = Bku + Bkw;
    int AZ = AP + 8 * Bkx; // S1
    int B0 = AZ + 0; // S2
    int B2 = atomicAdd(_data_i32_[B0 >> 2], AC);
    int Bq = AZ + 4; // S3
    int Bs = atomicAdd(_data_i32_[Bq >> 2], H);
}

void main()
{
  fill_c44_00();
}
layout(local_size_x = 256 /* 1, 256 */, local_size_y = 1, local_size_z = 1) in;

[T 04/22/20 20:17:41.318] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:41.318] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:41.318] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:41.318] [opengl_api.cpp:link@129] glLinkProgram OUT
[T 04/22/20 20:17:41.319] [opengl_api.cpp:compile@87] glCompileShader IN
[T 04/22/20 20:17:41.319] [opengl_api.cpp:compile@89] glCompileShader OUT
[T 04/22/20 20:17:41.319] [opengl_api.cpp:link@127] glLinkProgram IN
[T 04/22/20 20:17:41.319] [opengl_api.cpp:link@129] glLinkProgram OUT
.

==================================================== 1 passed in 7.31s =====================================================
[T 04/22/20 20:17:41.374] [program.cpp:finalize@481] Program finalizing...
[T 04/22/20 20:17:41.375] [program.cpp:finalize@512] Program (0x56273c85ee90) finalized.
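
For readers skimming the log, here is one way to read the generated fill_c44_00 shader above. This is a minimal sketch in plain Python, not Taichi code: fill_reference and its variable names are made up for this note, and the flat list only stands in for the _data_i32_ buffer. It assumes each element occupies 8 bytes (two i32 slots), as the 8 * Bkx stride and the +4 offset in the shader suggest.

```python
def fill_reference(num_bits=8):
    """Emulate the per-thread arithmetic of the dumped shader on the CPU."""
    n = 1 << num_bits
    data = [0] * (2 * n)  # stand-in for _data_i32_: two i32 slots per element
    for tid in range(n):  # one iteration per GPU invocation
        # Extract the bits of the thread id, most significant first
        # (H, L, O, ... in the shader).
        shifts = list(reversed(range(num_bits)))
        bits = [(tid >> k) & 1 for k in shifts]
        bit_sum = sum(bits)  # the H + L + O + ... chain
        # Re-linearize the bits with weights 1, 2, 4, ... (the Bk* chain).
        linear = sum(b << k for b, k in zip(bits, shifts))
        byte_off = 8 * linear  # 8-byte stride per element
        data[byte_off >> 2] += bit_sum          # first atomicAdd
        data[(byte_off + 4) >> 2] += bits[0]    # second atomicAdd (adds H)
    return data


data = fill_reference(8)
print(data[2 * 5], data[2 * 5 + 1])      # slots written by thread 5
print(data[2 * 255], data[2 * 255 + 1])  # slots written by thread 255
```

In this sketch the re-linearized index always equals the thread id, so the extract-then-recombine sequence is pure overhead for this layout.
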
xumingkuan commented 4 years ago

Looks like the only difference is the final two lines (the Program finalizing/finalized messages printed after the pytest summary). I've never seen them on Windows.

xumingkuan commented 4 years ago

Anyway, as long as the statistics and test names are mismatched in the same way with and without advanced optimization, we can still get a plausible comparison.

xumingkuan commented 4 years ago

@yuanming-hu I found an issue when doing CSE for global pointers.

Case: test_ad_if_parallel_complex

Before (good):

[I 06/30/20 20:38:44.108] [compile_to_offloads.cpp:taichi::lang::irpass::compile_to_offloads::<lambda_a4464fe7c75e1f42a3a490ee54c7ec3e>::operator ()@23] Simplified I:
kernel {
  <f32 x1> $0 = const [1.0]
  <i32 x1> $1 = const [1]
  <i32 x1> $2 = const [0]
  <f32 x1> $3 = const [0.0]
  <i32 x1> $4 = const [2]
  $5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
    <i32 x1> $6 = loop $5 index 0
    <f32 x1> $7 = alloca
    <f32 x1> $8 : local store [$7 <- $3]
    <f32*x1> $9 = global ptr [S2place_f32], index [$6] activate=true
    <f32 x1> $10 = global load $9
    <i32 x1> $11 = cmp_gt $10 $3
    <i32 x1> $12 = bit_and $11 $1
    $13 : if $12 {
      <f32*x1> $14 = global ptr [S2place_f32], index [$6] activate=true
      <f32 x1> $15 = global load $14
      <f32 x1> $16 = div $0 $15
      <f32 x1> $17 : local store [$7 <- $16]
    }
    <f32 x1> $18 = local load [ [$7[0]]]
    <f32*x1> $19 = global ptr [S4place_f32], index [$6] activate=true
    <f32*x1> $20 : global store [$19 <- $18]
  }
}
[I 06/30/20 20:38:44.110] [compile_to_offloads.cpp:taichi::lang::irpass::compile_to_offloads::<lambda_a4464fe7c75e1f42a3a490ee54c7ec3e>::operator ()@23] Gradient:
kernel {
  <f32 x1> $0 = const [1.0]
  <i32 x1> $1 = const [1]
  <i32 x1> $2 = const [0]
  <f32 x1> $3 = const [0.0]
  <i32 x1> $4 = const [2]
  $5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
    <f32 x1> $6 = alloca
    <f32 x1> $7 = alloca
    <f32 x1> $8 = alloca
    <f32 x1> $9 = alloca
    <i32 x1> $10 = loop $5 index 0
    <f32 x1> $11 = stack alloc (max_size=16)
    <f32 x1> $12 : stack push $11, val = $3
    <f32*x1> $13 = global ptr [S2place_f32], index [$10] activate=true
    <f32 x1> $14 = global load $13
    <i32 x1> $15 = cmp_gt $14 $3
    <i32 x1> $16 = bit_and $15 $1
    $17 : if $16 {
      <f32*x1> $18 = global ptr [S2place_f32], index [$10] activate=true
      <f32 x1> $19 = global load $18
      <f32 x1> $20 : local store [$6 <- $19]
      <f32 x1> $21 = div $0 $19
      <f32 x1> $22 : stack push $11, val = $21
    }
    <f32*x1> $23 = global ptr [S4place_f32], index [$10] activate=true
    <f32*x1> $24 = global ptr [S6place_f32], index [$10] activate=true
    <f32 x1> $25 = global load $24
    <f32 x1> $26 : stack acc adj $11, val = $25
    $27 : if $16 {
      <f32 x1> $28 = stack load top adj $11
      <f32 x1> $29 = local load [ [$9[0]]]
      <f32 x1> $30 = add $29 $28
      <f32 x1> $31 : local store [$9 <- $30]
      <f32 x1> $32 : stack pop $11
      <f32 x1> $33 = local load [ [$6[0]]]
      <f32 x1> $34 = div $30 $33
      <f32 x1> $35 = local load [ [$8[0]]]
      <f32 x1> $36 = add $35 $34
      <f32 x1> $37 : local store [$8 <- $36]
      <f32 x1> $38 = mul $33 $33
      <f32 x1> $39 = div $30 $38
      <f32 x1> $40 = neg $39
      <f32 x1> $41 = local load [ [$7[0]]]
      <f32 x1> $42 = add $41 $40
      <f32 x1> $43 : local store [$7 <- $42]
      <f32*x1> $44 = global ptr [S5place_f32], index [$10] activate=true
      <f32 x1> $45 = atomic add($44, $42)
    }
    <f32*x1> $46 = global ptr [S5place_f32], index [$10] activate=true
    <f32 x1> $47 = atomic add($46, $3)
    <f32 x1> $48 : stack pop $11
  }
}
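
In the gradient kernel above, the reverse-mode division by $38 (the square of the loaded value) sits inside if $16, so it only runs when the loaded value is greater than 0. Here is a minimal sketch in plain Python (not Taichi) of why that guard can matter. The names ieee_div, grad_branch and grad_select are hypothetical and made up for this note, and the select-style variant is only an illustration of what an unguarded division would do, not a claim about any particular pass.

```python
import math


def ieee_div(a, b):
    """Divide like IEEE-754 hardware: return inf/nan instead of raising."""
    if b != 0.0:
        return a / b
    if a == 0.0 or math.isnan(a):
        return math.nan
    return math.copysign(math.inf, a) * math.copysign(1.0, b)


def grad_branch(x, adj_y):
    """Reverse pass of t = 1/x if x > 0 else 0, with the division guarded."""
    adj_x = 0.0
    if x > 0.0:
        adj_x += -ieee_div(adj_y, x * x)  # d(1/x)/dx = -1/x^2
    return adj_x


def grad_select(x, adj_y):
    """Hypothetical select-style reverse pass: the division always runs."""
    adj_t = adj_y if x > 0.0 else 0.0  # select(cond, adj_y, 0)
    return -ieee_div(adj_t, x * x)


for x in (2.0, 0.0):
    print(x, grad_branch(x, 1.0), grad_select(x, 1.0))
```

Both variants agree for x = 2.0, but for x = 0.0 the guarded version yields 0.0 while the unguarded one evaluates 0/0 and yields NaN.
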

After (bad, with some debug output in full_simplify()):

[I 06/30/20 20:43:33.360] [compile_to_offloads.cpp:taichi::lang::irpass::compile_to_offloads::<lambda_a4464fe7c75e1f42a3a490ee54c7ec3e>::operator ()@23] Simplified I:
kernel {
  <f32 x1> $0 = const [1.0]
  <i32 x1> $1 = const [1]
  <i32 x1> $2 = const [0]
  <f32 x1> $3 = const [0.0]
  <i32 x1> $4 = const [2]
  $5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
    <i32 x1> $6 = loop $5 index 0
    <f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
    <f32 x1> $8 = global load $7
    <i32 x1> $9 = cmp_gt $8 $3
    <i32 x1> $10 = bit_and $9 $1
    <f32 x1> $11 = div $0 $8
    <f32 x1> $12 = select($10, $11, $3)
    <f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
    <f32*x1> $14 : global store [$13 <- $12]
  }
}
before simplify
kernel {
  <f32 x1> $0 = const [1.0]
  <i32 x1> $1 = const [1]
  <i32 x1> $2 = const [0]
  <f32 x1> $3 = const [0.0]
  <i32 x1> $4 = const [2]
  $5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
    <i32 x1> $6 = loop $5 index 0
    <f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
    <f32 x1> $8 = global load $7
    <i32 x1> $9 = cmp_gt $8 $3
    <i32 x1> $10 = bit_and $9 $1
    <f32 x1> $11 = div $0 $8
    <f32 x1> $12 = select($10, $11, $3)
    <f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
    <f32*x1> $14 : global store [$13 <- $12]
  }
}
after simplify
kernel {
  <f32 x1> $0 = const [1.0]
  <i32 x1> $1 = const [1]
  <i32 x1> $2 = const [0]
  <f32 x1> $3 = const [0.0]
  <i32 x1> $4 = const [2]
  $5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
    <i32 x1> $6 = loop $5 index 0
    <f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
    <f32 x1> $8 = global load $7
    <i32 x1> $9 = cmp_gt $8 $3
    <i32 x1> $10 = bit_and $9 $1
    <f32 x1> $11 = div $0 $8
    <f32 x1> $12 = select($10, $11, $3)
    <f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
    <f32*x1> $14 : global store [$13 <- $12]
  }
}
after cse
kernel {
  <f32 x1> $0 = const [1.0]
  <i32 x1> $1 = const [1]
  <i32 x1> $2 = const [0]
  <f32 x1> $3 = const [0.0]
  <i32 x1> $4 = const [2]
  $5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
    <i32 x1> $6 = loop $5 index 0
    <f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
    <f32 x1> $8 = global load $7
    <i32 x1> $9 = cmp_gt $8 $3
    <i32 x1> $10 = bit_and $9 $1
    <f32 x1> $11 = div $0 $8
    <f32 x1> $12 = select($10, $11, $3)
    <f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
    <f32*x1> $14 : global store [$13 <- $12]
  }
}
before simplify
kernel {
  <f32 x1> $205 = const [0.0]
  <i32 x1> $1 = const [1]
  <i32 x1> $2 = const [0]
  <f32 x1> $3 = const [0.0]
  <i32 x1> $4 = const [2]
  $5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
    <f32 x1> $229 = alloca
    <f32 x1> $220 = alloca
    <f32 x1> $214 = alloca
    <f32 x1> $208 = alloca
    <f32 x1> $201 = alloca
    <i32 x1> $6 = loop $5 index 0
    <f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
    <f32 x1> $8 = global load $7
    <i32 x1> $9 = cmp_gt $8 $3
    <i32 x1> $10 = bit_and $9 $1
    <f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
    <f32*x1> $199 = global ptr [S6place_f32], index [$6] activate=true
    <f32 x1> $200 = global load $199
    <f32 x1> $202 = local load [ [$201[0]]]
    <f32 x1> $203 = add $202 $200
    <f32 x1> $204 : local store [$201 <- $203]
    <f32 x1> $206 = local load [ [$201[0]]]
    <f32 x1> $207 = select($10, $206, $205)
    <f32 x1> $209 = local load [ [$208[0]]]
    <f32 x1> $210 = add $209 $207
    <f32 x1> $211 : local store [$208 <- $210]
    <f32 x1> $212 = local load [ [$201[0]]]
    <f32 x1> $213 = select($10, $205, $212)
    <f32 x1> $215 = local load [ [$214[0]]]
    <f32 x1> $216 = add $215 $213
    <f32 x1> $217 : local store [$214 <- $216]
    <f32 x1> $218 = local load [ [$208[0]]]
    <f32 x1> $219 = div $218 $8
    <f32 x1> $221 = local load [ [$220[0]]]
    <f32 x1> $222 = add $221 $219
    <f32 x1> $223 : local store [$220 <- $222]
    <f32 x1> $224 = mul $8 $8
    <f32 x1> $225 = local load [ [$208[0]]]
    <f32 x1> $227 = div $225 $224
    <f32 x1> $228 = neg $227
    <f32 x1> $230 = local load [ [$229[0]]]
    <f32 x1> $231 = add $230 $228
    <f32 x1> $232 : local store [$229 <- $231]
    <f32*x1> $233 = global ptr [S5place_f32], index [$6] activate=true
    <f32 x1> $234 = local load [ [$229[0]]]
    <f32 x1> $235 = atomic add($233, $234)
  }
}
after simplify
kernel {
  <f32 x1> $205 = const [0.0]
  <i32 x1> $1 = const [1]
  <i32 x1> $2 = const [0]
  <f32 x1> $3 = const [0.0]
  <i32 x1> $4 = const [2]
  $5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
    <i32 x1> $6 = loop $5 index 0
    <f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
    <f32 x1> $8 = global load $7
    <i32 x1> $9 = cmp_gt $8 $3
    <i32 x1> $10 = bit_and $9 $1
    <f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
    <f32*x1> $199 = global ptr [S6place_f32], index [$6] activate=true
    <f32 x1> $200 = global load $199
    <f32 x1> $236 = const [0.0]
    <f32 x1> $203 = add $236 $200
    <f32 x1> $207 = select($10, $203, $205)
    <f32 x1> $237 = const [0.0]
    <f32 x1> $210 = add $237 $207
    <f32 x1> $224 = mul $8 $8
    <f32 x1> $227 = div $210 $224
    <f32 x1> $228 = neg $227
    <f32 x1> $240 = const [0.0]
    <f32 x1> $231 = add $240 $228
    <f32*x1> $233 = global ptr [S5place_f32], index [$6] activate=true
    <f32 x1> $235 = atomic add($233, $231)
  }
}
after cse
kernel {
  <f32 x1> $205 = const [0.0]
  <i32 x1> $1 = const [1]
  <i32 x1> $2 = const [0]
  <i32 x1> $4 = const [2]
  $5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
    <i32 x1> $6 = loop $5 index 0
    <f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
    <f32 x1> $8 = global load $7
    <i32 x1> $9 = cmp_gt $8 $205
    <i32 x1> $10 = bit_and $9 $1
    <f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
    <f32*x1> $199 = global ptr [S6place_f32], index [$6] activate=true
    <f32 x1> $200 = global load $199
    <f32 x1> $203 = add $205 $200
    <f32 x1> $207 = select($10, $203, $205)
    <f32 x1> $210 = add $205 $207
    <f32 x1> $224 = mul $8 $8
    <f32 x1> $227 = div $210 $224
    <f32 x1> $228 = neg $227
    <f32 x1> $231 = add $205 $228
    <f32*x1> $233 = global ptr [S5place_f32], index [$6] activate=true
    <f32 x1> $235 = atomic add($233, $231)
  }
}
before simplify
kernel {
  <f32 x1> $205 = const [0.0]
  <i32 x1> $1 = const [1]
  <i32 x1> $2 = const [0]
  <i32 x1> $4 = const [2]
  $5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
    <i32 x1> $6 = loop $5 index 0
    <f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
    <f32 x1> $8 = global load $7
    <i32 x1> $9 = cmp_gt $8 $205
    <i32 x1> $10 = bit_and $9 $1
    <f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
    <f32*x1> $199 = global ptr [S6place_f32], index [$6] activate=true
    <f32 x1> $200 = global load $199
    <f32 x1> $207 = select($10, $200, $205)
    <f32 x1> $224 = mul $8 $8
    <f32 x1> $227 = div $207 $224
    <f32 x1> $228 = neg $227
    <f32*x1> $233 = global ptr [S5place_f32], index [$6] activate=true
    <f32 x1> $235 = atomic add($233, $228)
  }
}
after simplify
kernel {
  <f32 x1> $205 = const [0.0]
  <i32 x1> $1 = const [1]
  <i32 x1> $2 = const [0]
  <i32 x1> $4 = const [2]
  $5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
    <i32 x1> $6 = loop $5 index 0
    <f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
    <f32 x1> $8 = global load $7
    <i32 x1> $9 = cmp_gt $8 $205
    <i32 x1> $10 = bit_and $9 $1
    <f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
    <f32*x1> $199 = global ptr [S6place_f32], index [$6] activate=true
    <f32 x1> $200 = global load $199
    <f32 x1> $207 = select($10, $200, $205)
    <f32 x1> $224 = mul $8 $8
    <f32 x1> $227 = div $207 $224
    <f32 x1> $228 = neg $227
    <f32*x1> $233 = global ptr [S5place_f32], index [$6] activate=true
    <f32 x1> $235 = atomic add($233, $228)
  }
}
after cse
kernel {
  <f32 x1> $205 = const [0.0]
  <i32 x1> $1 = const [1]
  <i32 x1> $2 = const [0]
  <i32 x1> $4 = const [2]
  $5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
    <i32 x1> $6 = loop $5 index 0
    <f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
    <f32 x1> $8 = global load $7
    <i32 x1> $9 = cmp_gt $8 $205
    <i32 x1> $10 = bit_and $9 $1
    <f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
    <f32*x1> $199 = global ptr [S6place_f32], index [$6] activate=true
    <f32 x1> $200 = global load $199
    <f32 x1> $207 = select($10, $200, $205)
    <f32 x1> $224 = mul $8 $8
    <f32 x1> $227 = div $207 $224
    <f32 x1> $228 = neg $227
    <f32*x1> $233 = global ptr [S5place_f32], index [$6] activate=true
    <f32 x1> $235 = atomic add($233, $228)
  }
}
before simplify
kernel {
  <f32 x1> $205 = const [0.0]
  <i32 x1> $1 = const [1]
  <i32 x1> $2 = const [0]
  <i32 x1> $4 = const [2]
  $5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
    <i32 x1> $6 = loop $5 index 0
    <f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
    <f32 x1> $8 = global load $7
    <i32 x1> $9 = cmp_gt $8 $205
    <i32 x1> $10 = bit_and $9 $1
    <f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
    <f32*x1> $199 = global ptr [S6place_f32], index [$6] activate=true
    <f32 x1> $200 = global load $199
    <f32 x1> $207 = select($10, $200, $205)
    <f32 x1> $224 = mul $8 $8
    <f32 x1> $227 = div $207 $224
    <f32 x1> $228 = neg $227
    <f32*x1> $233 = global ptr [S5place_f32], index [$6] activate=true
    <f32 x1> $235 = atomic add($233, $228)
  }
}
after simplify
kernel {
  <f32 x1> $205 = const [0.0]
  <i32 x1> $1 = const [1]
  <i32 x1> $2 = const [0]
  <i32 x1> $4 = const [2]
  $5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
    <i32 x1> $6 = loop $5 index 0
    <f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
    <f32 x1> $8 = global load $7
    <i32 x1> $9 = cmp_gt $8 $205
    <i32 x1> $10 = bit_and $9 $1
    <f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
    <f32*x1> $199 = global ptr [S6place_f32], index [$6] activate=true
    <f32 x1> $200 = global load $199
    <f32 x1> $207 = select($10, $200, $205)
    <f32 x1> $224 = mul $8 $8
    <f32 x1> $227 = div $207 $224
    <f32 x1> $228 = neg $227
    <f32*x1> $233 = global ptr [S5place_f32], index [$6] activate=true
    <f32 x1> $235 = atomic add($233, $228)
  }
}
after cse
kernel {
  <f32 x1> $205 = const [0.0]
  <i32 x1> $1 = const [1]
  <i32 x1> $2 = const [0]
  <i32 x1> $4 = const [2]
  $5 : for in range($2, $4) (vectorize 1) block_dim=adaptive {
    <i32 x1> $6 = loop $5 index 0
    <f32*x1> $7 = global ptr [S2place_f32], index [$6] activate=true
    <f32 x1> $8 = global load $7
    <i32 x1> $9 = cmp_gt $8 $205
    <i32 x1> $10 = bit_and $9 $1
    <f32*x1> $13 = global ptr [S4place_f32], index [$6] activate=true
    <f32*x1> $199 = global ptr [S6place_f32], index [$6] activate=true
    <f32 x1> $200 = global load $199
    <f32 x1> $207 = select($10, $200, $205)
    <f32 x1> $224 = mul $8 $8
    <f32 x1> $227 = div $207 $224
    <f32 x1> $228 = neg $227
    <f32*x1> $233 = global ptr [S5place_f32], index [$6] activate=true
    <f32 x1> $235 = atomic add($233, $228)
  }
}
[I 06/30/20 20:43:33.386] [compile_to_offloads.cpp:taichi::lang::irpass::compile_to_offloads::<lambda_a4464fe7c75e1f42a3a490ee54c7ec3e>::operator ()@23] Gradient:
kernel {
  <f32 x1> $0 = const [0.0]
  <i32 x1> $1 = const [1]
  <i32 x1> $2 = const [0]
  <i32 x1> $3 = const [2]
  $4 : for in range($2, $3) (vectorize 1) block_dim=adaptive {
    <i32 x1> $5 = loop $4 index 0
    <f32*x1> $6 = global ptr [S2place_f32], index [$5] activate=true
    <f32 x1> $7 = global load $6
    <i32 x1> $8 = cmp_gt $7 $0
    <i32 x1> $9 = bit_and $8 $1
    <f32*x1> $10 = global ptr [S4place_f32], index [$5] activate=true
    <f32*x1> $11 = global ptr [S6place_f32], index [$5] activate=true
    <f32 x1> $12 = global load $11
    <f32 x1> $13 = select($9, $12, $0)
    <f32 x1> $14 = mul $7 $7
    <f32 x1> $15 = div $13 $14
    <f32 x1> $16 = neg $15
    <f32*x1> $17 = global ptr [S5place_f32], index [$5] activate=true
    <f32 x1> $18 = atomic add($17, $16)
  }
}

I think that although the two IRs at the "Simplified I" stage are different, both look pretty good. After the "Gradient" stage, however, the latter (the `select`-based one) becomes wrong.

xumingkuan commented 4 years ago

Final IR:

Good:

[I 06/30/20 20:38:44.166] [compile_to_offloads.cpp:taichi::lang::irpass::compile_to_offloads::<lambda_a4464fe7c75e1f42a3a490ee54c7ec3e>::operator ()@23] Simplified III:
kernel {
  $0 = offloaded range_for(0, 2) block_dim=adaptive
  body {
    <f32 x1> $1 = const [1.0]
    <f32 x1> $2 = alloca
    <i32 x1> $3 = loop $0 index 0
    <f32 x1> $4 = stack alloc (max_size=16)
    <f32 x1> $5 = const [0.0]
    <f32 x1> $6 : stack push $4, val = $5
    <gen*x1> $7 = get root
    <i32 x1> $8 = const [0]
    <gen*x1> $9 = [S0root][root]::lookup($7, $8) activate = false
    <gen*x1> $10 = get child [S0root->S1dense] $9
    <i32 x1> $11 = const [1]
    <gen*x1> $12 = [S1dense][dense]::lookup($10, $3) activate = false
    <f32*x1> $13 = get child [S1dense->S2place_f32] $12
    <f32 x1> $14 = global load $13
    <i32 x1> $15 = cmp_gt $14 $5
    <i32 x1> $16 = bit_and $15 $11
    $17 : if $16 {
      <f32 x1> $18 = global load $13
      <f32 x1> $19 : local store [$2 <- $18]
      <f32 x1> $20 = div $1 $18
      <f32 x1> $21 : stack push $4, val = $20
    }
    <gen*x1> $22 = get child [S0root->S3dense] $9
    <gen*x1> $23 = [S3dense][dense]::lookup($22, $3) activate = false
    <f32*x1> $24 = get child [S3dense->S6place_f32] $23
    <f32 x1> $25 = global load $24
    <f32 x1> $26 : stack acc adj $4, val = $25
    <f32 x1> $27 = stack load top adj $4
    <f32 x1> $28 = local load [ [$2[0]]]
    <f32 x1> $29 = mul $28 $28   <--- probably 0*0
    <f32 x1> $30 = div $27 $29   <--- nan
    <f32 x1> $31 = neg $30   <--- nan
    <f32*x1> $32 = get child [S1dense->S5place_f32] $12
    <f32 x1> $33 = global load $32
    <f32 x1> $34 = add $33 $31   <--- nan
    $35 : if $16 {   <--- good!
      <f32*x1> $36 : global store [$32 <- $34]
    }
    <f32 x1> $37 = global load $32
    <f32 x1> $38 : global store [$32 <- $37]
  }
}

Bad (nan):

[I 06/30/20 20:43:33.481] [compile_to_offloads.cpp:taichi::lang::irpass::compile_to_offloads::<lambda_a4464fe7c75e1f42a3a490ee54c7ec3e>::operator ()@23] Simplified III:
kernel {
  $0 = offloaded range_for(0, 2) block_dim=adaptive
  body {
    <i32 x1> $1 = loop $0 index 0
    <gen*x1> $2 = get root
    <i32 x1> $3 = const [0]
    <gen*x1> $4 = [S0root][root]::lookup($2, $3) activate = false
    <gen*x1> $5 = get child [S0root->S1dense] $4
    <i32 x1> $6 = const [1]
    <gen*x1> $7 = [S1dense][dense]::lookup($5, $1) activate = false
    <f32*x1> $8 = get child [S1dense->S2place_f32] $7
    <f32 x1> $9 = global load $8
    <f32 x1> $10 = const [0.0]
    <i32 x1> $11 = cmp_gt $9 $10
    <i32 x1> $12 = bit_and $11 $6
    <gen*x1> $13 = get child [S0root->S3dense] $4
    <gen*x1> $14 = [S3dense][dense]::lookup($13, $1) activate = false
    <f32*x1> $15 = get child [S3dense->S6place_f32] $14
    <f32 x1> $16 = global load $15
    <f32 x1> $17 = select($12, $16, $10)
    <f32 x1> $18 = mul $9 $9   <--- probably 0*0
    <f32 x1> $19 = div $17 $18  <--- nan
    <f32 x1> $20 = neg $19   <--- nan
    <f32*x1> $21 = get child [S1dense->S5place_f32] $7
    <f32 x1> $22 = global load $21
    <f32 x1> $23 = add $22 $20   <--- nan
    <f32 x1> $24 : global store [$21 <- $23]   <--- bad
  }
}
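
To make the difference between the two final IRs concrete, here is a minimal plain-Python sketch (not Taichi code) that emulates each of them for a single element. The names `ieee_div`, `grad_guarded`, and `grad_select` are hypothetical helpers made up for this illustration, and the sketch assumes that an `alloca` which is never stored to reads as 0, as the "probably 0*0" annotation above suggests.

```python
import math


def ieee_div(a: float, b: float) -> float:
    """Divide the way IEEE 754 hardware does, instead of raising ZeroDivisionError."""
    try:
        return a / b
    except ZeroDivisionError:
        if a == 0.0 or math.isnan(a):
            return math.nan  # 0/0 is nan
        # nonzero/0 is an infinity whose sign combines both operand signs
        return math.copysign(math.inf, a) * math.copysign(1.0, b)


def grad_guarded(x: float, dy: float, dx: float) -> float:
    """Emulates the "Good" IR: the update is computed always, but stored only under `if cond`."""
    cond = x > 0.0
    saved_x = x if cond else 0.0  # assumption: the alloca is written only inside the branch
    candidate = dx + -ieee_div(dy, saved_x * saved_x)  # may be nan or inf when cond is false
    return candidate if cond else dx  # guarded global store


def grad_select(x: float, dy: float, dx: float) -> float:
    """Emulates the "Bad" IR: `select` masks the adjoint, then the store is unconditional."""
    cond = x > 0.0
    adj = dy if cond else 0.0  # select(cond, dy, 0.0)
    return dx + -ieee_div(adj, x * x)  # 0/0 = nan when x == 0, and it is always stored
```

With `x = 0.0`, `dy = 1.0`, and `dx = 0.0`, `grad_guarded` leaves the gradient at `0.0`, while `grad_select` returns nan. For `x = 2.0` both agree on `-0.25`. The sketch suggests that masking the adjoint with `select` is not enough on its own: the division still runs on the untaken path, so its result has to be masked or the store has to be guarded.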