terralang / terra

Terra is a low-level system programming language that is embedded in and meta-programmed by the Lua programming language.
terralang.org

Colon (":") in naming of anonymous functions breaks emscripten #536

Open PyryM opened 2 years ago

PyryM commented 2 years ago

TL;DR: Terra gives anonymous functions names like $anon (junk/wasm_helloworld.t:7). These names contain special characters (specifically the :) that break the way Emscripten parses symbol names.

To reproduce: First (with Terra on llvm10+ and Emscripten installed) compile to wasm32 bitcode:

terralib.includepath = "" -- no default includes from system libc
local eminclude = os.getenv("EMSDK") .. "/upstream/emscripten/cache/sysroot/include"
local target = terralib.newtarget{Triple = "wasm32"}
local cio = terralib.includec("stdio.h", {"-I", eminclude}, target)

local foo = terra(i: int32)
  cio.printf("helloworld %d!\n", i)
end

terra helloworld_main(): int32
  for i = 0, 10 do foo(i) end
  return 0
end

terralib.saveobj("helloworld.bc", {main=helloworld_main}, nil, target, false)

Now try to link with emscripten:

emcc helloworld.bc -sALLOW_MEMORY_GROWTH=1 -o helloworld.html

Traceback (most recent call last):
  File "/home/anon/emsdk/upstream/emscripten/emcc.py", line 3982, in <module>
    sys.exit(main(sys.argv))
  [...traceback skipped]
  File "/home/anon/emsdk/upstream/emscripten/tools/building.py", line 574, in parse_llvm_nm_symbols
    status = line[entry_pos + 11] # Skip address, which is always fixed-length 8 chars.
IndexError: string index out of range

Why? Emscripten gets symbol names by calling llvm-nm --print-file-name helloworld.bc and parsing each line using colons as delimiters:

# Line format: "[archive filename:]object filename: address status name"
entry_pos = line.rfind(':') # finds *last* colon

But terra has produced this:

llvm-nm --print-file-name helloworld.bc 

helloworld.bc: -------- t $anon (junk/wasm_helloworld.t:7)
helloworld.bc: -------- T main
helloworld.bc:          U printf

Emscripten incorrectly splits the line helloworld.bc: -------- t $anon (junk/wasm_helloworld.t:7) because rfind(':') finds the colon inside the symbol name rather than the one after the file name, so entry_pos + 11 points past the end of the line.
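The failure can be reproduced in isolation from Emscripten. The following is a minimal sketch of the indexing step quoted above; parse_status is a hypothetical helper name used only for illustration, not a function in Emscripten, and the two input lines are copied from the llvm-nm output shown earlier.

```python
# Sketch of the parsing step quoted above. parse_status is a hypothetical
# helper for illustration, not part of Emscripten.
def parse_status(line: str) -> str:
    entry_pos = line.rfind(':')  # last colon anywhere in the line
    # Offsets from the colon: +1 space, +2..+9 address, +10 space, +11 status
    return line[entry_pos + 11]

good = "helloworld.bc: -------- T main"
bad = "helloworld.bc: -------- t $anon (junk/wasm_helloworld.t:7)"

print(parse_status(good))  # status character for `main`

try:
    parse_status(bad)
except IndexError as err:
    # rfind(':') landed on the colon before "7)", leaving only two
    # characters after it, so the index runs off the end of the string.
    print("IndexError:", err)
```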

Workaround: It's possible to avoid the issue by making sure every terra function is named, using func:setname(...) as needed.

Fix?: Arguably this is Emscripten's fault for trying to parse human-readable tool output rather than using actual structured APIs, and for not even robustly parsing that output.
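For comparison, here is one way the text output could be parsed without depending on the last colon. This is only a sketch under stated assumptions, not Emscripten's code: NM_LINE and parse_nm_line are hypothetical names, and the pattern assumes the fixed-width 8-character address field described in the quoted comment (as seen for wasm32).

```python
import re

# Hypothetical pattern: file prefix, then ": ", a fixed-width 8-character
# address field (hex digits, dashes, or blanks), a one-character status,
# and the rest of the line as the symbol name.
NM_LINE = re.compile(
    r"^(?P<file>.*?): "
    r"(?P<addr>[0-9A-Fa-f]{8}|-{8}| {8}) "
    r"(?P<status>\S) "
    r"(?P<name>.*)$"
)

def parse_nm_line(line: str):
    """Return (file, status, name), or None if the line does not match."""
    m = NM_LINE.match(line)
    if m is None:
        return None
    return m.group("file"), m.group("status"), m.group("name")

lines = [
    "helloworld.bc: -------- t $anon (junk/wasm_helloworld.t:7)",
    "helloworld.bc: -------- T main",
    "helloworld.bc: " + " " * 8 + " U printf",
]
for line in lines:
    print(parse_nm_line(line))
```

Anchoring on the address field keeps colons in the symbol name out of the split. A file name that itself contains something resembling ": --------" would still confuse it, which is part of why a structured API is the more robust option.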

It might make sense, though, on the Terra side to give anonymous functions more sanitized names (i.e., without spaces, colons, or parentheses) because there are likely a number of tools that expect symbol names in bitcode to be limited to C/C++ naming rules.
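As a rough illustration of what "sanitized" could mean here, the sketch below maps any character outside C identifier rules to an underscore. It is written in Python purely to show the transformation; sanitize_symbol is a hypothetical name, and nothing like it is claimed to exist in Terra.

```python
import re

def sanitize_symbol(name: str) -> str:
    """Replace every character that is not valid in a C identifier."""
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", name)
    # C identifiers may not be empty or start with a digit.
    if not cleaned or cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned

print(sanitize_symbol("$anon (junk/wasm_helloworld.t:7)"))
```

One caveat with this approach: distinct original names can collapse to the same sanitized name (for example, a.t:7 and a_t_7), so a real implementation would likely need a uniquifying suffix.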

velartrill commented 2 years ago

this is definitely an Emscripten bug (note that file names with colons, which are perfectly legal on linux, would trigger this bug as well!), and if terra is to be tweaked to add a workaround, it should be optional imo. i would suggest either a terralib.saveobj flag/environment variable to use hashes in the generated anonymous names, or a way to customize the format (e.g. you pass a function that takes a terra function object and returns an appropriate name)
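The hash-based option could look roughly like the sketch below. It is illustrative only: hashed_anon_name, the anon_ prefix, and the 12-character truncation are assumptions made for this example and do not correspond to any existing Terra option.

```python
import hashlib

def hashed_anon_name(location: str, prefix: str = "anon_") -> str:
    """Derive a symbol-safe name from a source location string."""
    digest = hashlib.sha1(location.encode("utf-8")).hexdigest()
    # Hex digits only, so the result contains no spaces, colons, or parentheses.
    return prefix + digest[:12]

print(hashed_anon_name("junk/wasm_helloworld.t:7"))
```

The trade-off is readability: a hashed name no longer tells you which file and line the function came from, which is one argument for making the behavior optional.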

elliottslaughter commented 2 years ago

Can we at least check with the Emscripten developers to see what their outlook is on this one? Since a workaround is available on our end, I don't think we need to rush the fix.

PyryM commented 2 years ago

Yes, there's an existing issue: https://github.com/emscripten-core/emscripten/issues/15325

sbc100 commented 2 years ago

If you compile the .bc file to an object file (emcc -c hello_world.bc -o hello_world.o), does it still contain those non-standard symbols?

Are these non-standard symbols only ever internal/local symbols (i.e., do they always have lowercase tags when output by nm)?