Open esc opened 1 month ago
Summary of all no-source use-cases. (Comment will be edited).
Links to useful utilities to consider. (Comment will be edited).
Here is a possible way to augment exec in such a way that it remembers the source code:
import inspect
import linecache

def smart_exec(source, *args):
    if isinstance(source, str):
        n = smart_exec.n
        smart_exec.n += 1
        filename = f'<smart_exec {n}>'
        lines = [(x + "\n") for x in source.splitlines()]
        linecache.cache[filename] = (1, None, lines, filename)
        obj = compile(source, filename, 'exec')
    else:
        obj = source
    return exec(obj, *args)
smart_exec.n = 0

src = """
def foo(x, y):
    return x + y
"""
d = {}
#exec(src, d) # source code not available
smart_exec(src, d) # source code available
foo = d['foo']
print(inspect.getsource(foo))
The trick is that both inspect and traceback rely on linecache to get the source code.
It is used by the good old py lib, in particular for py.code.Source:
https://github.com/pytest-dev/py/blob/master/py/_code/source.py#L193-L198
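To illustrate the traceback half of that claim, here is a minimal sketch that registers source lines in linecache by hand and then checks that a formatted traceback shows the failing line. The filename `<linecache-demo>` and the function `boom` are made up for this example; writing to `linecache.cache` directly relies on a CPython implementation detail, as smart_exec above does.

```python
import linecache
import traceback

source = "def boom():\n    raise ValueError('bad value')\n"
filename = "<linecache-demo>"  # hypothetical name, only used as a cache key

# Cache entry layout: (size, mtime, lines, fullname).
# mtime=None makes linecache.checkcache() leave the entry alone.
linecache.cache[filename] = (
    len(source),
    None,
    source.splitlines(keepends=True),
    filename,
)

namespace = {}
exec(compile(source, filename, "exec"), namespace)

try:
    namespace["boom"]()
except ValueError:
    # traceback looks up the source line through linecache
    report = traceback.format_exc()

print(report)
```

Without the `linecache.cache` assignment, the traceback would still list the frame in `<linecache-demo>` but could not show the text of the failing line.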
How to use it for numba is an open question. I suspect that you can either:
- patch builtins.exec and hope for the best
- provide numba.exec and require people to use it in case they want to exec() code which contains @numba.jit functions
> Summary of all no-source use-cases. (Comment will be edited). [...] pickle
I'm curious: how do you end up in a no-source-code situation with pickle?
Numba functions can be pickled (cloudpickle, to be precise) for remote execution (most commonly, with Dask), and will be re-compiled on the target system in case it does not match the client.
@sklam will have to remind me if there's a way to avoid this situation by always using LLVM IR, and if we are sure we don't ever have to go back to bytecode for compilation. I've got some vague recollection of possible issues with embedded symbol addresses, but maybe we've fixed those so caching works better as well.
> Numba functions can be pickled (cloudpickle, to be precise) for remote execution (most commonly, with Dask), and will be re-compiled on the target system in case it does not match the client.
OK, but if the pickle comes from pickling an already-compiled numba function, then we have control over what goes into it, and we can "just" make it include all the data necessary for recompilation (e.g., the source code itself or some form of IR).
Yeah, I think that will cover the most common case. I don't know if anyone is applying the Numba decorator to functions after unpickling on the destination. That seems unlikely, but the user base is big enough that I don't know. 😅
> I don't know if anyone is applying the Numba decorator to functions after unpickling on the destination.
I think that this case is also covered, because pickled functions are just stored as a module.funcname pair, so the function must exist on the other side anyway.
Example:
import pickle
import pickletools

def foo(x, y):
    pass

s = pickle.dumps(foo)
pickletools.dis(s)
$ python /tmp/pickletest.py
    0: \x80 PROTO      4
    2: \x95 FRAME      20
   11: \x8c SHORT_BINUNICODE '__main__'
   21: \x94 MEMOIZE    (as 0)
   22: \x8c SHORT_BINUNICODE 'foo'
   27: \x94 MEMOIZE    (as 1)
   28: \x93 STACK_GLOBAL
   29: \x94 MEMOIZE    (as 2)
   30: .    STOP
highest protocol among opcodes = 4
> I think that this case is also covered, because pickled functions are just stored as a module.funcname pair, so the function must exist on the other side anyway.
Dask uses cloudpickle, which supports loading from serialized bytecode if the host process determines the code object to be dynamic.
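To make the no-source situation concrete, here is a stdlib-only sketch of what shipping a function as bytecode implies. It uses marshal and types.FunctionType to mimic the idea; it is not cloudpickle's actual implementation, and the names `add`, `remote`, and `<dynamic-demo>` are invented for the example.

```python
import inspect
import marshal
import types

# "Sender" side: a dynamically defined function (no file on disk backs it).
source = "def add(x, y):\n    return x + y\n"
namespace = {}
exec(compile(source, "<dynamic-demo>", "exec"), namespace)
original = namespace["add"]

# Ship only the code object, i.e. the bytecode plus its metadata.
payload = marshal.dumps(original.__code__)

# "Receiver" side: rebuild a callable from the bytecode alone.
# The globals dict is a stand-in for the remote process namespace.
rebuilt = types.FunctionType(marshal.loads(payload), {"__name__": "remote"})
result = rebuilt(2, 3)

# The function runs, but no source text travelled with it.
try:
    inspect.getsource(rebuilt)
    has_source = True
except (OSError, TypeError):
    has_source = False

print(result, has_source)
```

The rebuilt function is callable, yet anything that needs its source text (such as a Source/AST frontend) has nothing to work with unless the sender also includes the source in the payload.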
> @sklam will have to remind me if there's a way to avoid this situation by always using LLVM IR, and if we are sure we don't ever have to go back to bytecode for compilation. I've got some vague recollection of possible issues with embedded symbol addresses, but maybe we've fixed those so caching works better as well.
It is doable to just use LLVM IR or even machine code. It is the same problem as the disk cache. However, Numba currently only transfers the function bytecode with a list of already compiled signatures. Recompilation from bytecode occurs on the unpickling machine.
> > @sklam will have to remind me if there's a way to avoid this situation by always using LLVM IR, and if we are sure we don't ever have to go back to bytecode for compilation. I've got some vague recollection of possible issues with embedded symbol addresses, but maybe we've fixed those so caching works better as well.
> It is doable to just use LLVM IR or even machine code. It is the same problem as the disk cache. However, Numba currently only transfers the function bytecode with a list of already compiled signatures. Recompilation from bytecode occurs on the unpickling machine.
I'm not sure that it is quite the same problem as a local disk cache, because a cluster might be heterogeneous in architecture? Recompilation from LLVM IR can only really occur if the LLVM IR does not contain architectural details and is only likely to achieve effective performance if it hasn't already been optimised towards the details of some hardware (e.g. vector width). PIXIE hits these issues just within the same ISA.
This issue collects information for and plans towards creating a Source/AST frontend for Numba.
Context: Numba targets CPython bytecode because the source code is not available under all circumstances. Unfortunately, this causes a significant maintenance burden for the project, since bytecode is not a stable interface designed to be targeted by third-party applications. The Numba project currently requires about three to six person-months every year to adapt to the bytecode changes introduced with each annual CPython minor release. A Source/AST frontend would significantly reduce this maintenance burden.
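As a rough illustration of the two possible inputs, the sketch below looks at the same function once as an AST and once as bytecode. It only uses the stdlib ast and dis modules and says nothing about how Numba's frontend is or would be implemented; the function `add` and the filename `<ast-demo>` are invented for the example.

```python
import ast
import dis

source = "def add(x, y):\n    return x + y\n"

# Source/AST view: the tree is described by the documented ast grammar.
tree = ast.parse(source)
node_types = [type(node).__name__ for node in ast.walk(tree)]

# Bytecode view: compile the same tree and list the opcode names.
namespace = {}
exec(compile(tree, "<ast-demo>", "exec"), namespace)
func = namespace["add"]
opnames = [instr.opname for instr in dis.get_instructions(func)]

print(node_types)
# Opcode names depend on the interpreter version: for example, the
# addition is BINARY_ADD up to CPython 3.10 and BINARY_OP from 3.11 on.
print(opnames)
```

The AST for `x + y` is a BinOp node with an Add operator on every supported Python version, while the opcode list printed at the end differs between CPython releases.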