Closed: m0g1cian closed this issue 1 month ago
@m0g1cian opened an upstream issue: https://github.com/numba/numba/issues/9542

Per the thread, this appears to be an upstream bug on the numba side: UnicodeCharSeq has trouble handling a leading null byte (\x00).

There are a few options here:
- 1) Write an upstream fix for numba's UnicodeCharSeq handling (https://github.com/numba/numba/issues/9542#issuecomment-2086887203)
- 2) Hacky: I haven't analyzed whether this would work for sure, but we might consider moving the null byte to the end of the string, so that "三" (b'\xe4\xb8\x89') becomes b'\xe4\xb8\x89\x00'. This would require additional complexity in the form of pre- and post-processing.
- 3) Operate on int arrays instead of UnicodeCharSeq. A bit hackier, but a modification of @m0g1cian's reproduction script here https://github.com/numba/numba/issues/9542#issuecomment-2079477926 would be as follows:

    import numba
    import numpy as np
    from numba.cpython.charseq import unicode_charseq_get_code

    @numba.njit
    def function():
        s = np.empty(3, dtype="<U1")
        s[0] = "一"
        s[1] = "二"
        s[2] = "三"
        return [unicode_charseq_get_code(item, 0) for item in s]

    result = function()
    print(result)

Output:

    [19968, 20108, 32]
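As a rough illustration of the idea behind option 3, the sketch below maps single-character symbols to integer code points and back in plain Python, so no fixed-width string type is involved. The helper names `to_code_points` and `from_code_points` are made up for this example and are not part of outlines or numba. Whether this approach fits outlines' FSM code is untested.

```python
def to_code_points(symbols):
    """Pre-processing: map each single-character symbol to its integer code point."""
    return [ord(symbol) for symbol in symbols]


def from_code_points(codes):
    """Post-processing: map integer code points back to single-character strings."""
    return [chr(code) for code in codes]


symbols = ["一", "二", "三"]
codes = to_code_points(symbols)
print(codes)  # [19968, 20108, 19977]
print(from_code_points(codes) == symbols)  # True
```

Integer arrays avoid the question of how a null byte inside a character is interpreted, at the cost of converting at every boundary where strings enter or leave the compiled code.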
I made a local patch to fix this issue in outlines. It basically makes numba typed Dict or List always use unicode_type rather than unicode_charseq. I'll make a PR soon.
Describe the issue as clearly as possible:
Update 2
Can confirm there's something wrong with Numba's Typed Dict implementation. Check issue here
Update
When outlines builds BetterFSM from a reference FSM (e.g. from interegular), if the reference FSM contains the Chinese character "一", the corresponding numba.typed.Dict used by BetterFSM::alphabet_symbol_map somehow converts this character into an empty string, causing a KeyError whenever __getitem__ is triggered.

Steps/code to reproduce the bug:
debug_keyerror.py
Some insight:
- print (k, v) in alphabet_symbol_mapping_items before create_fsm_info() (right after outlines.fsm.regex.py::96)
- print (k, v) in alphabet_symbol_mapping_items in create_fsm_info() when building alphabet_symbol_map (right after outlines.fsm.regex.py::139)

Expected result:
I was able to get the expected result after tweaking two places:
- outlines.fsm.regex.py::112: change nb_unichar_2_type = numba.types.UnicodeCharSeq(2) to nb_unichar_2_type = numba.types.unicode_type
- outlines.fsm.regex.py::89: change alphabet_symbol_mapping_items to a simple Python list: alphabet_symbol_mapping_items = list((k,v) for k, v in self.alphabet._symbol_mapping.items() if k != anything_else)
Error message:
Outlines/Python version information:
Version information
Context for the issue:
I'm not sure why only the Chinese character "一" breaks everything while other Chinese characters work fine as far as I can tell.
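One possible explanation, in line with the upstream thread's diagnosis of a leading null byte: "一" is U+4E00, whose low byte is zero, so a little-endian 4-byte encoding of it starts with \x00. The sketch below assumes little-endian UTF-32 storage, which is what NumPy's "<U1" dtype uses. The helper name `starts_with_null_byte` is made up for this example, and whether this is exactly the code path that fails inside numba is not verified here.

```python
def starts_with_null_byte(ch):
    """Return True if the little-endian UTF-32 encoding of ch begins with \\x00."""
    return ch.encode("utf-32-le")[0] == 0


for ch in ["一", "二", "三"]:
    # Show the code point, its raw 4-byte encoding, and the leading-null check.
    print(ch, hex(ord(ch)), ch.encode("utf-32-le"), starts_with_null_byte(ch))
```

Only "一" (0x4e00) reports True here. "二" (0x4e8c) and "三" (0x4e09) have non-zero low bytes, which would match the observation that other Chinese characters appear unaffected.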