Closed trister95 closed 1 year ago
I can reproduce the problem with a simple text file: feeding it twice, as you said, makes ucto crash with a segfault (which is not something that should ever happen).
It seems there are some loose ends we need to solve if we want to call tokenize()
successively. I now wonder if it used to work or if this bug was always there. What you could do as a workaround in the meantime is simply to reinstantiate the tokenizer for each run:
import ucto

configurationfile_ucto = "tokconfig-nld-historical"
files = ["test.txt", "test.txt"]
for f in files:
    # a fresh Tokenizer per file sidesteps the stale state that triggers the crash
    tokenizer = ucto.Tokenizer(configurationfile_ucto, foliaoutput=True)
    tokenizer.tokenize(f, "/tmp/")
This is a bit less performant due to the added initialization time on every iteration, but hopefully still manageable.
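As a sketch (the helper name and the injected factory are my own, not part of python-ucto), the workaround can be wrapped so the per-file reinstantiation lives in one place:

```python
def tokenize_all(files, outdir, make_tokenizer):
    """Tokenize each file with a freshly built tokenizer.

    Workaround for the crash on repeated tokenize() calls: a new
    instance per file carries no state left over from the previous
    run, at the cost of re-reading the configuration each time.
    """
    for f in files:
        tokenizer = make_tokenizer()  # fresh instance, no stale state
        tokenizer.tokenize(f, outdir)
```

It could then be called as, e.g., `tokenize_all(files, "/tmp/", lambda: ucto.Tokenizer(configurationfile_ucto, foliaoutput=True))`.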
As to the crash, I produced the following backtrace so that we (@kosloot and I?) can debug and fix it:
(gdb) bt
#0 folia::processor::generate_id (this=this@entry=0x5555556da9c0, prov=prov@entry=0x0, name="uctodata") at folia_provenance.cxx:168
#1 0x00007ffff68d762d in folia::processor::processor (this=this@entry=0x5555556da9c0, prov=0x0, parent=parent@entry=0x5555556d3df0,
atts_in=...) at folia_provenance.cxx:274
#2 0x00007ffff6897507 in folia::Document::add_processor (this=this@entry=0x5555556980d0, args=...,
parent=parent@entry=0x5555556d3df0) at folia_document.cxx:1068
#3 0x00007ffff7e1ba71 in Tokenizer::TokenizerClass::add_provenance_data (this=this@entry=0x7ffff774a020,
doc=doc@entry=0x5555556980d0, parent=parent@entry=0x0) at tokenize.cxx:533
#4 0x00007ffff7e1c182 in Tokenizer::TokenizerClass::add_provenance_setting (this=this@entry=0x7ffff774a020,
doc=doc@entry=0x5555556980d0, parent=parent@entry=0x0) at tokenize.cxx:603
#5 0x00007ffff7e1ccd8 in Tokenizer::TokenizerClass::start_document (this=this@entry=0x7ffff774a020, id="untitled")
at tokenize.cxx:663
#6 0x00007ffff7e28dd8 in Tokenizer::TokenizerClass::tokenize (this=this@entry=0x7ffff774a020, IN=...) at tokenize.cxx:937
#7 0x00007ffff7e2933f in Tokenizer::TokenizerClass::tokenize (this=0x7ffff774a020, IN=..., OUT=...) at tokenize.cxx:1007
#8 0x00007ffff7e2d9f2 in Tokenizer::TokenizerClass::tokenize (this=this@entry=0x7ffff774a020, ifile="test.txt", ofile="/tmp/")
at tokenize.cxx:999
#9 0x00007ffff7e7d706 in __pyx_pf_4ucto_9Tokenizer_2tokenize (__pyx_v_outputfile=<optimized out>, __pyx_v_inputfile=<optimized out>,
__pyx_v_self=0x7ffff774a010) at ucto_wrapper.cpp:3694
#10 __pyx_pw_4ucto_9Tokenizer_3tokenize (__pyx_v_self=0x7ffff774a010, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>)
at ucto_wrapper.cpp:3649
#11 0x00007ffff7b57f4c in method_vectorcall_VARARGS_KEYWORDS (func=<optimized out>, args=0x7ffff7745bb0, nargsf=<optimized out>,
kwnames=<optimized out>) at Objects/descrobject.c:344
#12 0x00007ffff7b4676a in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7ffff7745bb0,
callable=0x7ffff6d7fe70, tstate=0x55555555e480) at ./Include/cpython/abstract.h:114
#13 PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7ffff7745bb0, callable=0x7ffff6d7fe70)
at ./Include/cpython/abstract.h:123
#14 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, trace_info=0x7fffffffd3c0,
tstate=<optimized out>) at Python/ceval.c:5891
#15 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=0x7ffff7745a40, throwflag=<optimized out>) at Python/ceval.c:4198
#16 0x00007ffff7b44f80 in _PyEval_EvalFrame (throwflag=0, f=0x7ffff7745a40, tstate=0x55555555e480)
at ./Include/internal/pycore_ceval.h:46
#17 _PyEval_Vector (tstate=tstate@entry=0x55555555e480, con=con@entry=0x7fffffffd4c0, locals=locals@entry=0x7ffff6d41dc0,
args=args@entry=0x0, argcount=argcount@entry=0, kwnames=kwnames@entry=0x0) at Python/ceval.c:5065
#18 0x00007ffff7bf39e4 in PyEval_EvalCode (co=0x7ffff6d3f470, globals=0x7ffff6d41dc0, locals=0x7ffff6d41dc0) at Python/ceval.c:1134
#19 0x00007ffff7c04383 in run_eval_code_obj (tstate=tstate@entry=0x55555555e480, co=co@entry=0x7ffff6d3f470,
globals=globals@entry=0x7ffff6d41dc0, locals=locals@entry=0x7ffff6d41dc0) at Python/pythonrun.c:1291
#20 0x00007ffff7bffaea in run_mod (mod=mod@entry=0x5555555de300, filename=filename@entry=0x7ffff6d2faf0,
globals=globals@entry=0x7ffff6d41dc0, locals=locals@entry=0x7ffff6d41dc0, flags=flags@entry=0x7fffffffd6a8,
arena=arena@entry=0x7ffff771fb90) at Python/pythonrun.c:1312
#21 0x00007ffff7aa223f in pyrun_file (fp=fp@entry=0x55555555a470, filename=filename@entry=0x7ffff6d2faf0, start=start@entry=257,
globals=globals@entry=0x7ffff6d41dc0, locals=locals@entry=0x7ffff6d41dc0, closeit=closeit@entry=1, flags=0x7fffffffd6a8)
at Python/pythonrun.c:1208
#22 0x00007ffff7aa1ef0 in _PyRun_SimpleFileObject (fp=0x55555555a470, filename=0x7ffff6d2faf0, closeit=1, flags=0x7fffffffd6a8)
at Python/pythonrun.c:456
#23 0x00007ffff7aa28a3 in _PyRun_AnyFileObject (fp=0x55555555a470, filename=0x7ffff6d2faf0, closeit=1, flags=0x7fffffffd6a8)
at Python/pythonrun.c:90
#24 0x00007ffff7c10b5d in pymain_run_file_obj (skip_source_first_line=0, filename=0x7ffff6d2faf0, program_name=0x7ffff77cb140)
at Modules/main.c:353
#25 pymain_run_file (config=0x5555555855a0) at Modules/main.c:372
#26 pymain_run_python (exitcode=0x7fffffffd6a4) at Modules/main.c:587
#27 Py_RunMain () at Modules/main.c:666
#28 0x00007ffff7be4f3b in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:720
#29 0x00007ffff783c790 in __libc_start_call_main (main=main@entry=0x555555555120 <main>, argc=argc@entry=2,
argv=argv@entry=0x7fffffffd8d8) at ../sysdeps/nptl/libc_start_call_main.h:58
#30 0x00007ffff783c84a in __libc_start_main_impl (main=0x555555555120 <main>, argc=2, argv=0x7fffffffd8d8, init=<optimized out>,
fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffd8c8) at ../csu/libc-start.c:360
#31 0x0000555555555045 in _start ()
I changed the title a bit, I know you meant "kernel" to refer to the jupyter kernel, but people might misunderstand and think the entire linux kernel crashed because of ucto, that'd be quite a feat ;)
Ok, this is definitely a bug in ucto itself. I can reproduce it without Python. It seems to be a problem inside the tokenize(string, string) function. Needs some investigation.
Some data was not reset on next invocation of tokenize(). Should be fixed now in Ucto.
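For anyone curious about the bug class: the backtrace shows `generate_id` being called with `prov=0x0`, i.e. provenance state that existed on the first run was torn down and never rebuilt for the second. This is not the actual ucto code, just a toy Python illustration of "state not reset on the next invocation" and the shape of the fix:

```python
class StatefulTokenizer:
    """Toy model (NOT real ucto code) of a tokenizer whose
    per-document state is consumed by a run but never rebuilt."""

    def __init__(self):
        # stands in for the provenance chain built at construction time
        self.provenance = {"processors": []}

    def _start_document(self):
        # mimics add_provenance_data(): dereferences self.provenance,
        # which is None on a second buggy call (cf. prov=0x0 in frame #0)
        self.provenance["processors"].append("uctodata")

    def tokenize_buggy(self, text):
        self._start_document()
        self.provenance = None  # state torn down, never reset -> crash next time
        return text.split()

    def tokenize_fixed(self, text):
        if self.provenance is None:
            self.provenance = {"processors": []}  # reset state on each invocation
        self._start_document()
        tokens = text.split()
        self.provenance = None
        return tokens
```

The first `tokenize_buggy` call succeeds; the second dereferences `None`, the Python analogue of the segfault. The fixed variant rebuilds the state at the start of every call.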
Nice work! Are we ready for new releases? I guess such a crash warrants a new release quickly.
Thanks a lot for the quick replies! Great work! :)
ucto v0.29 and python-ucto v0.6.5 are now released, resolving this issue.
I am trying to tokenize a bunch of .txt files and store them as FoLiA XML files.
The first file works fine, but after that the (Jupyter) kernel crashes.
A little bit more info:
Am I doing something wrong, or is there a bug here?