proycon / python-ucto

This is a Python binding to the tokeniser Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet it is not always as trivial a task as it appears to be. This binding makes the power of the ucto tokeniser available to Python. Ucto itself is a regular-expression-based, extensible, and advanced tokeniser written in C++ (http://ilk.uvt.nl/ucto).
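For reference, basic usage of the binding looks like this; a minimal sketch based only on the API used later in this thread, with a placeholder configuration name and file paths:

import ucto

# Assumes a standard English configuration is installed; substitute the
# configuration for your language.
tokenizer = ucto.Tokenizer("tokconfig-eng", foliaoutput=True)

# Tokenize a plain-text input file and write a FoLiA XML document.
tokenizer.tokenize("input.txt", "output.folia.xml")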

Segmentation fault when tokenizer.tokenize() is used repeatedly #16

Closed · trister95 closed this issue 1 year ago

trister95 commented 1 year ago

I am trying to tokenize a bunch of .txt files and store them as folia.xml files.

The first file works fine, but after that the kernel crashes.

A little bit more info:

import ucto
configurationfile_ucto = "tokconfig-nld-historical"

tokenizer = ucto.Tokenizer(configurationfile_ucto, foliaoutput=True)

for f in list_with_paths_to_exact_same_files:
    tokenizer.tokenize(f, output_path)

Am I doing something wrong, or is there a bug here?

proycon commented 1 year ago

I can reproduce the problem: with a simple text file fed twice as you described, ucto crashes with a segfault (which is not something that should ever happen).

It seems there are some loose ends we need to tie up before tokenize() can be called successively. I now wonder whether it used to work or whether this bug was always there. What you could do as a workaround in the meantime is simply reinstantiate the tokenizer for each run:

import ucto
configurationfile_ucto = "tokconfig-nld-historical"

files = ["test.txt", "test.txt"]
for f in files:
    tokenizer = ucto.Tokenizer(configurationfile_ucto, foliaoutput=True)
    tokenizer.tokenize(f, "/tmp/")

This is a bit less performant due to the added initialization time on every iteration, but hopefully still manageable.
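To gauge whether that overhead matters for your workload, here is a rough timing sketch (assuming the same configuration and file list as the workaround above):

import time
import ucto

configurationfile_ucto = "tokconfig-nld-historical"
files = ["test.txt", "test.txt"]

start = time.perf_counter()
for f in files:
    # Reinstantiating per file sidesteps the segfault, at the cost of
    # reloading the configuration on every iteration.
    tokenizer = ucto.Tokenizer(configurationfile_ucto, foliaoutput=True)
    tokenizer.tokenize(f, "/tmp/")
elapsed = time.perf_counter() - start
print(f"tokenized {len(files)} files in {elapsed:.2f}s")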

As to the crash, I produced the following traceback so we (@kosloot and I?) can debug and fix it:

(gdb) bt
#0  folia::processor::generate_id (this=this@entry=0x5555556da9c0, prov=prov@entry=0x0, name="uctodata") at folia_provenance.cxx:168
#1  0x00007ffff68d762d in folia::processor::processor (this=this@entry=0x5555556da9c0, prov=0x0, parent=parent@entry=0x5555556d3df0, 
    atts_in=...) at folia_provenance.cxx:274
#2  0x00007ffff6897507 in folia::Document::add_processor (this=this@entry=0x5555556980d0, args=..., 
    parent=parent@entry=0x5555556d3df0) at folia_document.cxx:1068
#3  0x00007ffff7e1ba71 in Tokenizer::TokenizerClass::add_provenance_data (this=this@entry=0x7ffff774a020, 
    doc=doc@entry=0x5555556980d0, parent=parent@entry=0x0) at tokenize.cxx:533
#4  0x00007ffff7e1c182 in Tokenizer::TokenizerClass::add_provenance_setting (this=this@entry=0x7ffff774a020, 
    doc=doc@entry=0x5555556980d0, parent=parent@entry=0x0) at tokenize.cxx:603
#5  0x00007ffff7e1ccd8 in Tokenizer::TokenizerClass::start_document (this=this@entry=0x7ffff774a020, id="untitled")
    at tokenize.cxx:663
#6  0x00007ffff7e28dd8 in Tokenizer::TokenizerClass::tokenize (this=this@entry=0x7ffff774a020, IN=...) at tokenize.cxx:937
#7  0x00007ffff7e2933f in Tokenizer::TokenizerClass::tokenize (this=0x7ffff774a020, IN=..., OUT=...) at tokenize.cxx:1007
#8  0x00007ffff7e2d9f2 in Tokenizer::TokenizerClass::tokenize (this=this@entry=0x7ffff774a020, ifile="test.txt", ofile="/tmp/")
    at tokenize.cxx:999
#9  0x00007ffff7e7d706 in __pyx_pf_4ucto_9Tokenizer_2tokenize (__pyx_v_outputfile=<optimized out>, __pyx_v_inputfile=<optimized out>, 
    __pyx_v_self=0x7ffff774a010) at ucto_wrapper.cpp:3694
#10 __pyx_pw_4ucto_9Tokenizer_3tokenize (__pyx_v_self=0x7ffff774a010, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>)
    at ucto_wrapper.cpp:3649
#11 0x00007ffff7b57f4c in method_vectorcall_VARARGS_KEYWORDS (func=<optimized out>, args=0x7ffff7745bb0, nargsf=<optimized out>, 
    kwnames=<optimized out>) at Objects/descrobject.c:344
#12 0x00007ffff7b4676a in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7ffff7745bb0, 
    callable=0x7ffff6d7fe70, tstate=0x55555555e480) at ./Include/cpython/abstract.h:114
#13 PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7ffff7745bb0, callable=0x7ffff6d7fe70)
    at ./Include/cpython/abstract.h:123
#14 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, trace_info=0x7fffffffd3c0, 
    tstate=<optimized out>) at Python/ceval.c:5891
#15 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=0x7ffff7745a40, throwflag=<optimized out>) at Python/ceval.c:4198
#16 0x00007ffff7b44f80 in _PyEval_EvalFrame (throwflag=0, f=0x7ffff7745a40, tstate=0x55555555e480)
    at ./Include/internal/pycore_ceval.h:46
#17 _PyEval_Vector (tstate=tstate@entry=0x55555555e480, con=con@entry=0x7fffffffd4c0, locals=locals@entry=0x7ffff6d41dc0, 
    args=args@entry=0x0, argcount=argcount@entry=0, kwnames=kwnames@entry=0x0) at Python/ceval.c:5065
#18 0x00007ffff7bf39e4 in PyEval_EvalCode (co=0x7ffff6d3f470, globals=0x7ffff6d41dc0, locals=0x7ffff6d41dc0) at Python/ceval.c:1134
#19 0x00007ffff7c04383 in run_eval_code_obj (tstate=tstate@entry=0x55555555e480, co=co@entry=0x7ffff6d3f470, 
    globals=globals@entry=0x7ffff6d41dc0, locals=locals@entry=0x7ffff6d41dc0) at Python/pythonrun.c:1291
#20 0x00007ffff7bffaea in run_mod (mod=mod@entry=0x5555555de300, filename=filename@entry=0x7ffff6d2faf0, 
    globals=globals@entry=0x7ffff6d41dc0, locals=locals@entry=0x7ffff6d41dc0, flags=flags@entry=0x7fffffffd6a8, 
    arena=arena@entry=0x7ffff771fb90) at Python/pythonrun.c:1312
#21 0x00007ffff7aa223f in pyrun_file (fp=fp@entry=0x55555555a470, filename=filename@entry=0x7ffff6d2faf0, start=start@entry=257, 
    globals=globals@entry=0x7ffff6d41dc0, locals=locals@entry=0x7ffff6d41dc0, closeit=closeit@entry=1, flags=0x7fffffffd6a8)
    at Python/pythonrun.c:1208
#22 0x00007ffff7aa1ef0 in _PyRun_SimpleFileObject (fp=0x55555555a470, filename=0x7ffff6d2faf0, closeit=1, flags=0x7fffffffd6a8)
    at Python/pythonrun.c:456
#23 0x00007ffff7aa28a3 in _PyRun_AnyFileObject (fp=0x55555555a470, filename=0x7ffff6d2faf0, closeit=1, flags=0x7fffffffd6a8)
    at Python/pythonrun.c:90
#24 0x00007ffff7c10b5d in pymain_run_file_obj (skip_source_first_line=0, filename=0x7ffff6d2faf0, program_name=0x7ffff77cb140)
    at Modules/main.c:353
#25 pymain_run_file (config=0x5555555855a0) at Modules/main.c:372
#26 pymain_run_python (exitcode=0x7fffffffd6a4) at Modules/main.c:587
#27 Py_RunMain () at Modules/main.c:666
#28 0x00007ffff7be4f3b in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:720
#29 0x00007ffff783c790 in __libc_start_call_main (main=main@entry=0x555555555120 <main>, argc=argc@entry=2, 
    argv=argv@entry=0x7fffffffd8d8) at ../sysdeps/nptl/libc_start_call_main.h:58
#30 0x00007ffff783c84a in __libc_start_main_impl (main=0x555555555120 <main>, argc=2, argv=0x7fffffffd8d8, init=<optimized out>, 
    fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffd8c8) at ../csu/libc-start.c:360
#31 0x0000555555555045 in _start ()
proycon commented 1 year ago

I changed the title a bit. I know you meant "kernel" to refer to the Jupyter kernel, but people might misunderstand and think the entire Linux kernel crashed because of ucto; that'd be quite a feat ;)

kosloot commented 1 year ago

Ok, this is definitely a bug in ucto itself; I can reproduce it without Python. It seems to be a problem inside the tokenize(string, string) function. Needs some investigation.

kosloot commented 1 year ago

Some data was not reset on the next invocation of tokenize(). This should be fixed now in ucto.
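A quick regression check, once the fixed ucto is installed, is to rerun the original failing pattern with a single tokenizer instance (a minimal sketch, assuming the same test file and configuration as above):

import ucto

tokenizer = ucto.Tokenizer("tokconfig-nld-historical", foliaoutput=True)

# With the fix, one Tokenizer instance should survive repeated
# tokenize() calls without segfaulting.
for _ in range(2):
    tokenizer.tokenize("test.txt", "/tmp/")
print("repeated tokenize() calls completed without a crash")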

proycon commented 1 year ago

Nice work! Are we ready for new releases? I guess such a crash warrants a quick new release.

trister95 commented 1 year ago

Thanks a lot for the quick replies! Great work! :)

proycon commented 1 year ago

ucto v0.29 and python-ucto v0.6.5 are now released, solving this issue.