Python wrapper: surface text garbled in first call to parseToNode

markcox / mecab

Automatically exported from code.google.com/p/mecab


What steps will reproduce the problem?

$ python
Python 2.7.3 (default, Aug 1 2012, 05:14:39)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> result = ""
>>> import MeCab
>>> t = MeCab.Tagger()
>>> n = t.parseToNode("結晶系は正方晶系。")
>>> result = ""
>>> while n is not None:
...     result += n.surface
...     n = n.next
...
>>> assert result == "結晶系は正方晶系。", repr(result)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError: '\x01rf\xff\xff\xff\xff\xff\xff\xff'
>>>

What is the expected output? What do you see instead?

The assertion should succeed (no exception thrown).

What version of the product are you using? On what operating system?

MeCab version 0.996 on Ubuntu Precise.

Please provide any additional information below.

On my machine the above code always reproduces the problem, but other code structures, such as assigning the text to a variable before parsing or moving the test code into a function definition, cause the test to run correctly. This bug only affects the initial call to a tagger and only if the call is parseToNode. The following incantation is a reliable workaround:

>>> t = MeCab.Tagger()
>>> t.parse("")

The tagger can then be used as normal.
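If it helps, the workaround can be wrapped in a small helper so the throwaway call is never forgotten. This is only a sketch: `make_warm_tagger` is a made-up name, and the helper takes the tagger class as an argument so it does not assume anything about MeCab beyond the `parse` method used above.

```python
def make_warm_tagger(factory, *args):
    """Build a tagger and make one throwaway parse("") call on it.

    `factory` is whatever constructs the tagger (for example MeCab.Tagger);
    `args` are passed through to it unchanged.
    """
    tagger = factory(*args)
    # Throwaway first call, as in the workaround above, so that the first
    # real parseToNode call is no longer the tagger's initial call.
    tagger.parse("")
    return tagger
```

With MeCab imported, this would be used as `t = make_warm_tagger(MeCab.Tagger)`, after which `t.parseToNode(...)` can be called as normal.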

I've had a look at the source, and I think I've tracked this down to a memory 
bug in mecab itself.

LatticeImpl::set_sentence uses has_request_type() to determine whether it 
should allocate new memory for the sentence or just reuse the memory passed as 
its `sentence' argument. However, the various TaggerImpl::parse* methods all 
call lattice->set_sentence *before* they properly set the request type in the 
lattice (via TaggerImpl::initRequestType()). This means that on each call to a 
tagger parse method the lattice uses the previous call's request type. On the 
first call to a tagger parse method the lattice uses whatever its request_type_ 
is initialised to.

The end result is that on some calls to the tagger parse methods the lattice
incorrectly keeps a pointer to the memory it has been passed instead of
allocating its own copy. The Python wrapper or the Python runtime may
subsequently reuse that memory for something else and overwrite it with new
data. The nodes returned by parseToNode then point at memory that no longer
holds the surface text of the sentence.

The fix should be to call set_sentence after the request type has been set. 
I've attached a patch against the 0.996 source download for mecab. It fixes the 
behaviour in this bug report.
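
To illustrate the ordering problem without the C++ source at hand, here is a toy model in Python. It is not MeCab's code: `ToyLattice`, `ALLOCATE_SENTENCE`, `parse_buggy` and `parse_fixed` are hypothetical names, and a `bytearray` stands in for the caller's buffer that later gets overwritten.

```python
ALLOCATE_SENTENCE = 1  # hypothetical flag, for this sketch only


class ToyLattice(object):
    def __init__(self):
        self.request_type = 0  # stand-in for the initial request_type_
        self.sentence = None

    def set_sentence(self, buf):
        # Copy only if the flag is ALREADY set; otherwise keep a reference
        # to the caller's buffer.
        if self.request_type & ALLOCATE_SENTENCE:
            self.sentence = bytearray(buf)
        else:
            self.sentence = buf


def parse_buggy(lattice, buf):
    lattice.set_sentence(buf)                  # sees the stale request type
    lattice.request_type |= ALLOCATE_SENTENCE  # set too late
    return lattice.sentence


def parse_fixed(lattice, buf):
    lattice.request_type |= ALLOCATE_SENTENCE  # set the request type first
    lattice.set_sentence(buf)
    return lattice.sentence


def surface_after_reuse(parse):
    buf = bytearray(b"abc")
    surface = parse(ToyLattice(), buf)
    buf[:] = b"\xff\xff\xff"  # the caller's memory is reused for other data
    return bytes(surface)


print(repr(surface_after_reuse(parse_buggy)))  # overwritten bytes
print(repr(surface_after_reuse(parse_fixed)))  # original text survives
```

In this model only the first call on a fresh lattice is affected, because the flag is left set afterwards. That is consistent with the parse("") workaround from the original report.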

Original comment by richard....@gmail.com on 19 Mar 2013 at 3:46

Attachments:

request_type.patch

Python wrapper: surface text garbled in first call to parseToNode #5