I've had a look at the source, and I think I've tracked this down to a memory
bug in mecab itself.
LatticeImpl::set_sentence uses has_request_type() to determine whether it
should allocate new memory for the sentence or just reuse the memory passed as
its `sentence' argument. However, the various TaggerImpl::parse* methods all
call lattice->set_sentence *before* they properly set the request type in the
lattice (via TaggerImpl::initRequestType()). This means that on each call to a
tagger parse method the lattice uses the previous call's request type. On the
first call to a tagger parse method the lattice uses whatever its request_type_
is initialised to.
The end result is that the tagger parse methods sometimes make the lattice
reuse the memory it was passed instead of allocating its own copy. The Python
wrapper or Python runtime may later reuse that memory for something else and
overwrite it with new data, after which the nodes returned by parseToNode no
longer point to the surface text of the sentence.
The fix should be to call set_sentence after the request type has been set.
I've attached a patch against the 0.996 source download for mecab. It fixes the
behaviour in this bug report.
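To illustrate the ordering problem, here is a minimal, self-contained sketch. It
is not mecab's code and not the attached patch: ToyLattice, parse_buggy,
parse_fixed, surface_survives_overwrite and the kAllocateSentence flag are all
hypothetical names made up for this example. It only models the idea that
set_sentence decides whether to copy based on the request type that is set at
the moment it is called.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical flag: "the lattice should keep its own copy of the sentence".
enum RequestType { kDefault = 0, kAllocateSentence = 1 };

// Simplified stand-in for the lattice; not mecab's real class.
class ToyLattice {
 public:
  void set_request_type(int type) { request_type_ = type; }
  bool has_request_type(int type) const { return (request_type_ & type) != 0; }

  // Copies the text only if the request type set *right now* asks for it.
  void set_sentence(const char *s, std::size_t len) {
    len_ = len;
    if (has_request_type(kAllocateSentence)) {
      owned_.assign(s, s + len);
      sentence_ = owned_.data();  // lattice owns this memory
    } else {
      sentence_ = s;              // lattice borrows the caller's buffer
    }
  }

  std::string surface() const { return std::string(sentence_, len_); }

 private:
  int request_type_ = kDefault;
  std::vector<char> owned_;
  const char *sentence_ = nullptr;
  std::size_t len_ = 0;
};

// Ordering described in the report: sentence first, request type second,
// so set_sentence sees the stale (here: initial) request type.
void parse_buggy(ToyLattice *lattice, const std::string &text, int type) {
  lattice->set_sentence(text.data(), text.size());
  lattice->set_request_type(type);
}

// Proposed ordering: request type first, then the sentence.
void parse_fixed(ToyLattice *lattice, const std::string &text, int type) {
  lattice->set_request_type(type);
  lattice->set_sentence(text.data(), text.size());
}

// Parses "abc", overwrites the caller's buffer in place (standing in for the
// wrapper or runtime reusing that memory), and reports whether the lattice
// still sees the original text.
bool surface_survives_overwrite(bool fixed) {
  ToyLattice lattice;
  std::string text = "abc";
  if (fixed) {
    parse_fixed(&lattice, text, kAllocateSentence);
  } else {
    parse_buggy(&lattice, text, kAllocateSentence);
  }
  text[0] = 'X';
  return lattice.surface() == "abc";
}
```

In this toy model, surface_survives_overwrite(false) returns false because the
lattice borrowed the caller's buffer on the first call, and
surface_survives_overwrite(true) returns true because the copy is made once the
request type is already in place.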
Original comment by richard....@gmail.com
on 19 Mar 2013 at 3:46
Original issue reported on code.google.com by
richard....@gmail.com
on 18 Mar 2013 at 1:03