scrapinghub / python-crfsuite

A python binding for crfsuite
MIT License

tagger.info() cannot read multi-line matrices #71

Open Dawny33 opened 6 years ago

Dawny33 commented 6 years ago

I'm not sure whether this is a bug or a feature request, but I'm writing it down because other people are likely to run into it.

When the features of a CRF also contain numpy matrices, such as the word2vec vector of a word, tagger.info() fails to parse them because their multi-line string representation does not match the parser's regex pattern. The match returns None, so accessing a group on it raises this error:

Traceback (most recent call last):
  File "test_dumpparser.py", line 8, in <module>
    parser.feed(line.decode('utf8'))
  File "/Library/Python/2.7/site-packages/pycrfsuite/_dumpparser.py", line 62, in feed
    getattr(self, 'parse_%s' % self.state)(line)
  File "/Library/Python/2.7/site-packages/pycrfsuite/_dumpparser.py", line 74, in parse_ATTRIBUTES
    self.result.attributes[m.group(2)] = m.group(1)
AttributeError: 'NoneType' object has no attribute 'group'

This is due to the parsing logic in the _dumpparser.py file.
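
To illustrate the failure mode, here is a minimal sketch. The pattern below is an assumption modeled on the two groups used in the traceback (an attribute id followed by an attribute name); the actual expression in _dumpparser.py may differ, and the dump lines are made up for illustration.

```python
import re

# Assumed pattern: leading whitespace, a numeric id, a colon, then the name.
# This is a guess based on m.group(1) / m.group(2) in the traceback.
ATTRIBUTE_LINE = re.compile(r"\s+(\d+): (.*)")


def parse_attribute_line(line):
    """Return (name, id) for a dump line, or None if it does not match."""
    m = ATTRIBUTE_LINE.match(line)
    if m is None:
        # The parser in the traceback has no such guard, hence the AttributeError.
        return None
    return m.group(2), m.group(1)


# Hypothetical dump: the second attribute name contains a newline, so the
# dump splits it over two physical lines.
dump_lines = "     0: word=hello\n     1: vec=[[0.1 0.2]\n [0.3 0.4]]".split("\n")
results = [parse_attribute_line(line) for line in dump_lines]
```

The continuation line ` [0.3 0.4]]` has no leading id, so the match is None for that line.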

A workaround is to encode the matrix before passing it to the CRF model as a feature, e.g. base64.b64encode(narray).
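
A rough sketch of that encoding step, using only the standard library. The helper name `encode_vector_feature` and the `w2v` prefix are hypothetical; a stdlib `array` stands in for the numpy array here, on the assumption that the numpy array's raw bytes would be passed in the same way.

```python
import base64
from array import array


def encode_vector_feature(name, raw_bytes):
    """Build a single-line feature string from raw vector bytes."""
    # Base64 output contains no whitespace or newlines, so the resulting
    # attribute name stays on one line in the model dump.
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return f"{name}={encoded}"


# Stand-in for a word vector (float32 values).
vector = array("f", [0.25, -1.5, 3.0])
feature = encode_vector_feature("w2v", vector.tobytes())

# Round trip: recover the vector from the feature string.
decoded = array("f")
decoded.frombytes(base64.b64decode(feature.split("=", 1)[1]))
```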

davidsbatista commented 6 years ago

I was getting similar errors when my features included \n or \s characters. Try replacing those with a special token, e.g. #NEWLINE or #SPACE.
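
A minimal sketch of that replacement, assuming it is applied to each feature string before it is handed to the trainer or tagger. The function name `sanitize_feature` is hypothetical, and mapping every non-newline whitespace character to #SPACE is a simplification.

```python
import re


def sanitize_feature(value):
    """Replace whitespace in a feature string with placeholder tokens."""
    # "\n" becomes #NEWLINE; any other whitespace character becomes #SPACE.
    return re.sub(
        r"\s",
        lambda m: "#NEWLINE" if m.group(0) == "\n" else "#SPACE",
        value,
    )


cleaned = sanitize_feature("word=a b\n")
```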

radostyle commented 4 years ago

This is still an issue

aimlnerd commented 2 years ago

This is still an issue, and replacing \n or \s characters is not a solution for me: I need to find the original position of each predicted token in the original text, and once I replace \n with a placeholder token in the original text, that is no longer possible.
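
One possible way to reconcile the two needs, offered only as a sketch: leave the original text untouched, record character offsets while tokenizing, and apply the placeholder replacement only to the feature strings. The tokenization rule and both function names below are assumptions for illustration, not part of python-crfsuite.

```python
import re


def tokens_with_offsets(text):
    """Split text into tokens, keeping (token, start, end) offsets."""
    # Each run of non-whitespace, or each single whitespace character,
    # becomes one token; offsets always refer to the original text.
    return [(m.group(0), m.start(), m.end()) for m in re.finditer(r"\S+|\s", text)]


def feature_for(token):
    """Build a whitespace-free feature string for one token."""
    safe = token.replace("\n", "#NEWLINE").replace(" ", "#SPACE")
    return "token=" + safe


text = "Hello world\nBye"
tokens = tokens_with_offsets(text)
features = [feature_for(tok) for tok, _, _ in tokens]
```

Because the offsets are taken before any replacement, a predicted label at index i can still be mapped back to `text[start:end]` for `tokens[i]`.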