inukshuk / anystyle

Fast citation reference parsing
https://anystyle.io

Error with parser.train() with "type" of model #102

Open a-fent opened 6 years ago

a-fent commented 6 years ago

For a good while, I've been experiencing an intermittent error when calling parser.train. The error is raised at the Wapiti level and looks like this:

   <..> /wapiti-ruby/lib/wapiti/options.rb:154:in `validate!': unknown type: crfxpr (ArgumentError)
   Þ¸Qbød; unknown algorithm: l-bfgseg}¸Qbø 

The strange thing is that exactly the same code will sometimes work fine. So far I've narrowed it down to the fact that the following code prints different strings on different runs:

require 'bundler' 
Bundler.setup # -> hopefully a clean Wapiti 1.0.2
require 'anystyle'
AnyStyle::Dictionary.defaults[:adapter] = :hash # mingw-ruby gdbm is broken
puts AnyStyle.parser.model.options.type

Sometimes it returns "crf" (as expected, I think), sometimes "crf#####", where ##### is a random bunch of characters.
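A minimal sketch of how the corrupted value could be detected, so that a failing run can be logged rather than eyeballed. The helper `trailing_garbage` and the constant `EXPECTED_TYPES` are hypothetical names and not part of AnyStyle or Wapiti. Only "crf" is listed as an expected type because it is the only one confirmed above.

```ruby
# Hypothetical helper, not part of AnyStyle or Wapiti.
# Assumption: "crf" is the only type value confirmed in this thread.
EXPECTED_TYPES = %w[crf].freeze

# Returns the bytes that follow a known type name:
#   ""  when the value is clean,
#   nil when the value does not start with any known name.
def trailing_garbage(value, expected = EXPECTED_TYPES)
  # Compare raw bytes, since a corrupted suffix may not be valid UTF-8.
  raw = value.to_s.b
  match = expected.find { |name| raw.start_with?(name.b) }
  return nil if match.nil?

  raw.byteslice(match.bytesize, raw.bytesize)
end

p trailing_garbage("crf")     # clean value
p trailing_garbage("crfxpr")  # value taken from the error message above
```

The same check could be applied to the string printed by the snippet above (AnyStyle.parser.model.options.type) across several runs, to count how often the suffix is non-empty.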

I'm also getting consistent segfaults with parser.check("foo.xml"), unless the result is 100% correct. They appear to arise from native_label in wapiti/lib/model.rb.

This is all on Windows with MinGW. Do you see anything at all similar?

inukshuk commented 6 years ago

Thanks, I'll take a look at this!

I haven't seen the error myself (I can reproduce segfaults when running check on a dataset containing labels which are not present in the model, but that's definitely unrelated to this).
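As a rough illustration of the unrelated segfault case, a guard like the following could flag labels that a dataset uses but a model does not know, before check is called. `unknown_labels` is a hypothetical helper, the label values are made up for the example, and how the two label lists are obtained from AnyStyle or Wapiti is left open here.

```ruby
require 'set'

# Hypothetical guard, not part of AnyStyle or Wapiti.
# Returns the labels used in the dataset that the model does not know.
def unknown_labels(dataset_labels, model_labels)
  known = model_labels.to_set
  dataset_labels.uniq.reject { |label| known.include?(label) }
end

# Illustrative values only; real labels would come from the model and dataset.
model_labels   = %w[author title date]
dataset_labels = %w[author title isbn author]

missing = unknown_labels(dataset_labels, model_labels)
warn "labels not in model: #{missing.join(', ')}" unless missing.empty?
```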

Is gdbm definitely broken on Windows? Since it's part of the standard library, I was hoping this would be easy to install across platforms.

a-fent commented 6 years ago

Thanks for checking, and for the confirmation.

GDBM is fine on Windows, e.g. with the "standard" Ruby installer. It's just that the builds that come from MinGW have a broken package, and I haven't got round to working out why.