kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Problem with Wapiti model portability on some (rare) Linux machines #43

Closed kermitt2 closed 9 years ago

kermitt2 commented 9 years ago

The Wapiti binary models are not recognized on a few Linux machines.

The error is coming from model.c in Wapiti, when the header of the model is parsed via fscanf:

    278-/* mdl_load:
    279- *   Read back a previously saved model to continue training or start labeling.
    280- *   The returned model is synced and the quarks are locked. You must give to
    281- *   this function an empty model fresh from mdl_new.
    282- */
    283-void mdl_load(mdl_t *mdl, FILE *file) {
    284:    const char *err = "invalid model format";
    285-    uint64_t nact = 0;
    286-    int type;
    287-    if (fscanf(file, "#mdl#%d#%"SCNu64"\n", &type, &nact) == 2) {
    288-        mdl->type = type;
    289-    } else {
    290-        rewind(file);
    291-        if (fscanf(file, "#mdl#%"SCNu64"\n", &nact) == 1)
    292-            mdl->type = 0;
    293-        else
    294-            fatal(err);
    295-    }
    296-    rdr_load(mdl->reader, file);
    297-    mdl_sync(mdl);
    298-    for (uint64_t i = 0; i < nact; i++) {
    299-        uint64_t f;
    300-        double v;
    301-        if (fscanf(file, "%"SCNu64"=%la\n", &f, &v) != 2)
    302-            fatal(err);
    303-        mdl->theta[f] = v;
    304-    }
    305-}
    306-

The header of the model looks like this on the problematic machine:

    > find grobid/grobid-home/models/ -name "*wapiti" -print -exec head -n2 \{} \;
    grobid/grobid-home/models/header/model.wapiti
    #mdl#2#314470
    #rdr#85/29/0

If the model is retrained on the problematic machine, it works. However, the header format looks the same:

    > head -n2 grobid/grobid-home/models/date/model.wapiti
    #mdl#2#262
    #rdr#50/16/0
    12:u00:%x[-3,0],

Users affected by this issue can use CRF++ instead of Wapiti as the JNI CRF engine. CRF++ is a little slower, takes more memory, and uses smaller models (because of the GitHub limit on binary file size), but the results are similar.

In the file grobid-home/config/grobid.properties, simply change:

    grobid.crf.engine=wapiti
    #grobid.crf.engine=crfpp

to

    #grobid.crf.engine=wapiti
    grobid.crf.engine=crfpp

kermitt2 commented 9 years ago

OK, first guess: mdl->type is not serialized in a portable way. In model.h we have:

    int       type;    //       model type

which is serialized in model.c (line 271) with:

    fprintf(file, "#mdl#%d#%"PRIu64"\n", mdl->type, nact);

%d looks suspicious as a portable format specifier. If type were declared as uint64_t, the correct macros for it would be SCNu64 and PRIu64 as well.
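One way to check this hypothesis in isolation is to round-trip a header through the same two format strings. The sketch below is not Wapiti code: `write_header` and `read_header` are hypothetical helper names, and the buffer-based `snprintf`/`sscanf` calls stand in for the `fprintf`/`fscanf` calls on a file.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format a model header with the same format string
 * as the fprintf call quoted above. Returns the number of characters
 * written, or a negative value on error. */
int write_header(char *buf, size_t len, int type, uint64_t nact) {
    return snprintf(buf, len, "#mdl#%d#%" PRIu64 "\n", type, nact);
}

/* Hypothetical helper: parse a header with the first format string used
 * in mdl_load. Returns 1 if both fields were read, 0 otherwise. */
int read_header(const char *buf, int *type, uint64_t *nact) {
    return sscanf(buf, "#mdl#%d#%" SCNu64 "\n", type, nact) == 2;
}
```

Calling `read_header` on the header seen on the problematic machine, `"#mdl#2#314470\n"`, and comparing the result with what `write_header(buf, sizeof buf, 2, 314470)` produces shows whether these two format strings agree on a given platform. If they do, the parsing failure has to come from somewhere else in the file.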

kermitt2 commented 9 years ago

Other hypotheses to test maybe:

rloth commented 9 years ago

Your second hypothesis is the right one. On my machine, the bug was systematic with the locale fr_FR.UTF-8.

Everything went back to normal after simply running: export LC_ALL=C

Thanks for this suggestion!
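The locale dependence can be probed outside Wapiti with a few lines of C. This is a minimal sketch, assuming the weight lines have the `index=value` shape implied by the `"%"SCNu64"=%la\n"` format in mdl_load. The helper names and the sample lines are made up for illustration, and unlike mdl_load the parser also checks that the whole line was consumed, so that a value cut short at the decimal point is reported as a failure.

```c
#include <assert.h>
#include <inttypes.h>
#include <locale.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: parse one "index=value" line where the value is a
 * hexadecimal float (%la). Returns 1 only if both fields were read AND the
 * conversion consumed everything up to the end of the line. */
int parse_weight(const char *line, uint64_t *f, double *v) {
    int n = 0;
    if (sscanf(line, "%" SCNu64 "=%la%n", f, v, &n) != 2)
        return 0;
    return line[n] == '\n' || line[n] == '\0';
}

/* Hypothetical helper: run parse_weight under the numeric locale `name`.
 * Returns -1 if the locale cannot be selected (for example, not installed).
 * Assumes the program otherwise runs in the default "C" locale, which is
 * what it switches back to afterwards. */
int parse_weight_in_locale(const char *name, const char *line,
                           uint64_t *f, double *v) {
    if (setlocale(LC_NUMERIC, name) == NULL)
        return -1;
    int ok = parse_weight(line, f, v);
    setlocale(LC_NUMERIC, "C");
    return ok;
}
```

Comparing `parse_weight_in_locale("C", "12=0x1.8p+1\n", &f, &v)` with the same call using `"fr_FR.UTF-8"` on a machine where that locale is installed shows whether the decimal separator of the locale changes how the value is read.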

kermitt2 commented 9 years ago

Great, thanks a lot Romain! Let's now try to find a way to force the locale in Wapiti, so that the library becomes independent of the environment's locale.

kermitt2 commented 9 years ago

In our Wapiti trunk, the locale is now set to the C locale, using the C locale.h library, before reading and saving a model. See http://en.wikipedia.org/wiki/C_localization_functions. This does not affect the locale of the environment, which remains unchanged.

I tested Grobid after setting the environment locale to fr_FR.UTF-8 and it worked fine, so the problem should be solved with commit bbdea1c613f59fb97ff6615c4e16b75adfb3109b
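For readers who want to apply the same idea elsewhere, forcing the C locale around a read or write can be sketched with `setlocale` from locale.h as follows. This is not the code from the commit above. `locale_force_c` and `locale_restore` are hypothetical names, and restoring the previous locale afterwards is a choice made for this sketch.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: switch the process locale to "C" and return a heap
 * copy of the previous locale name, or NULL on failure. The copy is needed
 * because the string returned by setlocale may be overwritten by a later
 * setlocale call. */
char *locale_force_c(void) {
    const char *cur = setlocale(LC_ALL, NULL);   /* query only */
    if (cur == NULL)
        return NULL;
    size_t len = strlen(cur) + 1;
    char *saved = malloc(len);
    if (saved == NULL)
        return NULL;
    memcpy(saved, cur, len);
    if (setlocale(LC_ALL, "C") == NULL) {
        free(saved);
        return NULL;
    }
    return saved;
}

/* Hypothetical helper: restore the locale saved by locale_force_c and
 * release the copy. Passing NULL is a no-op. */
void locale_restore(char *saved) {
    if (saved != NULL) {
        setlocale(LC_ALL, saved);
        free(saved);
    }
}
```

A caller would wrap the model load or save between `char *saved = locale_force_c();` and `locale_restore(saved);`. Note that `setlocale` changes the locale of the whole process, not of a single thread, so this pattern is only safe when no other thread depends on the locale at the same time.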

kermitt2 commented 9 years ago

It looks like nobody has complained about this problem since the fix, so let's close it ;)