kawu / concraft-pl

A morphosyntactic tagger for Polish based on conditional random fields
http://zil.ipipan.waw.pl/Concraft
BSD 2-Clause "Simplified" License

Input DAG format #44

Closed: djstrong closed this issue 4 years ago

djstrong commented 4 years ago

Is there a tool that produces the input DAG format from Morfeusz output? In other words, how can I use the pre-trained Concraft-pl 2.0 model with plain text?

BTW, the example files do not work:

$ concraft-pl tag DasModel-2019-10-08.gz -i example/test.dag -o output.dag
concraft-pl: parseRule: input too long in tag ppron3:pl:acc:f:ter:neut:praep
CallStack (from HasCallStack):
  error, called at ./Data/Tagset/Positional.hs:125:22 in tagset-positional-0.3.1-LwkfvYfoWWCIFQIVumc6gj:Data.Tagset.Positional
$ concraft-pl tag DasModel-2019-10-08.gz -i example/train.dag -o output.dag
concraft-pl: parseRule: no value for acm attribute in tag num:pl:acc:m2:ncol
CallStack (from HasCallStack):
  error, called at ./Data/Tagset/Positional.hs:118:27 in tagset-positional-0.3.1-LwkfvYfoWWCIFQIVumc6gj:Data.Tagset.Positional

kawu commented 4 years ago

You can use the Python client code to convert Morfeusz output to the corresponding DAG input. Please let me know if this doesn't work for you.

Thanks for the report concerning the example .dag files. These files assume the example tagset configuration from the example directory. The latest pre-trained model apparently uses a different configuration (consistent with Morfeusz2), hence the tool breaks (which is intentional, although I admit the error message could be better).

djstrong commented 4 years ago

Thanks. I haven't tried Python yet, but tried another input (head -n 2 example/test.dag), which, I guess, has correct tags:

0   1   Dziennik    dziennik    subst:sg:nom:m3         1.000           
0   1   Dziennik    dziennik    subst:sg:acc:m3         0.000

with error:

concraft-pl: [parseRow] expected 11 columns, got 1

CallStack (from HasCallStack):
  error, called at src/NLP/Concraft/Polish/DAG/Format/Base.hs:297:7 in concraft-pl-2.5.0-Dp2TntXF9iF50fra61nj20:NLP.Concraft.Polish.DAG.Format.Base

kawu commented 4 years ago

head -n 2 example/test.dag doesn't work because the obligatory empty line at the end is missing. I updated the README file to make it clear that this blank line is required, although in the long run Concraft should probably handle input without it, too.
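For anyone who wants to cut a short sample out of a larger .dag file, here is a minimal Python sketch that re-adds the terminating empty line described above. The helper names `with_trailing_blank_line` and `head_of_dag` are hypothetical and are not part of Concraft or its Python client.

```python
def with_trailing_blank_line(dag_text: str) -> str:
    """Return dag_text terminated by exactly one empty line."""
    # Drop any newlines already at the end, then add one newline to finish
    # the last row and a second one to form the required empty line.
    return dag_text.rstrip("\n") + "\n\n"


def head_of_dag(path: str, n: int) -> str:
    """Read the first n rows of a DAG file and close them with a blank line.

    Hypothetical helper: a Python stand-in for `head -n N file.dag`
    followed by an empty line.
    """
    with open(path, encoding="utf-8") as handle:
        # zip stops after n lines, so large files are not read in full.
        rows = [line.rstrip("\n") for _, line in zip(range(n), handle)]
    return with_trailing_blank_line("\n".join(rows))
```

Writing the result of `head_of_dag("example/test.dag", 2)` to a file should give input that ends with the blank line the parser expects. Whether the tags in that file match a given model's tagset is a separate question.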

djstrong commented 4 years ago

The Python binding is working as described.

marstona commented 3 years ago

Model: http://zil.ipipan.waw.pl/Concraft?action=AttachFile&do=view&target=concraft-pl-model-SGJP-20200818.gz

Usage:

$ ./concraft-pl server --port=3000 -i concraft-pl-model-SGJP-20200818.gz
Setting phasers to stun... (port 3000) (ctrl-c to quit)

Word: sumie

Input DAG: {"dag":"0\t1\tsumie\tsum\tsubst:sg:loc:m2\t\t\t0.000\t\t\t\n0\t1\tsumie\tsum\tsubst:sg:voc:m2\t\t\t0.000\t\t\t\n0\t1\tsumie\tsuma\tsubst:sg:dat.loc:f\t\t\t0.000\t\t\t"}

Result:

parseRule: no value for cas attribute in tag subst:sg:dat.loc:f
CallStack (from HasCallStack):
  error, called at ./Data/Tagset/Positional.hs:118:27 in tagset-positional-0.3.1-LwkfvYfoWWCIFQIVumc6gj:Data.Tagset.Positional
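For reference, a request body like the one above can be assembled programmatically rather than by hand. The sketch below is an illustration only: `dag_row` and `request_body` are hypothetical names, the 11-column count is taken from the parseRow error earlier in this thread, and the columns left empty simply mirror the empty fields in the example input. It does not send anything to the server.

```python
import json

N_COLUMNS = 11  # column count reported by the parseRow error in this thread


def dag_row(start, end, orth, base, tag, prob=0.0):
    """Build one tab-separated arc; columns unused in the example stay empty."""
    cols = [""] * N_COLUMNS
    # Columns 0-4: start node, end node, word form, base form, tag.
    cols[0:5] = [str(start), str(end), orth, base, tag]
    # Column 7 holds the probability, formatted as in the example (0.000).
    cols[7] = f"{prob:.3f}"
    return "\t".join(cols)


def request_body(rows):
    """Wrap DAG rows in a {"dag": ...} JSON object like the one shown above."""
    return json.dumps({"dag": "\n".join(rows)}, ensure_ascii=False)
```

For example, `request_body([dag_row(0, 1, "sumie", "sum", "subst:sg:loc:m2")])` produces a single-arc body with the same layout as the first row of the input above.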

kawu commented 3 years ago

The problem seems to be that the input DAG contains non-expanded tags, e.g. subst:sg:dat.loc:f with the under-specified case dat.loc. If you are using the Morfeusz Python bindings, you should create the Morfeusz object with expand_tags set to True:

morfeusz = Morfeusz(expand_tags=True)

Does that solve the issue?

marstona commented 3 years ago

Understood. I'm using the Morfeusz Java bindings. I'll check the solution with "expand_tags" and let you know.

marstona commented 3 years ago

I've checked the binding sources and unfortunately there is no feature like "expand tags" in the Java bindings.

marstona commented 3 years ago

With the option MorfeuszUsage.BOTH_ANALYSE_AND_GENERATE I got a more expanded DAG:

{ "dag": "0\t1\tsuma\tsuma\tsubst:sg:nom:f\t\t\t0.000\t\t\t\n0\t1\tsumy\tsuma\tsubst:sg:gen:f\t\t\t0.000\t\t\t\n0\t1\tsumie\tsuma\tsubst:sg:dat.loc:f\t\t\t0.000\t\t\t\n0\t1\tsumę\tsuma\tsubst:sg:acc:f\t\t\t0.000\t\t\t\n0\t1\tsumą\tsuma\tsubst:sg:inst:f\t\t\t0.000\t\t\t\n0\t1\tsumo\tsuma\tsubst:sg:voc:f\t\t\t0.000\t\t\t\n0\t1\tsumy\tsuma\tsubst:pl:nom.acc.voc:f\t\t\t0.000\t\t\t\n0\t1\tsum\tsuma\tsubst:pl:gen:f\t\t\t0.000\t\t\t\n0\t1\tsumom\tsuma\tsubst:pl:dat:f\t\t\t0.000\t\t\t\n0\t1\tsumami\tsuma\tsubst:pl:inst:f\t\t\t0.000\t\t\t\n0\t1\tsumach\tsuma\tsubst:pl:loc:f\t\t\t0.000\t\t\t" }

but it still gives the error:

parseRule: no value for cas attribute in tag subst:pl:nom.acc.voc:f
CallStack (from HasCallStack):
  error, called at ./Data/Tagset/Positional.hs:118:27 in tagset-positional-0.3.1-LwkfvYfoWWCIFQIVumc6gj:Data.Tagset.Positional

kawu commented 3 years ago

Indeed, I can't find anything like the "expand tags" feature in the Java bindings either. MorfeuszUsage.BOTH_ANALYSE_AND_GENERATE doesn't seem related to this issue. The DAG still contains non-expanded tags (e.g. subst:pl:nom.acc.voc:f), which use what the documentation calls "notacja kropkowa" (dot notation). I will forward your question to the maintainers of Morfeusz2.

kawu commented 3 years ago

I've asked at the source: expanding tags is a bonus feature of the Python bindings only. This means that, with the Java bindings, you have to perform the expansion yourself as a post-processing step, before feeding the DAG to Concraft. For instance, an arc with subst:pl:nom.acc.voc:f should be expanded to 3 separate arcs with tags subst:pl:nom:f, subst:pl:acc:f, and subst:pl:voc:f, respectively. If you have trouble doing that, let me know. I could also implement this feature in Concraft as an optional pre-processing step.
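The expansion described above can be sketched in a few lines of Python. This is an illustration, not the Morfeusz or Concraft implementation: the names `expand_tag` and `expand_dag` are hypothetical, and the sketch assumes that a dot inside a tag only ever separates alternative values of one attribute and that the tag sits in the fifth tab-separated column, as in the DAG examples in this thread. The same logic ports directly to Java.

```python
from itertools import product


def expand_tag(tag: str) -> list:
    """Expand dot notation into fully specified tags."""
    # Each colon-separated position may list alternatives joined by dots.
    alternatives = [position.split(".") for position in tag.split(":")]
    # The Cartesian product also covers tags with several dotted positions.
    return [":".join(choice) for choice in product(*alternatives)]


def expand_dag(dag: str, tag_column: int = 4) -> str:
    """Replace every arc carrying a dotted tag with one arc per expanded tag."""
    out = []
    for row in dag.split("\n"):
        cols = row.split("\t")
        if len(cols) <= tag_column:
            # Blank separator line or a row too short to hold a tag: keep as-is.
            out.append(row)
            continue
        for tag in expand_tag(cols[tag_column]):
            out.append("\t".join(cols[:tag_column] + [tag] + cols[tag_column + 1:]))
    return "\n".join(out)
```

With this, `expand_tag("subst:pl:nom.acc.voc:f")` yields the three tags listed above, and `expand_dag` copies the remaining columns of the original arc onto each new arc. Tags without dots pass through unchanged.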

marstona commented 3 years ago

Yes, I've done this already, based on the Concraft training data examples and the Python binding source code. Now it works :)

kawu commented 3 years ago

Great, I'm glad it works :).