Closed djstrong closed 4 years ago
You can use the Python client code to convert Morfeusz output to the corresponding DAG input. Please let me know if this doesn't work for you.
Thanks for the report concerning the example .dag files. These files assume the example tagset configuration from the example directory. The latest pre-trained model apparently uses a different configuration (consistent with Morfeusz2), hence the tool breaks (which is intentional, although I admit the error message could be better).
Thanks. I haven't tried Python yet, but I tried another input (head -n 2 example/test.dag), which, I guess, has correct tags:
0 1 Dziennik dziennik subst:sg:nom:m3 1.000
0 1 Dziennik dziennik subst:sg:acc:m3 0.000
with error:
concraft-pl: [parseRow] expected 11 columns, got 1
CallStack (from HasCallStack):
error, called at src/NLP/Concraft/Polish/DAG/Format/Base.hs:297:7 in concraft-pl-2.5.0-Dp2TntXF9iF50fra61nj20:NLP.Concraft.Polish.DAG.Format.Base
head -n 2 example/test.dag doesn't work because there's no obligatory empty line at the end. I updated the README file to make it clear that this blank line is required, although in the long run Concraft should probably handle input without it, too.
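A minimal sketch of one way to guarantee the terminating empty line before passing a DAG fragment on, for example after cutting rows out of a larger file. The helper name ensure_trailing_blank_line is hypothetical and not part of Concraft; it only normalizes the end of the text.

```python
def ensure_trailing_blank_line(dag_text: str) -> str:
    """Return dag_text ending with its last row followed by one empty line.

    Hypothetical helper (not part of concraft-pl): it strips any newlines
    already at the end and appends exactly two, so the last row is
    terminated and followed by a single blank line.
    """
    return dag_text.rstrip("\n") + "\n\n"


if __name__ == "__main__":
    import sys

    # Read a DAG fragment from stdin and write it back with the blank line,
    # e.g. for the output of: head -n 2 example/test.dag
    sys.stdout.write(ensure_trailing_blank_line(sys.stdin.read()))
```

The script form reads the fragment from standard input, so it can sit between head and whatever consumes the DAG.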
The Python binding is working as described.
Model:
http://zil.ipipan.waw.pl/Concraft?action=AttachFile&do=view&target=concraft-pl-model-SGJP-20200818.gz
Usage:
$ ./concraft-pl server --port=3000 -i concraft-pl-model-SGJP-20200818.gz
Setting phasers to stun... (port 3000) (ctrl-c to quit)
Word: sumie
Input DAG:
{"dag":"0\t1\tsumie\tsum\tsubst:sg:loc:m2\t\t\t0.000\t\t\t\n0\t1\tsumie\tsum\tsubst:sg:voc:m2\t\t\t0.000\t\t\t\n0\t1\tsumie\tsuma\tsubst:sg:dat.loc:f\t\t\t0.000\t\t\t"}
Result:
parseRule: no value for cas attribute in tag subst:sg:dat.loc:f
CallStack (from HasCallStack):
error, called at ./Data/Tagset/Positional.hs:118:27 in tagset-positional-0.3.1-LwkfvYfoWWCIFQIVumc6gj:Data.Tagset.Positional
The problem seems to be that the input DAG contains non-expanded tags, e.g. subst:sg:dat.loc:f with the under-specified case dat.loc. If you are using the Morfeusz Python bindings, you should create the Morfeusz object with expand_tags set to True as described here:
morfeusz = Morfeusz(expand_tags=True)
Does that solve the issue?
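For illustration, here is a rough sketch of turning analysis results into the tab-separated rows and JSON body shown above. It assumes each interpretation arrives as a tuple (start, end, (orth, lemma, tag, ...)), which is how the Morfeusz Python bindings are understood to report analyses. The function names are hypothetical, and the column layout (five filled columns, the probability in the eighth, the rest empty, 11 in total) is copied from the example DAG in this thread rather than from a format specification.

```python
import json
from typing import Iterable, Sequence, Tuple

# Assumed shape of one interpretation: (start, end, (orth, lemma, tag, ...)).
Interp = Tuple[int, int, Sequence]


def to_dag_row(start: int, end: int, orth: str, lemma: str, tag: str,
               prob: float = 0.0) -> str:
    """Build one 11-column row, mirroring the rows posted in this thread."""
    cols = [str(start), str(end), orth, lemma, tag,
            "", "", f"{prob:.3f}", "", "", ""]
    return "\t".join(cols)


def to_request_body(analysis: Iterable[Interp]) -> str:
    """Join all rows with newlines and wrap them as {"dag": "..."}."""
    rows = [to_dag_row(start, end, interp[0], interp[1], interp[2])
            for start, end, interp in analysis]
    return json.dumps({"dag": "\n".join(rows)}, ensure_ascii=False)


if __name__ == "__main__":
    # Hand-written sample in the assumed shape; with the real bindings this
    # list would come from the Morfeusz object created with expand_tags=True.
    sample = [
        (0, 1, ("sumie", "sum", "subst:sg:loc:m2")),
        (0, 1, ("sumie", "suma", "subst:sg:dat:f")),
    ]
    print(to_request_body(sample))
```

With expand_tags=True the tags should already be fully specified, so no further expansion is needed before building the rows.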
Understood. I'm using the Morfeusz Java binding. I'll check the solution with "expand_tags" and let you know.
I've checked the binding sources and unfortunately there is no feature like "expand tags" in the Java binding.
With the option MorfeuszUsage.BOTH_ANALYSE_AND_GENERATE I got a more expanded DAG:
{ "dag": "0\t1\tsuma\tsuma\tsubst:sg:nom:f\t\t\t0.000\t\t\t\n0\t1\tsumy\tsuma\tsubst:sg:gen:f\t\t\t0.000\t\t\t\n0\t1\tsumie\tsuma\tsubst:sg:dat.loc:f\t\t\t0.000\t\t\t\n0\t1\tsumę\tsuma\tsubst:sg:acc:f\t\t\t0.000\t\t\t\n0\t1\tsumą\tsuma\tsubst:sg:inst:f\t\t\t0.000\t\t\t\n0\t1\tsumo\tsuma\tsubst:sg:voc:f\t\t\t0.000\t\t\t\n0\t1\tsumy\tsuma\tsubst:pl:nom.acc.voc:f\t\t\t0.000\t\t\t\n0\t1\tsum\tsuma\tsubst:pl:gen:f\t\t\t0.000\t\t\t\n0\t1\tsumom\tsuma\tsubst:pl:dat:f\t\t\t0.000\t\t\t\n0\t1\tsumami\tsuma\tsubst:pl:inst:f\t\t\t0.000\t\t\t\n0\t1\tsumach\tsuma\tsubst:pl:loc:f\t\t\t0.000\t\t\t" }
but it still gives the error:
parseRule: no value for cas attribute in tag subst:pl:nom.acc.voc:f
CallStack (from HasCallStack):
error, called at ./Data/Tagset/Positional.hs:118:27 in tagset-positional-0.3.1-LwkfvYfoWWCIFQIVumc6gj:Data.Tagset.Positional
Indeed, I can't find anything like the "expand tags" feature in the Java bindings either. MorfeuszUsage.BOTH_ANALYSE_AND_GENERATE doesn't seem related to this issue. The DAG still contains non-expanded tags (e.g. subst:pl:nom.acc.voc:f), which use what is called "notacja kropkowa" (dot notation) in the documentation. I will forward your question to the maintainers of Morfeusz2.
I've asked at the source: expanding tags is a bonus feature of the Python bindings only. This means that, with the Java bindings, you have to perform the expansion yourself as a post-processing step before feeding the DAG to concraft. For instance, an arc with subst:pl:nom.acc.voc:f should be expanded to 3 separate arcs with tags subst:pl:nom:f, subst:pl:acc:f, and subst:pl:voc:f, respectively. If you have trouble doing that, let me know. I could also implement this feature in concraft as an optional pre-processing step.
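A minimal sketch of that expansion, written in Python for brevity (the same logic ports directly to Java). It assumes that attributes are separated by ":" and that a "." inside an attribute always separates alternative values, as in the examples above. The names expand_tag and expand_row are hypothetical and not part of concraft or Morfeusz.

```python
from itertools import product
from typing import List


def expand_tag(tag: str) -> List[str]:
    """Expand a dot-notation tag into all fully specified tags.

    Every ':'-separated attribute is split on '.', and the Cartesian
    product of the alternatives is joined back into plain tags.
    """
    alternatives = [attr.split(".") for attr in tag.split(":")]
    return [":".join(combo) for combo in product(*alternatives)]


def expand_row(row: str) -> List[str]:
    """Expand one tab-separated DAG row into one row per expanded tag.

    Assumes the tag sits in the fifth column, as in the rows shown in
    this thread; all other columns are copied unchanged.
    """
    cols = row.split("\t")
    return ["\t".join(cols[:4] + [tag] + cols[5:])
            for tag in expand_tag(cols[4])]


if __name__ == "__main__":
    for tag in expand_tag("subst:pl:nom.acc.voc:f"):
        print(tag)
    # subst:pl:nom:f
    # subst:pl:acc:f
    # subst:pl:voc:f
```

Using a Cartesian product also covers tags where more than one attribute is under-specified, at the cost of producing one arc per combination.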
Yes, I've already done this, based on the concraft training data examples and the Python binding source code. Now it works :)
Great, I'm glad it works :).
Is there a tool to convert Morfeusz output into the input DAG format? In other words: how can I use the pre-trained Concraft-pl 2.0 model with plain text?
BTW example files do not work: