Open dan-zeman opened 8 years ago
I tracked the problem down but I don't see an easy solution. The serialize_iset()
method in TC::Node::Interset calls $self->set_attr("iset/prontype", ['int', 'rel']);
. The set_attr()
method, defined in TC::Node, first converts the list of values to Treex::PML::List. Later on, it accesses the iset
feature structure as a plain hash (instead of as Lingua::Interset::FeatureStructure) and ruthlessly pushes the converted value there.
I understand that this hack helps the writer make sure that the list of values appears in the Treex file as <prontype>int|rel</prontype>
.
But I don't like the idea that any code outside of the Lingua::Interset::FeatureStructure (or even outside Lingua::Interset) accesses the attributes of the FeatureStructure object directly.
Any thoughts?
Do you have a minimal test case? Some treex oneliner?
Yes. First a correct example:
perl -e 'print "1\twho\twho\tPRON\t\t\t0\troot\t\t\n\n"' | treex -Len Read::CoNLLU Util::Eval anode='$.iset->set("prontype", "int|rel")' Write::CoNLLU
prints this:
# sent_id a_tree-en-s1-root 1 who who PRON _ PronType=Int,Rel 0 root _ _
Then add the bug (Write::Treex):
perl -e 'print "1\twho\twho\tPRON\t\t\t0\troot\t\t\n\n"' | treex -Len Read::CoNLLU Util::Eval anode='$.iset->set("prontype", "int|rel")' Write::Treex to=/dev/null Write::CoNLLU
prints this:
# sent_id a_tree-en-s1-root 1 who who PRON _ PronType=Int|rel 0 root _ _
The
serialize_iset()
method in TC::Node::Interset calls$self->set_attr("iset/prontype", ['int', 'rel']);
No, it calls $self->set_attr("iset/prontype", 'int|rel');
because $node->get_iset($feature)
returns always string ('int|rel'
).
This is needed because the PML schema says <member name="prontype"><cdata format="any"/></member>
. If an array reference is saved into format=any, you would get <prontype>ARRAY(0xd5e998)</prontype>
in the treex file.
One workaround would be to check if Write::Treex (there is an ugly way how to do it) is the last block and if not, then deserialize iset here.
I will think about a better solution. This "blocks after Write::Treex" affects also wild attributes.
Can't it be deserialized always, after the file's been written?
NB: The problem is not urgent for me. Once I know that the bug is not directly in Interset, I can easily adjust the scenario and work around it.
But of course we want to solve it eventually.
Can't it be deserialized always, after the file's been written?
That would slow down all Write::Treex. More writers in a scenario is a rare usecase (but fully legal, so I agree we should fix this issue).
I know it would slow it down (although I do not know how much) but does it really matter if it is the last block in the scenario?
I know it would slow it down (although I do not know how much)
I did no benchmark, but you need to access all nodes in the document and try to reset all get_nonempty_features()
.
does it really matter if it is the last block
In most cases yes. If more documents are to be processed by the same job, you need to wait until the writer finishes. Even in single-document processing on one machine I always wait until treex finishes before I (or some script) try to access the output treex file.
This bug surfaces when Interset is used after the document is written. Normally it does not happen because Write::Treex tends to be the last block of the scenario. But try to write the document in two different formats, i.e. your last two blocks will be Write::Treex and Write::CoNLLU (in this order). And make sure that there is a node with a multi-valued interset feature, e.g.
prontype=int|rel
.The Treex file is written correctly while CoNLL-U is not. Apparently when Write::CoNLLU takes its turn, the Lingua::Interset::FeatureStructure object is already corrupt. Somehow the "int|rel" string makes it down to the hash value, which would not happen if the set() method of the FeatureStructure was used.
Therefore I suspect that there is a bug in one of the methods of the
Node::Interset
role.