ufal / treex

Treex NLP framework
33 stars 6 forks source link

Write::Treex renders the iset structure invalid #23

Open dan-zeman opened 8 years ago

dan-zeman commented 8 years ago

This bug surfaces when Interset is used after the document is written. Normally it does not happen because Write::Treex tends to be the last block of the scenario. But try to write the document in two different formats, i.e. your last two blocks will be Write::Treex and Write::CoNLLU (in this order). And make sure that there is a node with a multi-valued interset feature, e.g. prontype=int|rel.

The Treex file is written correctly while CoNLL-U is not. Apparently when Write::CoNLLU takes its turn, the Lingua::Interset::FeatureStructure object is already corrupt. Somehow the "int|rel" string makes it down to the hash value, which would not happen if the set() method of the FeatureStructure was used.

Therefore I suspect that there is a bug in one of the methods of the Node::Interset role.

dan-zeman commented 8 years ago

I tracked the problem down but I don't see an easy solution. The serialize_iset() method in TC::Node::Interset calls $self->set_attr("iset/prontype", ['int', 'rel']);. The set_attr() method, defined in TC::Node, first converts the list of values to Treex::PML::List. Later on, it accesses the iset feature structure as a plain hash (instead of as Lingua::Interset::FeatureStructure) and ruthlessly pushes the converted value there.

I understand that this hack helps the writer make sure that the list of values appears in the Treex file as <prontype>int|rel</prontype>.

But I don't like the idea that any code outside of the Lingua::Interset::FeatureStructure (or even outside Lingua::Interset) accesses the attributes of the FeatureStructure object directly.

Any thoughts?

martinpopel commented 8 years ago

Do you have a minimal test case? Some treex oneliner?

dan-zeman commented 8 years ago

Yes. First a correct example:

perl -e 'print "1\twho\twho\tPRON\t\t\t0\troot\t\t\n\n"' | treex -Len Read::CoNLLU Util::Eval anode='$.iset->set("prontype", "int|rel")' Write::CoNLLU

prints this:

# sent_id a_tree-en-s1-root
1       who     who     PRON    _       PronType=Int,Rel        0       root    _       _

Then add the bug (Write::Treex):

perl -e 'print "1\twho\twho\tPRON\t\t\t0\troot\t\t\n\n"' | treex -Len Read::CoNLLU Util::Eval anode='$.iset->set("prontype", "int|rel")' Write::Treex to=/dev/null Write::CoNLLU

prints this:

# sent_id a_tree-en-s1-root
1       who     who     PRON    _       PronType=Int|rel        0       root    _       _
martinpopel commented 8 years ago

The serialize_iset() method in TC::Node::Interset calls $self->set_attr("iset/prontype", ['int', 'rel']);

No, it calls $self->set_attr("iset/prontype", 'int|rel'); because $node->get_iset($feature) returns always string ('int|rel'). This is needed because the PML schema says <member name="prontype"><cdata format="any"/></member>. If an array reference is saved into format=any, you would get <prontype>ARRAY(0xd5e998)</prontype> in the treex file.

One workaround would be to check if Write::Treex (there is an ugly way how to do it) is the last block and if not, then deserialize iset here.

I will think about a better solution. This "blocks after Write::Treex" affects also wild attributes.

dan-zeman commented 8 years ago

Can't it be deserialized always, after the file's been written?

dan-zeman commented 8 years ago

NB: The problem is not urgent for me. Once I know that the bug is not directly in Interset, I can easily adjust the scenario and work around it.

But of course we want to solve it eventually.

martinpopel commented 8 years ago

Can't it be deserialized always, after the file's been written?

That would slow down all Write::Treex. More writers in a scenario is a rare usecase (but fully legal, so I agree we should fix this issue).

dan-zeman commented 8 years ago

I know it would slow it down (although I do not know how much) but does it really matter if it is the last block in the scenario?

martinpopel commented 8 years ago

I know it would slow it down (although I do not know how much)

I did no benchmark, but you need to access all nodes in the document and try to reset all get_nonempty_features().

does it really matter if it is the last block

In most cases yes. If more documents are to be processed by the same job, you need to wait until the writer finishes. Even in single-document processing on one machine I always wait until treex finishes before I (or some script) try to access the output treex file.