ufal / udpipe

UDPipe: Trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files
Mozilla Public License 2.0

wrong parent & child attachment [udapy -s ud.FixPunct < in.conllu > out.conllu] #189

Closed Shasetty closed 5 months ago

Shasetty commented 5 months ago
I confirm the right brackets following FDA and NDA are attached to a wrong parent (i.e. not to FDA and NDA, respectively), when parsing this very long sentence with english-ewt-ud-2.12-230717. You can use [udapy -s ud.FixPunct < in.conllu > out.conllu](https://github.com/udapi/udapi-python/blob/master/udapi/block/ud/fixpunct.py) to fix it.

However, the output in "Show Trees" is exactly the same as in "Show Table" (and as the CoNLL-U in "Output Text"), so there is no bug in UDPipe. These GitHub issues are for reporting bugs in the software. You cannot expect 100% parsing accuracy from all models.

BTW: When using e.g. the english-gum-ud-2.12-230717 model, the brackets enclosing FDA and NDA are attached correctly. This suggests GUM is better training data than EWT in this respect. Indeed, when applying ud.FixPunct on en_gum-ud-train.conllu, there are only 39 errors fixed, but on en_ewt-ud-train.conllu, there are 7496 bugs. So maybe the authors of EWT should fix these bugs and the new version of UDPipe will be better. However, that should not be discussed here, but at https://github.com/UniversalDependencies/UD_English-EWT/issues

Originally posted by @martinpopel in https://github.com/ufal/udpipe/issues/175#issuecomment-1768050976

Shasetty commented 5 months ago

Hi sir, following your advice, I ran udapy -s ud.FixPunct < in.conllu > out.conllu, but still there are no changes in the parent & child relationship (wherever it is wrong).

https://drive.google.com/drive/folders/1CbrXDDpQfx6TJrguGDGrJh0UTyCojW3L?usp=drive_link I have placed the 2.10 & 2.12 input & output CoNLL-U files for your analysis at the above Google Drive link (as I cannot paste CoNLL-U files here).


Steps followed for setup:

(The steps below are copied from the Udapi GitHub page.)

    cd
    git clone https://github.com/udapi/udapi-python.git
    pip3 install --user -r udapi-python/requirements.txt
    echo '## Use Udapi from ~/udapi-python/ ##' >> ~/.bashrc
    echo 'export PATH="$HOME/udapi-python/bin:$PATH"' >> ~/.bashrc
    echo 'export PYTHONPATH="$HOME/udapi-python/:$PYTHONPATH"' >> ~/.bashrc
    source ~/.bashrc # or open new bash

I obtained the CoNLL-U files from the Lindat UDPipe website and then executed the command below:

udapy -s ud.FixPunct < in.conllu > out.conllu


Text considered:

On November 29, 2022, Twin Ridge, Carbon Revolution Public Limited Company (formerly known as Poppetell Limited), a public limited company incorporated in Ireland with registered number 607450 (“MergeCo”), Carbon Revolution and Poppettell Merger Sub, a Cayman Islands exempted company and wholly-owned subsidiary of MergeCo (“Merger Sub”), entered into a Business Combination Agreement (as it may be amended or supplemented from time to time, the “Business Combination Agreement”), pursuant to which, among other things, Twin Ridge will be merged with and into Merger Sub, with Merger Sub surviving as a wholly-owned subsidiary of MergeCo (the “Merger”), with shareholders of Twin Ridge receiving ordinary shares of MergeCo, par value $0.0001 (the “MergeCo Ordinary Shares”), in exchange for their existing Twin Ridge Ordinary Shares (as defined below) and existing Twin Ridge warrant holders having their warrants automatically exchanged by assumption by MergeCo of the obligations under such warrants, including to become exercisable in respect of MergeCo Ordinary Shares instead of Twin Ridge Ordinary Shares, subject to, among other things, the approval of Twin Ridge’s shareholders.

Please help me fix the wrong parent & child relationships.

martinpopel commented 5 months ago

First, this issue does not belong here because there is no UDPipe software bug reported. As explained at https://github.com/ufal/udpipe/issues/175#issuecomment-1768291600, we cannot expect such a ridiculous sentence to be parsed without any errors by UDPipe.

If there are any issues with using udapy and ud.FixPunct, you can report them in the udapi-python repo. However, as I explain below, there is no Udapi software bug either.

I confirm you used ud.FixPunct correctly. I've obtained exactly the same results as you after running udapy -s ud.FixPunct < 2_10_input.conllu > 2_10_output.conllu.

> still there are no changes in the parent & child relationship

No. You can use e.g. diff 2_10_input.conllu 2_10_output.conllu (or vimdiff) to see there are 21 changes done by ud.FixPunct.
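As a complement to diff, the re-attachments can also be listed token by token. The following is a minimal sketch in plain Python, not part of Udapi or UDPipe: the helper names (`read_tokens`, `changed_heads`, `row`) and the three-token sample are made up for illustration, and it assumes both files contain the same tokens in the same order, which should hold when only attachments are changed.

```python
def read_tokens(conllu_text):
    """Parse CoNLL-U text into a flat list of dicts, one per ordinary token."""
    tokens = []
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (e.g. 1-2) and empty nodes (e.g. 1.1)
        tokens.append({"id": cols[0], "form": cols[1],
                       "upos": cols[3], "head": cols[6]})
    return tokens


def changed_heads(before_text, after_text):
    """Return (before, after) token pairs whose HEAD column differs."""
    before = read_tokens(before_text)
    after = read_tokens(after_text)
    if len(before) != len(after):
        raise ValueError("files do not contain the same number of tokens")
    return [(b, a) for b, a in zip(before, after) if b["head"] != a["head"]]


def row(tok_id, form, upos, head, deprel):
    """Build one 10-column CoNLL-U line (hypothetical helper for the sample)."""
    return "\t".join([str(tok_id), form, "_", upos, "_", "_",
                      str(head), deprel, "_", "_"])


# Made-up sample: the bracket moves from the verb (1) to "FDA" (2).
BEFORE = "\n".join([row(1, "approved", "VERB", 0, "root"),
                    row(2, "FDA", "PROPN", 1, "obj"),
                    row(3, ")", "PUNCT", 1, "punct")]) + "\n\n"
AFTER = "\n".join([row(1, "approved", "VERB", 0, "root"),
                   row(2, "FDA", "PROPN", 1, "obj"),
                   row(3, ")", "PUNCT", 2, "punct")]) + "\n\n"

changes = changed_heads(BEFORE, AFTER)
for old, new in changes:
    print(f'{old["id"]} {old["form"]!r}: head {old["head"]} -> {new["head"]}')
```

To apply this to real files, read each one with `open(path, encoding="utf-8").read()` and pass the two strings to `changed_heads`. The count may differ from the number of lines diff reports if a line changed in a column other than HEAD.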

I don't see any errors in punctuation attachment in the output CoNLL-U files (both for 2_10_output.conllu and 2_12_output.conllu). You can use udapy write.Html < 2_10_output.conllu > 2_10_output.html to get the "js-treex-view.js" visualization. You can also highlight all nonprojective nodes using udapy -H util.Mark node='node.is_nonprojective()' < 2_10_output.conllu > 2_10_output-nonprojective.html and check that all the punctuation tokens are attached projectively (unlike in the input files).
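For readers who want to see what "attached projectively" means in code, here is a minimal sketch in plain Python. It is a simplified stand-in, not Udapi's implementation: it treats a node as non-projective when at least one token lying between the node and its parent is not dominated by that parent. The function names and the four-token head maps are made up for illustration.

```python
def is_descendant(node, ancestor, heads):
    """True if `ancestor` dominates `node`; `heads` maps token id -> head id (0 = root)."""
    seen = set()
    while node != 0 and node not in seen:  # `seen` guards against cycles
        seen.add(node)
        node = heads[node]
        if node == ancestor:
            return True
    return False


def nonprojective_nodes(heads):
    """Return ids of tokens whose edge to their parent crosses a non-dominated token."""
    result = []
    for node, parent in heads.items():
        if parent == 0:
            continue  # tokens attached to the root are never reported
        low, high = sorted((node, parent))
        if any(not is_descendant(between, parent, heads)
               for between in range(low + 1, high)):
            result.append(node)
    return result


# Made-up example: token 4 hangs on token 2, but token 3 in between hangs on token 1.
heads_before = {1: 0, 2: 1, 3: 1, 4: 2}
# After re-attaching token 3 to token 2, every edge is projective.
heads_after = {1: 0, 2: 1, 3: 2, 4: 2}

print(nonprojective_nodes(heads_before))
print(nonprojective_nodes(heads_after))
```

Restricting the result to tokens whose UPOS is PUNCT would give the check described above for punctuation only.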

> (wherever it is wrong)

ud.FixPunct corrects punctuation only, as the name suggests. So of course, there are still many other parsing errors left, including strange non-projectivities (e.g. the 126th token "the" and the 128th token "Merger", but these are not punctuation symbols).

foxik commented 5 months ago

Closing, as there is no bug in UDPipe.

It is expected that there will be some errors in the output -- according to https://ufal.mff.cuni.cz/udpipe/2/models#universal_dependencies_212_models, for the model english-gum-ud-2.12-230717 the UAS is 93.72, so even on the in-domain test set, we predict ~6.3% of edges incorrectly; on real data, the number of errors will probably be even larger.
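The arithmetic behind that estimate can be spelled out in a few lines. This is only a back-of-the-envelope sketch: the token count of 200 is a hypothetical round number, not the measured length of the sentence in this issue, and it assumes errors are spread evenly over tokens, which is unlikely to hold for very long out-of-domain sentences.

```python
UAS = 93.72  # unlabeled attachment score quoted above, in percent


def expected_wrong_edges(n_tokens, uas=UAS):
    """Expected number of wrongly attached tokens under a uniform error rate."""
    error_rate = (100.0 - uas) / 100.0
    return n_tokens * error_rate


# Hypothetical sentence length, chosen only for illustration.
estimate = expected_wrong_edges(200)
print(round(estimate, 2))
```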

Shasetty commented 1 month ago

Hi Grammaticians,

I am reopening this ticket in the hope of finding a workaround.

As you know, my earlier ticket was closed with an explanation of the limitations of the present software.

I now request that you provide a workaround for texts where parent and child are wrongly connected.

Points to consider:

1) Grammar rules should always be followed.
2) Break up the single sentence into multiple sentences.
3) Run each sentence individually on the 2.10 version.
4) Merge all the files.
5) While merging, change the parent and child dependency relationships (as needed).
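To make steps 4 and 5 concrete, here is a minimal sketch in plain Python of what merging separately parsed fragments would involve. It is an illustration only, not a feature of UDPipe or Udapi: `merge_fragments` and `root_parents` are hypothetical names, the two fragments are made up, and the parent for each fragment root has to be chosen by hand. The sketch only renumbers ids and heads; it does not adjust dependency labels or check that the merged tree is linguistically correct.

```python
def merge_fragments(fragments, root_parents):
    """Concatenate parsed fragments into one sentence.

    fragments: list of fragments, each a list of (form, head) pairs with
        1-based heads local to the fragment and 0 marking the fragment root.
    root_parents: one entry per fragment giving the id, in the merged
        sentence, that the fragment root should attach to (0 = sentence root).
    """
    merged = []
    offset = 0
    for index, fragment in enumerate(fragments):
        for local_id, (form, head) in enumerate(fragment, start=1):
            if head == 0:
                new_head = root_parents[index]  # manually chosen attachment
            else:
                new_head = head + offset        # shift local head into merged ids
            merged.append((local_id + offset, form, new_head))
        offset += len(fragment)
    return merged


# Made-up fragments: "Ridge merged" and "with Sub".
fragments = [
    [("Ridge", 2), ("merged", 0)],
    [("with", 2), ("Sub", 0)],
]
# The first fragment keeps the sentence root; the second hangs on token 2.
merged = merge_fragments(fragments, root_parents=[0, 2])
for token in merged:
    print(token)
```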