udapi / udapi-python

Python framework for processing Universal Dependencies data
GNU General Public License v3.0

Need help attaching text and brackets to their correct parents #122

Closed Shasetty closed 7 months ago

Shasetty commented 7 months ago

Text used: On November 29, 2022, Twin Ridge, Carbon Revolution Public Limited Company (formerly known as Poppetell Limited), a public limited company incorporated in Ireland with registered number 607450 (“MergeCo”), Carbon Revolution and Poppettell Merger Sub, a Cayman Islands exempted company and wholly-owned subsidiary of MergeCo (“Merger Sub”), entered into a Business Combination Agreement (as it may be amended or supplemented from time to time, the “Business Combination Agreement”), pursuant to which, among other things, Twin Ridge will be merged with and into Merger Sub, with Merger Sub surviving as a wholly-owned subsidiary of MergeCo (the “Merger”), with shareholders of Twin Ridge receiving ordinary shares of MergeCo, par value $0.0001 (the “MergeCo Ordinary Shares”), in exchange for their existing Twin Ridge Ordinary Shares (as defined below) and existing Twin Ridge warrant holders having their warrants automatically exchanged by assumption by MergeCo of the obligations under such warrants, including to become exercisable in respect of MergeCo Ordinary Shares instead of Twin Ridge Ordinary Shares, subject to, among other things, the approval of Twin Ridge’s shareholders.

Parser used: version UD 2.10, model english-ewt-ud-2.10-220711

URL: https://lindat.mff.cuni.cz/services/udpipe/

The brackets and words mentioned below are not attached to their correct parents. Bracket problem, in this part of the text above: of MergeCo (“Merger Sub”). Problem with a few words, in this part of the text above: “Business Combination Agreement”).


Please suggest how I can fix this issue.

dan-zeman commented 7 months ago

This issue probably does not belong here because it is about UDPipe rather than Udapi. Anyway, I think you should try a newer model. With english-ewt-ud-2.12-230717, I got better results on this terrible sentence.

Shasetty commented 7 months ago

Thank you for answering.

The text given above is a single paragraph. The model english-ewt-ud-2.10-220711 treats it as a single paragraph, while english-ewt-ud-2.12-230717 treats it as two paragraphs.

Even with english-ewt-ud-2.12-230717, the text and brackets are not attached to their correct parents.

As I am using english-ewt-ud-2.10-220711, I would like a solution for that model. Can you help me?

dan-zeman commented 7 months ago

A model is always just a model. It depends on the data it was trained on, and it will rarely give you 100% correct output. If you are not satisfied with the results, you can either search for a better model, or write a program that will postprocess the parser's output and fix the errors, perhaps based on some heuristics.
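To make the postprocessing idea concrete, here is a minimal sketch of one possible heuristic in plain Python (standard library only). It is not the ud.FixPunct algorithm and not part of Udapi. It assumes parser output in CoNLL-U with numeric HEAD values, and it assumes a simplified rule: a matched pair of brackets or curly quotes should both attach to the single non-punctuation token inside the pair whose head lies outside the pair. The names `PAIRS`, `parse_sentence`, `fix_paired_punct` and the toy sentence are hypothetical.

```python
# Hypothetical heuristic sketch; NOT the ud.FixPunct implementation.
PAIRS = {"(": ")", "[": "]", "“": "”"}  # opening form -> closing form

def parse_sentence(text):
    """Split one CoNLL-U sentence into rows of 10 columns (basic tokens only)."""
    rows = []
    for line in text.strip().splitlines():
        cols = line.split("\t")
        if cols[0].isdigit():  # skips comments, ranges like 1-2, empty nodes like 1.1
            rows.append(cols)
    return rows

def fix_paired_punct(rows):
    """Attach each matched pair to the head of the material it encloses."""
    stack = []  # indices of opening marks that are still unmatched
    for i, cols in enumerate(rows):
        form = cols[1]
        if form in PAIRS:
            stack.append(i)
        elif stack and form == PAIRS[rows[stack[-1]][1]]:
            open_i = stack.pop()
            lo, hi = int(rows[open_i][0]), int(cols[0])
            # Non-punctuation tokens inside the pair whose head is outside it.
            roots = [r for r in rows[open_i + 1:i]
                     if r[3] != "PUNCT" and not lo < int(r[6]) < hi]
            if len(roots) == 1:  # only act when the enclosed head is unambiguous
                for mark in (rows[open_i], cols):
                    mark[6], mark[7] = roots[0][0], "punct"
    return rows

# Toy input (assumed, not real parser output): ")" is attached to token 1.
# Columns are written with spaces here and converted to tabs.
SAMPLE = "\n".join("\t".join(line.split()) for line in """
1 subsidiary subsidiary NOUN _ _ 0 root _ _
2 of of ADP _ _ 3 case _ _
3 MergeCo MergeCo PROPN _ _ 1 nmod _ _
4 ( ( PUNCT _ _ 6 punct _ _
5 Merger Merger PROPN _ _ 6 compound _ _
6 Sub Sub PROPN _ _ 3 appos _ _
7 ) ) PUNCT _ _ 1 punct _ _
""".strip().splitlines())

fixed = fix_paired_punct(parse_sentence(SAMPLE))
print("\n".join("\t".join(row) for row in fixed))
```

In the toy input both brackets end up attached to token 6 (Sub). When the enclosed span has no single candidate head, the sketch leaves the attachment untouched rather than guessing.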

Shasetty commented 7 months ago

Thank you for replying.

(ewt-ud-2.10-220711) is the best of the models available online. Many of the grammatical relations described in the book A Comprehensive Grammar of the English Language match the output of the 2.10 version.

I am not as skilled in grammar as you are, and I also lack knowledge of Python.

When I raised the problems with the model (ewt-ud-2.12-230717) in https://github.com/ufal/udpipe/issues/175, the suggested solution was: [udapy -s ud.FixPunct < in.conllu > out.conllu].

Comparing (ewt-ud-2.10-220711) and (ewt-ud-2.12-230717), I found the (ewt-ud-2.10-220711) version better.

Please provide a solution for (ewt-ud-2.10-220711).

martinpopel commented 7 months ago

As I explained in https://github.com/ufal/udpipe/issues/189#issuecomment-2059054030, udapy -s ud.FixPunct < in.conllu > out.conllu does indeed correct the wrongly (non-projectively) attached punctuation tokens (even in this ridiculously long sentence), so I don't see any Udapi-related bug here and I am closing this issue.

Note that, based on your original issue, I have fixed the wrongly attached punctuation in EWT (yes, using ud.FixPunct), and this was released in UD_English-EWT 2.13 in November 2023. So we just need to wait until a UDPipe model trained on UD_English-EWT 2.13 or newer is published, and then I hope there will be fewer non-projective punctuation problems in its output (although I cannot guarantee zero problems, so maybe we will still need to use ud.FixPunct).
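To check whether a given parse still contains non-projective punctuation, a rough test can be written in plain Python. The sketch below is an illustration only, not Udapi code: the function name and the toy head lists are hypothetical. It uses the usual definition of a non-projective arc (some token strictly between the head and the dependent is not a descendant of the head) and reports only punctuation tokens whose own arc is non-projective.

```python
def nonprojective_punct(heads, upos):
    """Return 1-based IDs of PUNCT tokens whose arc to the head is non-projective.

    heads[i] is the 1-based head of token i+1 (0 means root);
    upos[i] is the UPOS tag of token i+1.
    """
    def is_descendant(node, ancestor):
        seen = set()  # guards against cycles in malformed input
        while node != 0 and node not in seen:
            seen.add(node)
            node = heads[node - 1]
            if node == ancestor:
                return True
        return False

    bad = []
    for dep, (head, tag) in enumerate(zip(heads, upos), start=1):
        if tag != "PUNCT" or head == 0:
            continue
        lo, hi = sorted((dep, head))
        # Every token strictly between head and dependent must descend from head.
        if any(not is_descendant(k, head) for k in range(lo + 1, hi)):
            bad.append(dep)
    return bad

# Toy tokens (assumed): subsidiary of MergeCo ( Merger Sub )
UPOS = ["NOUN", "ADP", "PROPN", "PUNCT", "PROPN", "PROPN", "PUNCT"]
before = [0, 3, 1, 2, 6, 3, 6]  # "(" wrongly attached to "of"
after = [0, 3, 1, 6, 6, 3, 6]   # "(" attached to "Sub"

print(nonprojective_punct(before, UPOS))
print(nonprojective_punct(after, UPOS))
```

With the first head list, token 4 is reported because MergeCo sits between "of" and "(" without being a descendant of "of". With the second list nothing is reported. A punctuation token can also make another arc non-projective; this sketch does not check that case.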