ufal / udpipe

UDPipe: Trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files
Mozilla Public License 2.0

Brackets (punct) are not properly attached to their heads in show tables (english-ewt-ud-2.12-230717) #175

Closed Shasetty closed 11 months ago

Shasetty commented 11 months ago

Text : MALVERN, Pa., Aug. 09, 2023 (GLOBE NEWSWIRE) -- Galera Therapeutics, Inc. (Nasdaq: GRTX), a clinical-stage biopharmaceutical company focused on developing and commercializing a pipeline of novel, proprietary therapeutics that have the potential to transform radiotherapy in cancer, today announced that it has received a Complete Response Letter (CRL) from the U.S.Food and Drug Administration (FDA) regarding the Company’s New Drug Application (NDA) for avasopasem manganese (avasopasem) for radiotherapy-induced severe oral mucositis (SOM) in patients with head and neck cancer undergoing standard-of-care treatment.

Correct output in "Show Trees".

Wrong output in "Show Table" and "Output Text" for (FDA) and (NDA), using https://lindat.mff.cuni.cz/services/udpipe/

martinpopel commented 11 months ago

I confirm the right brackets following FDA and NDA are attached to a wrong parent (i.e. not to FDA and NDA, respectively), when parsing this very long sentence with english-ewt-ud-2.12-230717. You can use udapy -s ud.FixPunct < in.conllu > out.conllu to fix it.
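To spot such cases in CoNLL-U output, one rough check is whether the two brackets of a pair share the same HEAD, which is what the UD punctuation guidelines generally expect. The following is only a minimal sketch and not what ud.FixPunct does internally; the function names and the four-token sample are hypothetical and hand-written for illustration, not actual UDPipe output.

```python
def read_sentences(conllu_text):
    """Yield one list of (id, form, head) tuples per sentence."""
    sentence = []
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if not cols[0].isdigit():
            continue
        head = int(cols[6]) if cols[6].isdigit() else None
        sentence.append((int(cols[0]), cols[1], head))
    if sentence:
        yield sentence


def mismatched_brackets(sentence):
    """Return (open_id, close_id) pairs whose two brackets have different heads."""
    stack, mismatches = [], []
    for tok_id, form, head in sentence:
        if form == "(":
            stack.append((tok_id, head))
        elif form == ")" and stack:
            open_id, open_head = stack.pop()
            if open_head != head:
                mismatches.append((open_id, tok_id))
    return mismatches


# Hand-written toy fragment (an assumption, not real parser output):
# "(" is attached to FDA, but ")" is attached to "Administration".
SAMPLE_ROWS = [
    ["1", "Administration", "_", "PROPN", "_", "_", "0", "root", "_", "_"],
    ["2", "(", "_", "PUNCT", "_", "_", "3", "punct", "_", "_"],
    ["3", "FDA", "_", "PROPN", "_", "_", "1", "appos", "_", "_"],
    ["4", ")", "_", "PUNCT", "_", "_", "1", "punct", "_", "_"],
]
SAMPLE = "\n".join("\t".join(row) for row in SAMPLE_ROWS) + "\n\n"

for sent in read_sentences(SAMPLE):
    print(mismatched_brackets(sent))
```

On the toy fragment this reports the pair formed by tokens 2 and 4. A pair with differing heads is only a hint that something may be off, since the guidelines allow exceptions (for example to avoid non-projectivity).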

However, the output in "Show Trees" is exactly the same as in "Show Table" (and as the CoNLL-U in "Output Text"), so there is no bug in UDPipe. These GitHub issues are for reporting bugs in the software. You cannot expect 100% parsing accuracy from all models.

BTW: When using e.g. the english-gum-ud-2.12-230717 model, the brackets enclosing FDA and NDA are attached correctly. This suggests GUM is better training data than EWT in this respect. Indeed, applying ud.FixPunct to en_gum-ud-train.conllu fixes only 39 errors, whereas on en_ewt-ud-train.conllu it fixes 7496. So maybe the authors of EWT should fix these errors, and then the new version of UDPipe will be better. However, that should not be discussed here, but at https://github.com/UniversalDependencies/UD_English-EWT/issues
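One way to obtain counts of this kind is to compare a file before and after running ud.FixPunct and count the punctuation tokens whose HEAD changed. The sketch below is an assumption about how such a count could be made, not necessarily how the numbers above were produced; the function names and the toy data are hypothetical.

```python
def token_rows(conllu_text):
    """Return the column lists of all ordinary token lines in a CoNLL-U text."""
    rows = []
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Keep only plain integer IDs (no multiword ranges, no empty nodes).
        if cols[0].isdigit():
            rows.append(cols)
    return rows


def count_changed_punct_heads(before_text, after_text):
    """Count PUNCT tokens whose HEAD column differs between the two texts."""
    before, after = token_rows(before_text), token_rows(after_text)
    if len(before) != len(after):
        raise ValueError("the two files do not contain the same number of tokens")
    return sum(
        1
        for b, a in zip(before, after)
        if b[3] == "PUNCT" and b[6] != a[6]
    )


def make_conllu(rows):
    """Join column lists into tab-separated CoNLL-U lines (toy helper)."""
    return "\n".join("\t".join(r) for r in rows) + "\n\n"


# Hand-written toy data (an assumption): only the head of ")" differs.
BEFORE = make_conllu([
    ["1", "Administration", "_", "PROPN", "_", "_", "0", "root", "_", "_"],
    ["2", "(", "_", "PUNCT", "_", "_", "3", "punct", "_", "_"],
    ["3", "FDA", "_", "PROPN", "_", "_", "1", "appos", "_", "_"],
    ["4", ")", "_", "PUNCT", "_", "_", "1", "punct", "_", "_"],
])
AFTER = make_conllu([
    ["1", "Administration", "_", "PROPN", "_", "_", "0", "root", "_", "_"],
    ["2", "(", "_", "PUNCT", "_", "_", "3", "punct", "_", "_"],
    ["3", "FDA", "_", "PROPN", "_", "_", "1", "appos", "_", "_"],
    ["4", ")", "_", "PUNCT", "_", "_", "3", "punct", "_", "_"],
])

print(count_changed_punct_heads(BEFORE, AFTER))
```

With real data, the two arguments would be the contents of the input file and of the file written by udapy, read with something like open("in.conllu", encoding="utf-8").read().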

foxik commented 11 months ago

Thanks @martinpopel for your detailed answer :blush:

@Shasetty UDPipe is a statistical tool, so its performance depends both on (a) its ability to train effectively on the UD training data and to generalize correctly to user inputs, and (b) the correctness of the training data. It is expected to make errors, and we cannot easily fix them one by one (so there is little use in reporting them to us); but you can definitely try improving the training data in the repository @martinpopel suggested.