stanfordnlp / CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
http://stanfordnlp.github.io/CoreNLP/
GNU General Public License v3.0

Question: Best practices for converting OntoNotes to UD #1178


boyboytemp commented 3 years ago

What are the current best practices for converting OntoNotes 5.0 to UD format? I didn't find any documentation or issues about this, so sorry if it was already asked. I used this description of the EWT conversion as basic guidance.

There are multiple preprocessors (all shown in the script below):

- `edu.stanford.nlp.parser.tools.OntoNotesFilePreparation`, which splits the list of `.parse` files into `onto.train`, `onto.dev`, and `onto.test`
- `edu.stanford.nlp.trees.treebank.OntoNotesUDUpdater` (optional, enabled via `MK_CRCT`)
- `edu.stanford.nlp.trees.Treebanks -correct -pennPrint` (the PTB corrector)

After that I apply:

- `edu.stanford.nlp.trees.ud.UniversalDependenciesConverter -outputRepresentation enhanced++`
- `edu.stanford.nlp.trees.ud.UniversalDependenciesFeatureAnnotator`

The following fields are filled after that: FORM, LEMMA, UPOSTAG, FEATS, HEAD, DEPREL. I didn't find a tool to add the original sentence text to the final CoNLL-U file, or the information about token spacing. Any clues for these? I found the scripts that were used to add SpaceAfter to EWT, but it seems they cannot be applied to OntoNotes.

Postprocessing:

- `edu.stanford.nlp.trees.ud.UniversalEnhancer` with `-relativePronouns` (see #1132)

Example of a script:

```bash
#!/usr/bin/env bash

# Convert one split (onto.train / onto.dev / onto.test) to CoNLL-U.
convert (){
    local fname="$1"
    local part=${fname#onto.}
    for f in $(<"$fname") ; do
        rm -f onto_fixed temp_tree temp_ud
        # Optionally run the OntoNotes tree fixer first.
        if [ -n "$MK_CRCT" ]; then
            java -cp "$CORENLP_HOME/*" -mx5g edu.stanford.nlp.trees.treebank.OntoNotesUDUpdater \
                "$f" > onto_fixed 2>> "$OUT_DIR"/fixer.log
            f=onto_fixed
        fi
        # PTB corrector.
        java -cp "$CORENLP_HOME/*" -mx5g edu.stanford.nlp.trees.Treebanks \
            -correct -pennPrint "$f" \
            > temp_tree 2>> "$OUT_DIR"/correct.log
        # Constituency trees -> UD (enhanced++).
        java -cp "$CORENLP_HOME/*" -mx5g edu.stanford.nlp.trees.ud.UniversalDependenciesConverter \
            -outputRepresentation enhanced++ -treeFile temp_tree \
            > temp_ud 2>> "$OUT_DIR"/convert-1.log
        # Add morphological features.
        java -cp "$CORENLP_HOME/*" -mx5g edu.stanford.nlp.trees.ud.UniversalDependenciesFeatureAnnotator \
            temp_ud temp_tree \
            >> "$OUT_DIR"/"$part".conllu 2>> "$OUT_DIR"/convert-2.log
    done
    # see https://github.com/stanfordnlp/CoreNLP/issues/1132
    java -cp "$CORENLP_HOME/*" -mx5g edu.stanford.nlp.trees.ud.UniversalEnhancer \
        -conlluFile "$OUT_DIR"/"$part".conllu \
        -relativePronouns "that|which|who|whom|whose|where|That|Which|Who|Whom|Whose|Where" \
        > "$OUT_DIR"/"$part".conllu.enhanced 2> "$OUT_DIR"/enhance.log
    rm "$OUT_DIR"/"$part".conllu && mv "$OUT_DIR"/"$part".conllu.enhanced "$OUT_DIR"/"$part".conllu
}

[ -z "$ONTO_DIR" ] && ONTO_DIR="/path/to/onto"
[ -z "$CORENLP_HOME" ] && CORENLP_HOME="/path/to/corenlp"

OUT_DIR="$1"
if [ -z "$OUT_DIR" ]; then
    echo "Pass out_dir as first argument"
    exit 3
fi
mkdir -p "$OUT_DIR"
# create absolute path
OUT_DIR=$(cd "$1"; pwd)
rm -f "$OUT_DIR"/*.conllu
MK_CRCT="$2"
echo "Convert to $OUT_DIR with MK_CRCT=$MK_CRCT"

pushd "$ONTO_DIR"/data/files/data/english/annotations
find . -name '*.parse' > onto
java -cp "$CORENLP_HOME/*" -mx5g edu.stanford.nlp.parser.tools.OntoNotesFilePreparation onto
convert onto.train
convert onto.dev
convert onto.test
popd
```
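For reference, a hypothetical invocation (the script name and paths are placeholders; any non-empty second argument enables `MK_CRCT`):

```bash
# Hypothetical invocation; adjust paths to your local copies.
export ONTO_DIR=/data/ontonotes-release-5.0
export CORENLP_HOME=/opt/stanford-corenlp

./convert_onto.sh /tmp/onto_ud        # plain conversion
./convert_onto.sh /tmp/onto_ud fix    # also run OntoNotesUDUpdater
```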
AngledLuffa commented 3 years ago

The PTB corrector was only intended for the PTB, not OntoNotes. You could always try diffing the corrected and uncorrected trees to see whether there are any differences, and if so, whether they are beneficial. In some cases the corrected errors may have been universal, and in others they were very specific to mislabeled PTB trees.
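A quick way to run that comparison, as a sketch (the file names are illustrative; the flags are the same ones the script above already uses, and I'm assuming `-pennPrint` also works without `-correct`):

```bash
# Print one OntoNotes file with and without the PTB corrector, then compare.
java -cp "$CORENLP_HOME/*" -mx5g edu.stanford.nlp.trees.Treebanks \
    -pennPrint some_file.parse > plain.trees
java -cp "$CORENLP_HOME/*" -mx5g edu.stanford.nlp.trees.Treebanks \
    -correct -pennPrint some_file.parse > corrected.trees
diff plain.trees corrected.trees
```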

I don't believe there's a way to include any of the useful metadata, such as sentence number, original text, etc. I don't envision being able to extract SpaceAfter in a way that is guaranteed to be correct, since the spacing information was lost when the text was tokenized and turned into trees, but you may be able to get most of the way there with some general heuristics (see the sketch below). Without that, of course, a reconstructed original text annotation would not be reliable either.
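As an illustration of such a heuristic (my sketch, not part of CoreNLP): mark `SpaceAfter=No` on a token whenever the next token is closing punctuation. This recovers only the most common cases and will still be wrong for quotes, contractions, hyphenation, and so on.

```bash
# Heuristic sketch: add SpaceAfter=No before closing punctuation.
# Reads a CoNLL-U file; field 2 is FORM, field 10 is MISC.
awk -F'\t' '
/^#/ || NF == 0 { if (prev != "") { print prev; prev = "" } print; next }
{
    # If the current FORM is closing punctuation, amend the previous token.
    if (prev != "" && $2 ~ /^[].,!?;:%)}]$/) {
        n = split(prev, f, "\t")
        f[10] = (f[10] == "_") ? "SpaceAfter=No" : f[10] "|SpaceAfter=No"
        prev = f[1]
        for (i = 2; i <= n; i++) prev = prev "\t" f[i]
    }
    if (prev != "") print prev
    prev = $0
}
END { if (prev != "") print prev }
' < train.conllu > train.spaces.conllu
```

With SpaceAfter filled in this way, a `# text =` comment could then be reconstructed by concatenating FORMs, but as noted above there is no guarantee it matches the original text.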

It should work fine if you don't provide any embeddings, and equally fine with any embeddings you do provide.

One thing to note is that there have been a ton of updates to the lemmas in the UD EWT dataset. With that in mind, you may want to review some of the lemmas produced by this process before assuming they are correct. Ideally the lemmatizer would have had some of these lemma fixes included, but that hasn't happened yet.
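One low-tech way to do that review (my suggestion, not from this thread): dump the most frequent FORM/LEMMA pairs from the converted data and eyeball them for obvious mistakes.

```bash
# List the 50 most frequent FORM/LEMMA pairs in the converted data.
grep -v '^#' "$OUT_DIR"/train.conllu \
    | awk -F'\t' 'NF >= 3 { print $2 "\t" $3 }' \
    | sort | uniq -c | sort -rn | head -50
```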

boyboytemp commented 3 years ago

Thank you for your response! I saw those great changes in the UD EWT. I guess they were done with some bash scripting and manual checking. We can try to replicate these corrections, but given the size of OntoNotes it could be a bit difficult.

> Ideally the lemmatizer would have had some of these lemma fixes included

It would be awesome!

AngledLuffa commented 2 years ago

I have since updated the lemmatizer to incorporate many of the fixes in EWT, although it is still not 100% the same.