Process multi-word tokens (MWTs)

udapi / udapi-python

Python framework for processing Universal Dependencies data

GNU General Public License v3.0


Process multi-word tokens (MWTs) #37

Closed prokopidis closed 7 years ago

prokopidis commented 7 years ago

Hi,

I am looking for some example udapy code to convert a CoNLL-U file into a version in which certain words (like the Spanish "al" in http://universaldependencies.org/format.html#words-tokens-and-empty-nodes) are converted to their multi-token equivalent.

Is https://github.com/udapi/udapi-python/blob/master/udapi/block/ud/splitunderscoretokens.py a good place to start or is there something more specific?

Thanks

martinpopel commented 7 years ago

I am not sure I understand the task. Spanish "al" is not a word, but a multi-word token containing words "a" and "el". Can you provide a short sample of the input and expected output data in CoNLL-U?

prokopidis commented 7 years ago

Sorry for the confusing terminology. I'm looking for a way to convert something like the following

1 al CASE 2
2 mar OBL 3
3 vamos ROOT 0

into

1-2
1 a CASE 3
2 el DET 3
3 mar OBL 4
4 vamos ROOT 0

martinpopel commented 7 years ago

OK. Do you have a list of all the multi-word tokens (MWTs) you want to process, e.g. al => a el, del => de el? Or do you want to program the rules in Python, e.g.:

# e.g. transmitiéndose -> transmitiéndo se
if node.form.endswith('se') and node.upos == 'VERB':
    self.create_multiword_token(node, ...)

Or do you have some data with correctly annotated MWTs and want a trainable tool which will try to learn the patterns and apply them to new data?

Note that in general this task is not easy because:

In some languages, MWTs (contractions) are not easy to detect (morphological changes, irregularities, ...).
We need to guess the UPOS tags and FEATS of the newly created words.
In your example both words "a" and "el" have the same head word ("mar"), but this is not always the case. Sometimes one of the new words depends on the other, sometimes each new word depends on another "old" word.

prokopidis commented 7 years ago

Programming the rules in Python would be fine for me. My list of MWTs is very limited and they always split into CoNLL-U words with the same head.
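
For a limited list whose parts always share one head, the conversion in the example above can be sketched in plain Python without udapi. This is only an illustration under stated assumptions: `MWT_TABLE` and `split_mwts` are hypothetical names, rows use the simplified four-column form from this thread (ID, FORM, DEPREL, HEAD) rather than full ten-column CoNLL-U, the range row repeats the surface form, and any word that depended on the contracted token is reattached to its first part.

```python
# Hypothetical table: surface form -> list of (word form, deprel) pairs.
# Labels follow the simplified uppercase style used in this thread.
MWT_TABLE = {
    "al": [("a", "CASE"), ("el", "DET")],
    "del": [("de", "CASE"), ("el", "DET")],
}


def split_mwts(tokens, table=MWT_TABLE):
    """Expand listed tokens into a range row plus one row per word.

    tokens: list of (id, form, deprel, head) with 1-based integer ids;
    head 0 marks the root. Returns rows (id_string, form, deprel, head);
    range rows carry None for deprel and head.
    Assumption: all parts of an MWT take the head of the original token.
    """
    # First pass: map each old id to the new id of its (first) word.
    new_id = {0: 0}
    next_id = 1
    for tid, form, _, _ in tokens:
        new_id[tid] = next_id
        next_id += len(table.get(form.lower(), [None]))

    # Second pass: emit rows with renumbered ids and heads.
    rows = []
    for tid, form, deprel, head in tokens:
        parts = table.get(form.lower())
        start = new_id[tid]
        if parts is None:
            rows.append((str(start), form, deprel, new_id[head]))
            continue
        end = start + len(parts) - 1
        rows.append((f"{start}-{end}", form, None, None))
        for offset, (word, word_deprel) in enumerate(parts):
            rows.append((str(start + offset), word, word_deprel, new_id[head]))
    return rows


sentence = [(1, "al", "CASE", 2), (2, "mar", "OBL", 3), (3, "vamos", "ROOT", 0)]
for row in split_mwts(sentence):
    print(*("_" if value is None else value for value in row))
```

On the three-token input from this thread, the numbered word rows match the expected output given there. Real data would also need UPOS, FEATS and the other CoNLL-U columns filled in for the new words.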

martinpopel commented 7 years ago

I've added an example block for adding Czech MWTs, ud.cs.AddMwt, as a subclass of the language-independent base class ud.AddMwt. @prokopidis: Feel free to add a block specific to your language (as a pull request). Ask if anything is unclear, possibly by reopening this issue or opening a new one.

If your list of MWTs is limited, it will probably be enough just to redefine the table. See also a bit of explanation.
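
The idea of a language-independent lookup combined with a language-specific table can be sketched as follows. This is a standalone stand-in, not udapi's code: the class names, the `TABLE` attribute and the dictionary keys are assumptions chosen for illustration, and the real interface should be checked in the ud.AddMwt source. The fallback rule is the "-se" example from earlier in this thread; it will over-split any verb form that merely ends in "se", which is the detection difficulty noted above.

```python
class AddMwtSketch:
    """Language-independent stand-in: subclasses supply TABLE (hypothetical)."""

    # Maps a lowercased surface form to space-separated word forms and tags.
    TABLE = {}

    def multiword_analysis(self, form, upos):
        """Return the analysis dict for a token, or None if it is not an MWT."""
        return self.TABLE.get(form.lower())


class SpanishAddMwtSketch(AddMwtSketch):
    """Language-specific stand-in that only redefines the table and one rule."""

    TABLE = {
        "al": {"form": "a el", "upos": "ADP DET"},
        "del": {"form": "de el", "upos": "ADP DET"},
    }

    def multiword_analysis(self, form, upos):
        # Table entries take priority over the heuristic rule.
        analysis = super().multiword_analysis(form, upos)
        if analysis is not None:
            return analysis
        # Illustrative rule from this thread: a verb ending in "se".
        if upos == "VERB" and len(form) > 2 and form.endswith("se"):
            return {"form": form[:-2] + " se", "upos": "VERB PRON"}
        return None


block = SpanishAddMwtSketch()
for form, upos in [("Al", "ADP"), ("transmitiéndose", "VERB"), ("mar", "NOUN")]:
    print(form, block.multiword_analysis(form, upos))
```

Keeping the table as a class attribute means a new language only has to override the data, and a rule method is needed only where a closed list is not enough.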

A note for anyone who wants a trainable tool: UDPipe can do this; you can run just the tokenizer (without the tagger and parser).