udapi / udapi-python

Python framework for processing Universal Dependencies data
GNU General Public License v3.0

Process multi-word tokens (MWTs) #37

Closed: prokopidis closed this issue 7 years ago

prokopidis commented 7 years ago

Hi,

I am looking for some example udapy code to convert a CoNLL-U file into a version in which certain words (like the Spanish al in http://universaldependencies.org/format.html#words-tokens-and-empty-nodes) are converted to their multi-token equivalent.

Is https://github.com/udapi/udapi-python/blob/master/udapi/block/ud/splitunderscoretokens.py a good place to start or is there something more specific?

Thanks

martinpopel commented 7 years ago

I am not sure I understand the task. Spanish "al" is not a word, but a multi-word token containing words "a" and "el". Can you provide a short sample of the input and expected output data in CoNLL-U?

prokopidis commented 7 years ago

Sorry for the confusing terminology. I'm looking for a way to convert something like the following

1 al  CASE 2 
2 mar OBL 3 
3 vamos ROOT 0

into

1-2 
1 a  CASE 3 
2 el  DET 3 
3 mar OBL 4
4 vamos ROOT 0

martinpopel commented 7 years ago

OK. Do you have a list of all multi-word tokens (MWTs) you want to process? E.g. al => a el, del => de el. Or do you want to program the rules in Python, e.g.

# e.g. transmitiéndose -> transmitiéndo se
if node.form.endswith('se') and node.upos == 'VERB':
    self.create_multiword_token(node, ...)

Or do you have some data with correctly annotated MWTs and want a trainable tool which will try to learn the patterns and apply them to new data?

Note that in general this task is not easy because:

prokopidis commented 7 years ago

Programming the rules in Python would be fine for me. My list of MWTs is very limited and they always split into CoNLL-U words with the same head.

martinpopel commented 7 years ago

I've added an example block for adding Czech MWTs: ud.cs.AddMwt, a subclass of the language-independent base class ud.AddMwt. @prokopidis: Feel free to add a block specific to your language (as a pull request). If anything is unclear, just ask, either by reopening this issue or by opening a new one.

If your list of MWTs is limited, it will probably be enough just to redefine the table. See also a bit of explanation.
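
To illustrate the table-driven idea independently of udapi, here is a minimal pure-Python sketch that reproduces the al => a el sample from earlier in this thread. It does not use ud.AddMwt or any other udapi class; the MWT_TABLE layout, the split_mwts function and the simplified (id, form, deprel, head) rows are all assumptions made for illustration. It also assumes every word of an MWT keeps the head of the original token, as described above, and that any dependent of a split token attaches to its first word.

```python
# Hypothetical table (not udapi's format): surface form -> list of (word form, deprel).
MWT_TABLE = {
    "al": [("a", "CASE"), ("el", "DET")],
    "del": [("de", "CASE"), ("el", "DET")],
}


def split_mwts(tokens, table=MWT_TABLE):
    """Split tokens listed in `table` into multi-word tokens.

    `tokens` is a list of (id, form, deprel, head) with ids 1..n in order.
    Returns rows (id_string, form, deprel, head); the range row of an MWT
    (e.g. "1-2") has deprel and head set to None.
    """
    # Pass 1: map each old id to the new id of its first word (0 stays the root).
    new_id = {0: 0}
    next_id = 1
    for old_id, form, _deprel, _head in tokens:
        new_id[old_id] = next_id
        next_id += len(table.get(form, [None]))

    # Pass 2: emit the rows with renumbered ids and heads.
    rows = []
    for old_id, form, deprel, head in tokens:
        start = new_id[old_id]
        parts = table.get(form)
        if parts is None:
            rows.append((str(start), form, deprel, new_id[head]))
            continue
        end = start + len(parts) - 1
        rows.append((f"{start}-{end}", form, None, None))
        # Assumption: all words of the MWT share the original token's head.
        for offset, (part_form, part_deprel) in enumerate(parts):
            rows.append((str(start + offset), part_form, part_deprel, new_id[head]))
    return rows


if __name__ == "__main__":
    sample = [(1, "al", "CASE", 2), (2, "mar", "OBL", 3), (3, "vamos", "ROOT", 0)]
    for row in split_mwts(sample):
        print(*("_" if value is None else value for value in row))
```

Run on the three-token sample, this yields the range row 1-2 followed by a, el, mar and vamos with heads 3, 3, 4 and 0, matching the expected output given earlier. The lookup is an exact match on the form, so capitalized variants such as "Al" would need their own entries or extra handling.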

A note for anyone who wants a trainable tool: UDPipe can do this; you can run just the tokenizer (without the tagger and parser).