interrogator / corpkit

A toolkit for corpus linguistics

Multilingual support #12

Open interrogator opened 8 years ago

interrogator commented 8 years ago

corpkit is currently oriented toward English, but nothing stops at least some features from being extended to other languages. I should be able to get around to the basics (encodings, as well as multilingual tokenisation) soon.

NetBUG commented 8 years ago

I've collected some tools, in fairly rough form, on the wiki: https://github.com/interrogator/corpkit/wiki/Multilanguage-tokenization-tools

If you think it's useful, I can extend it to cover a certain selection of languages, or create a single big matrix tracking progress for each language by technology (tokenization, stemming, tagging, syntactic parsing).

interrogator commented 8 years ago

Ah, great start. Thanks so much, @NetBUG. All my work is currently in English, so I find it hard to stay up to date on resources for other languages.

I suppose building a list of resources would be useful, as we could just have a column for 'implemented in corpkit?', so that it could also be used as documentation for end-users.

As for how to implement these things in the actual code, I'm a little less certain. In the current version of the GUI, there is a 'Preferences' popup. I was thinking I could start by just adding 'Language' to that, and allow the user to select any language there are dedicated resources for. Then, I suppose we could just have a dict of {language: {'parser': function1, 'lemmatiser': function2}} that specified which tokeniser/lemmatiser/tagger/parser should be called.
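Something like this, purely as a sketch (the English entries reuse NLTK here just for illustration; whatever corpkit actually calls would go in their place):

import nltk
from nltk.stem.wordnet import WordNetLemmatizer

# hypothetical registry: language code -> annotator functions.
# Entries for other languages would be added as wrappers that
# behave exactly like these.
annotators = {
    'en': {'tokeniser': nltk.word_tokenize,
           'tagger': nltk.pos_tag,
           'lemmatiser': WordNetLemmatizer().lemmatize},
}

def get_annotator(lang, job):
    """Pick the tool for a language/job, falling back to English."""
    return annotators.get(lang, annotators['en']).get(job)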

Currently, the data types that can be searched are plain text, tokens, trees and dependencies. A possible issue is that lemmatisation in most languages is not possible without knowing parts of speech, but in corpkit, POS tagging is only done by CoreNLP, which has limited multilingual support.

My idea to solve this is to add dedicated POS tagging (which for English could also be done via CoreNLP). Right now, there is a 'parse' and a 'tokenise' button. Perhaps another needs to be added, 'POS tag', which creates a list of (word, pos-tag) tuples. I imagine a lot of languages now have POS taggers, but not full parsers.
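For English, NLTK already gives exactly that shape, for instance:

import nltk
# requires the 'punkt' and 'averaged_perceptron_tagger' NLTK data

tokens = nltk.word_tokenize("The cats chased imaginary mice.")
tagged = nltk.pos_tag(tokens)
# roughly: [('The', 'DT'), ('cats', 'NNS'), ('chased', 'VBD'),
#           ('imaginary', 'JJ'), ('mice', 'NNS'), ('.', '.')]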

Thoughts?

NetBUG commented 8 years ago

I've tried to make a list of the utilities that can be used. Fortunately, the number of utilities is greater than the number of distinct interfaces they use.

I agree about settings (each language can have its own toolkit). The toolkit should include a tokenizer and a stemmer, and ideally a lemmatizer plus morphological and syntactic parsers. Language should be a feature of a corpus, not of the program environment (to say nothing of bilingual corpora, which are complicated to handle; if I ever need them, I think I'll write a preprocessor to split them in two, plus some viewing tools to find parallel sentences). Anyway, it isn't an urgent task, I think.

Sure, good lemmatization is impossible out of context. While morphological features differ widely between languages (and parsers), they share one key feature: each word has a set of properties. In English, where words aren't declined, it's quite simple (a 2-3 letter POS tag adopted from the Penn Treebank, e.g. cartoons: NNS). In German, for example, the tag will be larger, covering part of speech, immutable properties (gender) and variable ones (case, number).
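For illustration (the property names here are made up, roughly UD-style), the shape stays the same and only the set of keys grows:

# English: little morphology beyond the Penn Treebank tag
{'word': 'cartoons', 'pos': 'NNS'}

# German: same structure, more properties attached
{'word': 'Kindern', 'pos': 'NN',
 'gender': 'Neut', 'case': 'Dat', 'number': 'Plur'}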

Syntactic parsing (dependency trees) varies widely from parser to parser; you can either treat the output as a block of text attached to each sentence, or try to process the trees, brackets and/or whatever else the parser emits. Given your simple and robust approach, I think it's the researcher who must know the parser's output, and corpkit should only provide text search over it.

Do you want me to make a common interface for language-specific utilities? It might turn out to be an epic fail, though, and require a significant redesign for, say, Oriental languages.

interrogator commented 8 years ago

@NetBUG You're right. Language can be a feature of a corpus. Project settings are stored in settings.ini for each project ... this could easily contain a list of (corpus, language) tuples, just as it already stores (corpus, speakers). Upon opening, any language-dependent features are switched to that language via the dict object mentioned in my post above. I think the easiest way would be a popup when the user hits 'Add corpus' --- I can do that. We'd just have to write a bunch of wrappers that make sure any additional parsers, stemmers etc. are called exactly like the existing ones, and give exactly equivalent output.
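Concretely, the settings.ini side might look something like this (section and key names are just placeholders, not what corpkit currently writes):

[Languages]
; one entry per corpus, mirroring how speakers are stored
mycorpus-parsed = en
russian-novels = ru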

Currently, when CoreNLP isn't detected, it's downloaded. We could reuse that code to download the Russian/Estonian stuff. Shouldn't be too tough, now that CoreNLP is downloading alright.

I can see what you mean by the fact that the annotation will simply be a list of words and their properties. A serious issue though is that we don't have a query language for searching this morphologically annotated data. What resources already exist for interrogating text that has been marked up in this way? Is there something we could use out of the box that saves us writing ten search functions for 'match token by pos', 'match pos by token', 'match token by lemma', etc etc? Have you ever used CQL? We could perhaps use that for this morpho data, but retain the existing search types for the English dependencies.
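For anyone unfamiliar, CQL/CQP-style queries express exactly those 'match X by Y' combinations as attribute-value constraints on tokens, e.g. (generic examples only; the attribute names depend on how a corpus is set up, and none of this exists in corpkit):

[lemma="go"]                     # any form of 'go', matched by lemma
[word="walk.*" & tag="V.*"]      # token starting with 'walk', tagged as a verb
[tag="JJ"] [lemma="dog"]         # adjective immediately followed by a form of 'dog'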

Actually, I'm thinking, how about we make a 'multilingual' branch of corpkit to hack away on, as it won't be ready for some time. Also, how are you finding the code? Just let me know what needs more comments and I'll go back and do it!

NetBUG commented 8 years ago

Sure, I've heard about CQL, although that was about 7 years ago, and there have been many changes in how it's processed since. Meanwhile, I see that manatee development has diverged: I used the 2008 release by the Uni of Brno team, while it now seems to be a proprietary development by a British company. I'm just a bit wary of following the syntax of a language that is updated incrementally, without fixed versions and releases (i.e. unlike TMX, where version 1.4 is a standard and you can be sure that any translation memory in that version can be imported, CQL might become incompatible across releases).

interrogator commented 8 years ago

Newer CoreNLP releases have better support for German and French; nothing is integrated into corpkit yet, though.

interrogator commented 7 years ago

@NetBUG A short update on this.

Recently I decided to deprecate searching of any data that is not in a CONLL-U-like format. Now, parsing or tokenising outputs the texts as CONLL-U. The format is lightweight, human-editable, extensible and in use elsewhere. The main advantage is that the same code (corpkit/conll.py) now does all the searching, whether your text is parsed or just tokenised. The only difference is that for unparsed text you can't search parse trees, governors, dependents etc., because they don't exist.
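For anyone following along, standard CONLL-U looks roughly like this (one token per line, tab-separated columns for index, word, lemma, POS tags, features, head, relation and so on; corpkit's columns may differ slightly, and a tokenised-but-unparsed corpus simply leaves the missing fields as underscores):

# sent_id = 1
# text = The cats sleep .
1   The     the     DET     DT    _             2   det     _   _
2   cats    cat     NOUN    NNS   Number=Plur   3   nsubj   _   _
3   sleep   sleep   VERB    VBP   _             0   root    _   _
4   .       .       PUNCT   .     _             3   punct   _   _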

This has multilingual implications. Right now, a person can do:

corpus.tokenise(lang='en')

which will select the English NLTK tokeniser, the English WordNet Lemmatiser and the English NLTK POS tagger. It should be very clear from looking at corpkit/tokenise.py how multilingual annotators could simply be added to the dicts from which annotators are selected:

def plaintext_to_conll(inpath, postag=False, lemmatise=False,
                       lang='en', metadata=False, outpath=False,
                       nltk_data_path=False, speaker_segmentation=False):
    """
    Take a plaintext corpus and sent/word tokenise.

    :param inpath: The corpus to read in
    :param postag: do POS tagging?
    :param lemmatise: do lemmatisation?
    :param lang: choose language for pos/lemmatiser (not implemented yet)
    :param metadata: add metadata to conll (not implemented yet)
    :param outpath: custom name for the resulting corpus
    :param nltk_data_path: custom location for NLTK data, if any
    :param speaker_segmentation: does the corpus have speaker names?
    """

    import nltk
    import shutil
    import pandas as pd
    from corpkit.process import saferead

    fps = get_filepaths(inpath, 'txt')

    # IN THE SECTIONS BELOW, WE COULD ADD MULTILINGUAL
    # ANNOTATORS, PROVIDED THEY BEHAVE AS THE NLTK ONES DO
    tokenisers = {'en': nltk.word_tokenize}
    tokeniser = tokenisers.get(lang, nltk.word_tokenize)

    if lemmatise:
        from nltk.stem.wordnet import WordNetLemmatizer
        lmtzr = WordNetLemmatizer()
        lemmatisers = {'en': lmtzr}
        lemmatiser = lemmatisers.get(lang, lmtzr)

    if postag:
        # nltk.download('averaged_perceptron_tagger')
        postaggers = {'en': nltk.pos_tag}
        tagger = postaggers.get(lang, nltk.pos_tag)

@NetBUG, do you know which functions/methods I'd need to add here to support some of the other languages you've mentioned?