kevinlu1248 / pyate

PYthon Automated Term Extraction
https://kevinlu1248.github.io/pyate/
MIT License

pyATE for other languages? #13

Closed asterbini closed 4 years ago

asterbini commented 4 years ago

It would be very nice if pyATE could be used on texts in other languages. To that end, I have added a static method to the TermExtraction class (though I still need to find a good file of random Italian sentences to serve as the general domain):

    @staticmethod
    def set_language(language: str):
        # Load the spaCy model for `language` and rebuild the matcher on its vocabulary.
        TermExtraction.nlp = spacy.load(language)
        TermExtraction.matcher = Matcher(TermExtraction.nlp.vocab)
        # Swap in the general-domain corpus that matches the language.
        TermExtraction.DEFAULT_GENERAL_DOMAIN = pd.read_csv(
            pkg_resources.resource_stream(__name__, f'default_general_domain.{language}.csv')
        )

kevinlu1248 commented 4 years ago

I will look into it. Thanks for the suggestion.

kevinlu1248 commented 4 years ago

Added in pyate==0.4.0, but no other languages are supported yet, and the TermExtraction.DEFAULT_GENERAL_DOMAIN = pd.read_csv(pkg_resources.resource_stream(__name__, f'default_general_domain.{language}.csv')) line is temporarily commented out. Please let me know if you find good sentences and I will add them. I like the default_general_domain.{language}.csv naming convention and will stick to it for now.
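
For illustration, the naming convention above could be captured in a small helper. This is only a sketch: `general_domain_filename` and `resolve_general_domain` are hypothetical names that are not part of pyATE, and falling back to the English file when a language is missing is an assumption, not documented behavior.

```python
from pathlib import Path


def general_domain_filename(language: str) -> str:
    """Build the corpus file name following default_general_domain.{language}.csv."""
    return f"default_general_domain.{language}.csv"


def resolve_general_domain(directory: Path, language: str, fallback: str = "en") -> Path:
    """Return the corpus path for `language`, or the fallback language's path if it is missing."""
    candidate = directory / general_domain_filename(language)
    if candidate.exists():
        return candidate
    # Assumption: fall back to another language's file rather than raising an error.
    return directory / general_domain_filename(fallback)
```

With this layout, adding a language would only require dropping a new `default_general_domain.{language}.csv` file next to the existing one.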

asterbini commented 4 years ago

Here are 3000 random summaries taken from the Italian Wikipedia. I hope they are good enough, even if they are shorter than the English texts. (Attached: default_general_domain.it.csv.txt)

kevinlu1248 commented 4 years ago

@asterbini it looks like a lot of the lines are in English. Can you double-check?
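
A check like this can be roughly automated. The sketch below is a crude heuristic, not something pyATE provides: the stopword lists are tiny hand-picked assumptions, and `looks_english` and `english_fraction` are hypothetical helper names. It counts common English versus Italian function words in each text and reports the share that lean English.

```python
# Tiny hand-picked stopword lists; assumptions for illustration only.
ITALIAN_STOPWORDS = {"il", "la", "di", "che", "è", "un", "una", "per",
                     "del", "della", "nel", "con", "sono"}
ENGLISH_STOPWORDS = {"the", "of", "and", "is", "to", "was", "for",
                     "with", "that", "are", "by"}


def looks_english(text: str) -> bool:
    """Return True if the text contains more English than Italian stopwords."""
    # Whitespace splitting leaves punctuation attached to words; fine for a rough check.
    words = text.lower().split()
    english_hits = sum(word in ENGLISH_STOPWORDS for word in words)
    italian_hits = sum(word in ITALIAN_STOPWORDS for word in words)
    return english_hits > italian_hits


def english_fraction(texts) -> float:
    """Fraction of texts flagged as English (0.0 for an empty input)."""
    texts = list(texts)
    if not texts:
        return 0.0
    return sum(looks_english(text) for text in texts) / len(texts)
```

A high `english_fraction` over the rows of the CSV would suggest the summaries were fetched from the wrong Wikipedia edition. A proper language-identification library would be more reliable than this word count.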

asterbini commented 4 years ago

Sorry, I forgot to switch to Italian in:

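    import csv
    import wikipedia

    wikipedia.set_lang('it')  # use the Italian Wikipedia for all later calls
    with open('default_general_domain.it.csv', mode='w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f, quoting=csv.QUOTE_NONNUMERIC)
        i = 0
        while i < 3000:
            try:
                # random() returns a random page title; summary() fetches its summary
                summary = wikipedia.summary(wikipedia.random())
            except Exception:
                continue  # skip pages that fail (e.g. disambiguation pages) and retry
            writer.writerow([i, summary])  # csv handles quoting and embedded quotes
            i += 1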

I am rebuilding the file.

asterbini commented 4 years ago

Here it is: default_general_domain.it.csv.txt

kevinlu1248 commented 4 years ago

Added, thanks @asterbini.

isabelafaraujo commented 4 years ago

Any chance of adding Brazilian Portuguese?

kevinlu1248 commented 4 years ago

@isabelafaraujo Yup, instructions can be found at https://github.com/kevinlu1248/pyate#other-languages. Feel free to ask for clarification if anything doesn't make sense.

kevinlu1248 commented 4 years ago

@isabelafaraujo I looked a bit more into Brazilian Portuguese, and it looks like there is only a spaCy model for European Portuguese. I'm not sure how much that will affect performance, since I'm not familiar with the differences between Brazilian and European Portuguese. A quick Google search says the differences are mostly in the spoken language, so I'm guessing it won't make much of a difference.

isabelafaraujo commented 4 years ago

Understood! There are some differences in vocabulary too... It has some effect, but it depends on the application.