asterbini closed this 4 years ago
I will look into it. Thanks for the suggestion.
Added in pyate==0.4.0, but no other languages are supported yet, and the line TermExtraction.DEFAULT_GENERAL_DOMAIN = pd.read_csv(pkg_resources.resource_stream(__name__, f'default_general_domain.{language}.csv')) is temporarily commented out. Please let me know if you find good sentences and I will add them. I like the default_general_domain.{language}.csv naming and I will stick to that for now.
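As a rough illustration of the default_general_domain.{language}.csv naming mentioned above, here is a minimal sketch of how a per-language file could be located before loading it. The helper name general_domain_path and the data_dir parameter are hypothetical and are not part of pyate; only the file-name pattern comes from this thread.

```python
from pathlib import Path

# File-name pattern quoted in the thread; everything else here is illustrative.
FILENAME_TEMPLATE = "default_general_domain.{language}.csv"


def general_domain_path(language: str, data_dir: Path) -> Path:
    """Return the general-domain file for `language`, or raise if it is missing.

    Hypothetical helper: pyate itself may resolve the file differently.
    """
    path = data_dir / FILENAME_TEMPLATE.format(language=language)
    if not path.is_file():
        raise FileNotFoundError(
            f"No general-domain corpus for language {language!r}: {path}"
        )
    return path
```

Failing early with a clear message is one way to signal that a language has no corpus yet, rather than failing later inside the CSV reader.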
Here are 3000 random summaries taken from the Italian Wikipedia. I hope they are good enough, even if they are shorter than the English texts. Attachment: default_general_domain.it.csv.txt
@asterbini it looks like a lot of the lines are English. Can you double check?
Sorry, I forgot to switch to Italian in:
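import wikipedia

# Switch the API to the Italian Wikipedia before fetching anything.
wikipedia.set_lang('it')

with open('default_general_domain.it.csv', mode='w', encoding='utf-8') as F:
    i = 0
    while i < 3000:
        try:
            # Replace double quotes so each summary stays inside one quoted field.
            print(i, ',"', wikipedia.summary(wikipedia.random()).replace('"', "'"), '"', file=F)
            i += 1
        except Exception:
            # Skip pages that fail (e.g. disambiguation pages) and try another one.
            pass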
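A possible variant of the script above, sketched with the standard csv module so that quotes and commas inside a summary are escaped by the writer instead of being replaced by hand. The function name write_summaries and its parameters are hypothetical; the fetching step is passed in as a callable so the sketch does not depend on network access.

```python
import csv
from typing import Callable


def write_summaries(
    path: str,
    fetch_summary: Callable[[], str],
    count: int,
    max_attempts: int = 10_000,
) -> int:
    """Write up to `count` rows of `index,summary` to `path`; return rows written.

    `fetch_summary` is any zero-argument callable returning one summary.
    With the wikipedia package it could be, after wikipedia.set_lang('it'):
        lambda: wikipedia.summary(wikipedia.random())
    """
    written = 0
    attempts = 0
    with open(path, mode="w", encoding="utf-8", newline="") as f:
        # QUOTE_NONNUMERIC quotes the text field and doubles embedded quotes.
        writer = csv.writer(f, quoting=csv.QUOTE_NONNUMERIC)
        while written < count and attempts < max_attempts:
            attempts += 1
            try:
                summary = fetch_summary()
            except Exception:
                # Skip failed fetches and try again, up to max_attempts.
                continue
            writer.writerow([written, summary])
            written += 1
    return written
```

The max_attempts cap is an added safeguard, not something from the original script: it keeps the loop from running forever if every fetch fails.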
I am rebuilding the file
Here it is: default_general_domain.it.csv.txt
Added, thanks @asterbini .
Any chance of adding Brazilian Portuguese?
@isabelafaraujo Yup, instructions can be found at https://github.com/kevinlu1248/pyate#other-languages. Feel free to ask for clarification if anything doesn't make sense.
@isabelafaraujo I looked a bit more into Brazilian Portuguese and it looks like there is only a spaCy model for European Portuguese. I'm not sure how much that will affect the performance, since I'm not familiar with the differences between Brazilian and European Portuguese. A quick Google search says the differences are mostly in the spoken language, so I'm guessing it won't make too big of a difference.
Understood! There are some differences in vocabulary too... it has some effect, but how much depends on the application.
It would be very nice if pyATE could be used on texts in other languages. To this end, I have added a static method to the TermExtraction class (though I still have to find a good file of random Italian sentences to use as the general domain):