koheiw / newsmap

Semi-supervised algorithm for geographical document classification

Add more seed dictionaries #6

Open koheiw opened 6 years ago

koheiw commented 6 years ago

More languages need to be covered.

All localizations should be based on the English master. If you are interested, please see the guideline for translators.

koheiw commented 6 years ago

@danimadrid great job on the Spanish dictionary!

sneetsher commented 6 years ago

Is anyone working on an Arabic dictionary? I would like to help with that.

koheiw commented 6 years ago

Hi @sneetsher, no one is working on Arabic yet. That would be awesome!

sneetsher commented 6 years ago

@koheiw, just to let you know, I have started the translation, but a few things confuse me.

  1. Are the words in the arrays [...] just an unordered keyword list, so that I can add and remove some? For example, a capital may have the same name as its country, or the capital of one country may be spelled the same as the name of another country (the same letters when the words are written without diacritics, which is generally the case).

  2. Does it support two wildcards? In Arabic, many names can have a prefix and a suffix at the same time. For example:

    الجزائريون = ال-جزائري-ون: ال (the) + جزائري (Algerian) + ون (plural "s")

It would help if you could add some instructions as comments in the English dictionary, so that translators better understand the context and how those words are used in the program.

koheiw commented 6 years ago

@sneetsher great to know that you started working on the Arabic dictionary!

The principle is to translate the city and country names in the English master without adding or removing anything, so that all the language versions are comparable. If you think there are missing cities, please open a separate issue so that we can discuss them and update all the languages in a coordinated manner.
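One way to check comparability before opening a pull request is to compare the country codes of a translation against the master. The sketch below is only an illustration: it assumes each dictionary has already been loaded into a flat mapping from country code to a list of names, and the entries shown are hypothetical examples rather than copies of the real files.

```python
# Hypothetical excerpts; the real dictionaries are larger and may be nested.
english_master = {
    "CF": ["Central African Republic", "Bangui"],
    "MN": ["Mongolia", "Ulaanbaatar"],
}
translation = {
    "CF": ["中非共和国", "班吉"],
    "MN": ["蒙古国"],  # one name missing on purpose to show the report
}


def compare_to_master(master, translated):
    """Return codes that are missing, extra, or differ in number of names."""
    missing = sorted(set(master) - set(translated))
    extra = sorted(set(translated) - set(master))
    size_differs = sorted(
        code
        for code in set(master) & set(translated)
        if len(master[code]) != len(translated[code])
    )
    return missing, extra, size_differs


missing, extra, size_differs = compare_to_master(english_master, translation)
print("missing:", missing)
print("extra:", extra)
print("different number of names:", size_differs)
```

A difference in the number of names is not necessarily an error, since a translator may have excluded an ambiguous name on purpose; it only flags entries worth a note.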

If there is an unsolvable ambiguity in Arabic, you should consider excluding some of the names (we have to minimize false positive matches in semi-supervised learning). I trust your judgement, but please leave a note explaining each removal for future reference. I would also like to understand the problems that arise in Arabic dictionaries.

As for wildcards, you can use multiple *. quanteda is optimized for a wildcard at the end, but it still works with one at the beginning or in the middle. However, handling right-to-left languages is new territory for the package, so it would be good to run some tests. I am more than happy to discuss the challenges of text analysis in right-to-left languages with you.
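To illustrate what a pattern with two wildcards does, here is a minimal sketch using Python's standard fnmatch module. It is not quanteda's implementation and only mimics glob-style matching; the word is assembled from the prefix, stem, and suffix given in the question above.

```python
from fnmatch import fnmatchcase

# Pieces taken from the example in the question: "the" + "Algerian" + plural.
prefix, stem, suffix = "ال", "جزائري", "ون"
word = prefix + stem + suffix

trailing_only = stem + "*"      # wildcard at the end only
both_sides = "*" + stem + "*"   # wildcards before and after the stem

# The word starts with the prefix, so a trailing wildcard alone cannot match.
matches_trailing = fnmatchcase(word, trailing_only)
# Wildcards on both sides absorb the prefix and the suffix.
matches_both = fnmatchcase(word, both_sides)

print(matches_trailing, matches_both)
```

Strings are compared in logical (storage) order, so the "end" of a right-to-left word is its last stored character, even though it is displayed on the left.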

Please ask me any questions so that everything is crystal clear. I will then turn the answers into instructions for contributors in the Wiki.

koheiw commented 6 years ago

@sneetsher I wrote a guideline on how to translate the English master. I hope it helps.

sneetsher commented 6 years ago

Yes, that made many aspects clear, thank you. Sorry for not replying earlier: I don't have a steady internet connection, and checking correct spellings on Wikipedia (the same workflow you explained) is taking a lot of work.

By the way, I used the same format as the English dictionary, as I understand it: [country, people, capital, very important cities ..]

I didn't want to upload any partial commits, so I'll put it in a GitHub Gist where you can follow it: https://gist.github.com/sneetsher/d5d5e17c09e84109d4c825b22df2207d

koheiw commented 6 years ago

Yes, "[country, people, capital, very important cities ..]" is the YAML format. I will write about this in the Guideline.

koheiw commented 6 years ago

The Russian dictionary has been added. Thank you @KT01.

chainsawriot commented 5 years ago

If I want to create a traditional Chinese dictionary, should I add the words to 'chinese.yaml' or make a distinction between simplified_chinese.yaml and traditional_chinese.yaml?

koheiw commented 5 years ago

Sounds great! chinese_traditional.yml would be good as its file name. I will rename the existing file to chinese_simplified.yml later. Please try to keep them comparable (functionally equivalent). Looking forward to seeing your PR.

ClaudeGrasland commented 5 years ago

Hi! I think we can create the French dictionary within a reasonable time.

Claude

koheiw commented 5 years ago

@ClaudeGrasland, amazing! Looking forward to seeing your pull request.

ClaudeGrasland commented 5 years ago

I am not very familiar with GitHub and YAML... Can you tell me how I can edit the English dictionary and replace its entries with French words? Thank you in advance! Claude

koheiw commented 5 years ago

A YAML file is a plain text file. Please download the English master and just open it in a text editor.

ClaudeGrasland commented 5 years ago

I discovered two issues:

  1. In the French dictionary, it is better to remove "Hollande" as a keyword for the Netherlands, because it is confused with the former French president François Hollande. Applying the dictionary to French newspapers produces a dramatic number of false positives for the Netherlands.

  2. In the Japanese dictionary, I noticed an unexpectedly high number of news items about Thailand when testing on the newspaper Asahi Shimbun from 2013 to 2019. According to Kohei, this probably does not reflect real media coverage but an ambiguity caused by the wildcard added to the name (タイ*). When the wildcard is removed (タイ), the results seem more consistent with empirical knowledge of the real distribution of countries' salience in international news.
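The effect of the trailing wildcard in the second point can be sketched with Python's standard fnmatch module. This is only an illustration of glob-style matching, not the package's own code, and the token list is a hypothetical example of words a tokenizer might produce, not a sample from the newspaper.

```python
from fnmatch import fnmatchcase

# Hypothetical tokens: Thailand, "title", "timing", and Japan.
tokens = ["タイ", "タイトル", "タイミング", "日本"]

# Pattern with a trailing wildcard, as in the dictionary entry discussed above.
wildcard_hits = [t for t in tokens if fnmatchcase(t, "タイ*")]

# Pattern without the wildcard: only the exact token matches.
exact_hits = [t for t in tokens if t == "タイ"]

print("with wildcard:", wildcard_hits)
print("exact match:", exact_hits)
```

Under these assumptions, the wildcard pattern also picks up unrelated words that merely begin with the same two characters, which is one plausible source of the inflated counts.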

ClaudeGrasland commented 5 years ago

P.S. As I cannot read Japanese, I am not able to solve the issue with Thailand, but I can send a sample of news items to help check where the false positives come from.

koheiw commented 5 years ago

Created a separate issue #28

eladseg commented 5 years ago

I will be working on the Hebrew translation

aseiiss commented 2 years ago

Hello, I found some issues in the simplified Chinese dictionary. I will just list them here.

  1. 'CF': [中非共和国, 中非, 班吉]. The term 中非 is generally used in the context of Sino-African relations rather than as a specific reference to the Central African Republic. The current version captures far too many documents as CF because of this. I think it is better to omit '中非'.
  2. 'MN': [蒙古, 乌兰巴托]. '蒙古' also captures the Inner Mongolia Autonomous Region when the user works with domestic newspapers. The current version captures far too many documents as MN because of this. I believe it is better to use '蒙古国' instead of '蒙古'.

I am a beginner on GitHub, so I am just posting it here. Thanks.
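Both problems can be illustrated with a simple substring check. This is a deliberate simplification: the package matches dictionary entries against tokens, so the real behaviour depends on how the text is segmented. The phrases below are hypothetical examples, not taken from any corpus.

```python
# Hypothetical phrases: Inner Mongolia Autonomous Region, president of Mongolia,
# Forum on China-Africa Cooperation, Bangui as capital of the Central African Republic.
phrases = ["内蒙古自治区", "蒙古国总统", "中非合作论坛", "中非共和国首都班吉"]


def hits(keyword, texts):
    """Return the texts that contain the keyword as a substring."""
    return [t for t in texts if keyword in t]


print("蒙古:", hits("蒙古", phrases))          # also catches Inner Mongolia
print("蒙古国:", hits("蒙古国", phrases))      # only the country
print("中非:", hits("中非", phrases))          # also catches China-Africa
print("中非共和国:", hits("中非共和国", phrases))  # only the country
```

The longer keywords are more specific, which is the trade-off behind the two suggestions above: fewer false positives at the cost of missing texts that use only the short form.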