Open koheiw opened 6 years ago
@danimadrid great job for Spanish!
Is anyone working on an Arabic dictionary? I would like to help with that.
Hi @sneetsher, no one is working on Arabic yet. That would be awesome!
@koheiw, just to let you know, I have started the translation, but a few things confuse me.
Are the words in the arrays [...] just an unordered keyword list, so that I can add and remove some? For example, a capital may have the same name as its country, or the capital of one country may be spelled like the name of another country (the same letters when the words are written without diacritics, which is generally the case).
Does it support two wildcards? In Arabic, many names can take a prefix and a suffix at the same time. For example:
الجزائريون = ال-جزائري-ون, where ال is "the", جزائري is "Algerian", and ون is the plural "s".
It would help if you could add some instructions as comments in the English dictionary, so that translators better understand the context and how those words are used in the program.
@sneetsher great to know that you started working on the Arabic dictionary!
The principle is to translate the city and country names in the English master without adding or removing anything, so that all the language versions are comparable. If you think there are missing cities, please open a separate issue so that we can discuss and update all the languages in a coordinated manner.
If there is an unsolvable ambiguity in Arabic, you should consider excluding some of the names (we have to minimize false positive matches in semi-supervised learning). I trust your judgement, but please leave a note explaining each removal for future reference. I also wish to understand the problems in Arabic dictionaries.
As for wildcards, you can use multiple *. quanteda is optimized for a wildcard at the end, but it still works with one at the beginning or in the middle. However, handling right-to-left languages is new territory for the package, so it would be good to run some tests. I am more than happy to discuss with you the challenges of text analysis in right-to-left languages.
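To make the wildcard question concrete, here is a minimal sketch of glob-style matching with zero, one, and two *. It uses Python's standard fnmatch module purely as an illustration; it is not quanteda's matcher, and the token is the Arabic example from earlier in this thread, assembled from its parts.

```python
from fnmatch import fnmatchcase

# Parts of the Arabic example from this thread (illustration only).
prefix = "ال"      # "the"
stem = "جزائري"    # "Algerian"
suffix = "ون"      # plural ending
token = prefix + stem + suffix


def matches(pattern: str, text: str) -> bool:
    """Case-sensitive glob match where * stands for any run of characters."""
    return fnmatchcase(text, pattern)


exact = matches(stem, token)                 # no wildcard
trailing = matches(stem + "*", token)        # wildcard at the end only
both = matches("*" + stem + "*", token)      # wildcards on both sides

print(exact, trailing, both)
```

In this sketch only the pattern with a wildcard on both sides matches, because the token carries both a prefix and a suffix around the stem. Whether the same holds in quanteda for right-to-left text is exactly what the suggested tests would need to confirm.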
Please ask me any questions so that everything is crystal clear. I will then turn the answers into instructions for contributors in the Wiki.
@sneetsher I wrote a guideline on how to translate the English master. I hope it helps.
Yes, that made many things clear, thank you. Sorry I didn't reply earlier: I don't have a steady internet connection, and I have a lot of work checking Wikipedia (the same workflow you explained) to get the correct spellings.
By the way, I used the same format as the English dictionary, as I understand it: [country, people, capital, very important cities ..]
I didn't want to upload any partial commits, but I'll put it in a GitHub Gist so you can follow it (here: https://gist.github.com/sneetsher/d5d5e17c09e84109d4c825b22df2207d).
Yes, "[country, people, capital, very important cities ..]" is the YAML format. I will write about this in the Guideline.
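As a rough sketch of what "comparable" could mean in practice, the snippet below checks that a translation covers the same entries as the master. Everything in it is an assumption for illustration: the country-code keys, the keywords, and the helper names missing_codes and extra_codes are hypothetical and not copied from the real YAML files.

```python
# Hypothetical miniature dictionaries; each entry follows the
# [country, people, capital, ...] order discussed above.
english = {
    "DZ": ["algeria", "algerian*", "algiers"],
    "NL": ["netherlands", "dutch", "amsterdam"],
}
french = {
    "DZ": ["algérie", "algérien*", "alger"],
    "NL": ["pays-bas", "néerlandais*", "amsterdam"],
}


def missing_codes(master: dict, translation: dict) -> set:
    """Entries present in the master but absent from the translation."""
    return set(master) - set(translation)


def extra_codes(master: dict, translation: dict) -> set:
    """Entries present in the translation but absent from the master."""
    return set(translation) - set(master)


print(missing_codes(english, french), extra_codes(english, french))
```

Both sets are empty when the translation neither adds nor drops entries; a deliberately excluded ambiguous name would not show up here, which is why a written note on each removal is still useful.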
The Russian dictionary has been added. Thank you @KT01.
If I want to create a traditional Chinese dictionary, should I add the words to 'chinese.yaml' or distinguish between simplified_chinese.yaml and traditional_chinese.yaml?
Sounds great! chinese_traditional.yml would be good as its file name. I will rename the existing file to chinese_simplified.yml later. Please try to keep them comparable (functionally equivalent). Looking forward to seeing your PR.
Hi! I think we can create the French dictionary within a reasonable time.
Claude
@ClaudeGrasland, amazing! Looking forward to seeing your pull request.
I am not very familiar with GitHub and YAML... Can you tell me how I can edit the English dictionary and replace its entries with French words? Thank you in advance! Claude
A YAML file is just a text file. Please download the English master and open it in a text editor.
I discovered two issues:
In the French dictionary, it is better to remove "Hollande" as a keyword for the Netherlands, because it is confused with the former French president François Hollande. Applying the dictionary to French newspapers produces a dramatic number of false positives for the Netherlands.
In the Japanese dictionary, I noticed an unexpectedly large number of news items about Thailand when testing on the newspaper Asahi Shimbun from 2013 to 2019. According to Kohei, this probably does not reflect real media coverage but an ambiguity caused by the wildcard added to the name (タイ*). When the wildcard is removed (タイ), the results seem more consistent with empirical knowledge of the real distribution of countries' salience in international news.
P.S. As I cannot read Japanese, I am not able to solve the issue with Thailand, but I can send a sample of news to help check the origin of the false positives.
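One plausible way the trailing wildcard could inflate the Thailand counts is sketched below. The tokens are ordinary Japanese loanwords that happen to start with タイ; they are chosen for illustration and are not taken from the Asahi Shimbun sample. Matching is done with Python's standard fnmatch module, not with quanteda.

```python
from fnmatch import fnmatchcase

# Illustrative tokens only: Thailand, "title", "timing", "type".
tokens = ["タイ", "タイトル", "タイミング", "タイプ"]

# Pattern with a trailing wildcard, as in the dictionary entry under discussion.
with_wildcard = [t for t in tokens if fnmatchcase(t, "タイ*")]

# Same keyword without the wildcard: only the exact token matches.
without_wildcard = [t for t in tokens if fnmatchcase(t, "タイ")]

print(len(with_wildcard), len(without_wildcard))
```

In this sketch the wildcard pattern matches every token, while the exact pattern matches only タイ. Whether this is the actual source of the false positives would still need to be checked against the real sample.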
Created a separate issue #28
I will be working on the Hebrew translation.
Hello, I found some issues in the simplified Chinese dictionary. I will list them here.
There are more languages that need to be covered:
All the localization should be based on the English master. If you are interested, please see the guideline for translators.