themains / dmoz_csv

Convert DMOZ content.rdf.u8.gz into a CSV file

Translate category scheme to English #3

Open soodoku opened 7 years ago

soodoku commented 7 years ago

Lots of category labels are in a language other than English:

www.delphipraxis.net,Top/World/Deutsch/Computer/Programmieren/Sprachen/Delphi
iwamizawach.org,Top/World/Japanese/社会/宗教・精神世界/キリスト教/教団・教派/ペンテコステ・カリスマ派/アッセンブリーズ・オブ・ゴッド/日本アッセンブリーズ・オブ・ゴッド教団/教会/北海道

For non-English categories, one pattern appears to be that the language is in the path: Deutsch, Japanese, etc.
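A minimal sketch of that pattern, assuming non-English categories always look like `Top/World/<Language>/...` (the two examples above fit, but this has not been checked against the full dump). The function name `split_world_path` is hypothetical and not part of the repo:

```python
def split_world_path(path):
    """Split a DMOZ category path into (language, remaining levels).

    Assumption: non-English categories have the form Top/World/<Language>/...
    Any other path is treated as English and gets language None.
    """
    parts = path.split("/")
    if len(parts) >= 3 and parts[0] == "Top" and parts[1] == "World":
        # The third level names the language, e.g. "Deutsch" or "Japanese".
        return parts[2], parts[3:]
    # Not under Top/World: drop the leading "Top" if present.
    rest = parts[1:] if parts and parts[0] == "Top" else parts
    return None, rest
```

For the first example above, this would give the language `Deutsch` and the levels `Computer`, `Programmieren`, `Sprachen`, `Delphi`.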

Perhaps use Google Translate to translate them? One package we could use: https://pypi.python.org/pypi/translate

The final output will have an additional column: cat_labels_english
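A rough sketch of how the extra column could be filled, assuming the translations are already available as a lookup table. The `translations` dict and both function names are hypothetical. How the table gets built (Google Translate or otherwise) is left open here:

```python
def to_english(path, translations):
    """Translate a category path level by level using a lookup table.

    `translations` is a hypothetical dict of {original label: English label}.
    Levels that are missing from it are left unchanged.
    """
    return "/".join(translations.get(level, level) for level in path.split("/"))


def add_english_column(rows, translations):
    """Yield (domain, path, cat_labels_english) for each (domain, path) row."""
    for domain, path in rows:
        yield domain, path, to_english(path, translations)
```

Translating per level rather than per full path means a label that appears in many paths only needs to be translated once.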

suriyan commented 7 years ago

Okay, I will do it. A quick check shows over 200k unique labels under "Top/World/..." that will be non-English.

But it seems Google Translate is limited to just 1,000 words/day?

soodoku commented 7 years ago

Not sure if we have good alternatives. And it seems that Google pricing is reasonable: https://cloud.google.com/translate/v2/pricing

We can run through it one time.

suriyan commented 7 years ago

Actually, the Google Translate API has the following limit (it's not 1,000 words/day):

By splitting each level of the category path and grouping the levels by language, we can get a smaller list of unique words for each language. That comes to about 1.5M characters, so the free quota will probably be enough to translate it all.
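The splitting and grouping could look roughly like this. It is a sketch, assuming the language is always the level right after `Top/World`. The function names are hypothetical:

```python
from collections import defaultdict


def unique_labels_by_language(paths):
    """Collect the set of unique category levels for each language.

    Assumption: non-English paths have the form Top/World/<Language>/...
    Paths outside Top/World are skipped.
    """
    labels = defaultdict(set)
    for path in paths:
        parts = path.split("/")
        if len(parts) > 3 and parts[:2] == ["Top", "World"]:
            # parts[2] is the language; the rest are labels to translate.
            labels[parts[2]].update(p for p in parts[3:] if p)
    return labels


def total_characters(labels):
    """Count characters across all unique labels, to compare with the quota."""
    return sum(len(label) for group in labels.values() for label in group)
```

Because each label is stored once per language, the character count reflects what would actually be sent for translation, not the size of the full CSV.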

suriyan commented 7 years ago

Sorry for the confusion; the Google Translate API is actually not free. The number above is the quota for using the service per day per account.

Fortunately, Google gives $300 in credits for a 60-day free trial of its Cloud services, so we can use those credits.
