Add support for Japanese-English Bilingual Corpus of Wikipedia's Kyoto Articles

red-data-tools / red-datasets

A RubyGem that provides common datasets

MIT License

Add support for Japanese-English Bilingual Corpus of Wikipedia's Kyoto Articles #135

Closed tmatsuura1 closed 2 years ago

tmatsuura1 commented 2 years ago

https://github.com/red-data-tools/red-datasets/issues/57 @kou It is now working, but I am not sure whether this is the right way to implement it. Could you please give me any comments or advice? At the moment, I am assuming that users will use it as in the example.

kou commented 2 years ago

I've pushed a commit that improves the API.

TODO:

- Add tests
- Add support for kyoto_lexicon.csv in the archive
  - We need to consider the API for it:
    1. WikipediaKyotoJapaneseEnglish.new(category: :lexicon) that yields Record = Struct.new(:japanese, :english)?
    2. WikipediaKyotoJapaneseEnglishLexicon.new that yields Record = Struct.new(:japanese, :english)?

tmatsuura1 commented 2 years ago

@kou Thank you so much. I want to try TODOs next weekend. It's difficult to decide which is better, but I think the implementation will be easier to understand with option 2, so I prefer option 2. The structure of the data is also different, so it seems to me that it would be better to separate the classes as well.

kou commented 2 years ago

> I want to try TODOs next weekend.

Great!

I reconsidered the API. How about the following?

- Remove category argument from WikipediaKyotoJapaneseEnglish#initialize
  - WikipediaKyotoJapaneseEnglish always yields articles in all categories
- Add type argument to WikipediaKyotoJapaneseEnglish#initialize
  - Available types: article and lexicon
  - article: Processes */*.xml and yields WikipediaKyotoJapaneseEnglish::Article
  - lexicon: Processes kyoto_lexicon.csv and yields Entry = Struct.new(:japanese, :english)
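
To make the proposed shape concrete, here is a minimal sketch of a class that takes a type argument and yields different record types. It is not the red-datasets implementation: KyotoCorpusSketch, its simplified Article struct, the inline sample rows, the assumed CSV header row, and the :article default are all hypothetical and stand in for the downloaded archive.

```ruby
require "csv"

# Hypothetical stand-in that mirrors the proposed API shape.
# It reads inline sample data instead of the real archive.
class KyotoCorpusSketch
  include Enumerable

  # Lexicon record, as proposed in the discussion.
  Entry = Struct.new(:japanese, :english)
  # Simplified placeholder; a real article would carry more fields.
  Article = Struct.new(:title)

  TYPES = [:article, :lexicon].freeze

  # Made-up rows standing in for kyoto_lexicon.csv.
  # The header row is an assumption about the file layout.
  SAMPLE_LEXICON = <<~LEXICON
    japanese,english
    京都,Kyoto
    寺,temple
  LEXICON

  # The :article default is an assumption, not something the thread decides.
  def initialize(type: :article)
    unless TYPES.include?(type)
      raise ArgumentError, "type must be one of #{TYPES.inspect}: #{type.inspect}"
    end
    @type = type
  end

  # Yields Article or Entry records depending on the type.
  def each(&block)
    return to_enum(__method__) unless block_given?

    case @type
    when :article
      each_article(&block)
    when :lexicon
      each_lexicon(&block)
    end
  end

  private

  # Placeholder for walking */*.xml in the archive.
  def each_article
    ["Sample article"].each { |title| yield Article.new(title) }
  end

  # Parses the CSV text and yields one Entry per data row.
  def each_lexicon
    CSV.parse(SAMPLE_LEXICON, headers: true) do |row|
      yield Entry.new(row[0], row[1])
    end
  end
end

KyotoCorpusSketch.new(type: :lexicon).each do |entry|
  puts "#{entry.japanese} => #{entry.english}"
end
```

Keeping one class with a type switch means callers learn a single entry point, at the cost of each yielding different record types depending on the argument.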

tmatsuura1 commented 2 years ago

Thank you. I think this API is good.

tmatsuura1 commented 2 years ago

@kou Could you review these commits again?

kou commented 2 years ago

We can group tests by sub_test_case like https://github.com/red-data-tools/red-datasets/blob/master/test/test-aozora-bunko.rb#L71 . Can I organize tests?
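
As a rough illustration of that grouping, the sketch below uses test-unit's sub_test_case to split tests into named groups. The class name, the SketchEntry struct, and the sample data are hypothetical and are not taken from the red-datasets test suite.

```ruby
require "test/unit"

# Hypothetical record and sample data, only here to give the tests something to check.
SketchEntry = Struct.new(:japanese, :english)

def sketch_entries
  [SketchEntry.new("京都", "Kyoto"), SketchEntry.new("寺", "temple")]
end

class SketchGroupingTest < Test::Unit::TestCase
  # Each sub_test_case gets its own setup and its own name in the test report.
  sub_test_case("lexicon") do
    def setup
      @entries = sketch_entries
    end

    test("number of entries") do
      assert_equal(2, @entries.size)
    end

    test("first entry") do
      assert_equal(SketchEntry.new("京都", "Kyoto"), @entries.first)
    end
  end

  sub_test_case("invalid") do
    test("Struct rejects extra members") do
      assert_raise(ArgumentError) do
        SketchEntry.new("a", "b", "c")
      end
    end
  end
end
```

Grouping this way keeps per-type setup local to the group instead of sharing one setup across unrelated tests.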

tmatsuura1 commented 2 years ago

Please organize tests.

kou commented 2 years ago

Done. Merged.

Thanks!