sschmidTU / anki-frequency-inserter

Inserts Japanese word frequencies from the InnocentCorpus into your Anki notes/cards.

Alternate Frequency Corpuses - Suggestions welcome #3

Open sschmidTU opened 2 years ago

sschmidTU commented 2 years ago

InnocentCorpus is great, but since it's sourced from ~5000 novels, it's a little small (though mostly sufficient) and specific to books. Alternatives would be sources using news, movies, Wikipedia, Twitter, etc. Ideally, we would be able to use any of these frequencies, by user choice (or insert all of them into different fields).

There's a popular Anime/J-Drama corpus -> research (see next comment). (It ranks words differently, though: 1 = most common word, 2 = 2nd most common word, etc. - the inverse of InnocentCorpus frequency, which is the number of occurrences. This has actually led to confusion in forums.)

It's easy enough to exchange the corpus: right now it's just a JavaScript object (hashmap) called innocent_terms_complete, inserted as a global variable through a <script> tag in index.html. It was generated via tools/parseCorpus.js. So currently frequency = innocent_terms_complete[word] (more or less).
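
A minimal sketch of that lookup, assuming the corpus object maps a word to its occurrence count (the helper name getFrequency is made up for illustration, not actual repo code):

// innocent_terms_complete is loaded globally via a <script> tag,
// e.g. const innocent_terms_complete = { "猫": 12345, "犬": 9876, ... };

// Hypothetical helper, not the actual frequencyInserter.js code:
function getFrequency(word) {
    const frequency = innocent_terms_complete[word]; // undefined if the word isn't in the corpus
    return frequency !== undefined ? frequency : null;
}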

Note for Latin nerds: the Latin plural is corpora, but corpuses is also an allowed plural in English. I love Latin, but English is not Latin. Also, the plural of octopus is octopuses, not octopi, because it's a Greek word, not Latin, the Greek plural being octopodes. But the dictionary is generous and accepts all 3, reflecting common usage.

sschmidTU commented 2 years ago

Frequency Corpuses research (WIP):

This one actually has both the number of occurrences and a frequency ranking ("nth most common word"), which is nice.

patarapolw commented 2 years ago

I've just found another resource on PyPI (toiro).

from toiro import datadownloader, tokenizers

# A list of available corpora in toiro
corpora = datadownloader.available_corpus()
print(corpora)
# => ['livedoor_news_corpus', 'yahoo_movie_reviews', 'amazon_reviews', 'chABSA_dataset']

available_tokenizers = tokenizers.available_tokenizers()
print(available_tokenizers.keys())
# => dict_keys(['nagisa', 'janome', 'mecab-python3', 'sudachipy', 'spacy', 'ginza', 'kytea', 'jumanpp', 'sentencepiece', 'fugashi-ipadic', 'tinysegmenter', 'fugashi-unidic'])

I am not really sure if segmenters are required, but I added them just in case.

sschmidTU commented 2 years ago

BCCWJ corpus is now added (separate URL for now, 5.8MB download): https://sschmidtu.github.io/anki-frequency-inserter/index_BCCWJ.html?expressionFieldName=Expression&frequencyFieldName=FrequencyBCCWJ

Main commit: b64a447bd9ffaccd9b1b8bdebb4ad2d8d0c237ed
Unification commit: 36aba625f0a49901d74f15be57cb2ab0478d983c

This is the Balanced Corpus of Contemporary Written Japanese, which uses relative frequency (100 = 100th most common word) instead of absolute frequency like InnocentCorpus (100 = occurs 100 times in those ~5000 books). (We could also convert InnocentCorpus to relative frequency via code, if users want that.)
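
A rough sketch of such a conversion, assuming the usual { word: occurrence count } shape of innocent_terms_complete (countsToRanks is a made-up name, not existing code):

// Hypothetical conversion from absolute counts ({ word: occurrences })
// to relative frequency ranks ({ word: nth most common word }).
function countsToRanks(termsToCounts) {
    const sorted = Object.entries(termsToCounts)
        .sort((a, b) => b[1] - a[1]); // most frequent first
    const ranks = {};
    sorted.forEach(([word], index) => {
        ranks[word] = index + 1; // 1 = most common word
    });
    return ranks;
}

// e.g. countsToRanks(innocent_terms_complete)[someVeryCommonWord] would be a small number.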

In my experience, the two corpuses have some interesting differences and each is missing some common words, so they supplement each other very well in my Anki cards. I wrote more about it in a post on WaniKani.

@patarapolw tagging you in case you're interested.

Still open for more corpus suggestions! (The ones from PyPI sound interesting, I just haven't gotten around to taking a look yet.)

Currently, the BCCWJ corpus just needs its own index_BCCWJ.html (and corpus file terms_BCCWJ.js); the main code is now unified in frequencyInserter.js and can easily be expanded for more corpuses (roughly as sketched below).
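
As an illustration of that expansion (purely a sketch, not the actual frequencyInserter.js API; corpusTerms, getFrequencyForCorpus, and the bccwj_terms variable name are all assumptions):

// Hypothetical registry mapping a corpus name to its global terms object.
// Each terms file (e.g. terms_BCCWJ.js) is loaded via its own <script> tag.
const corpusTerms = {
    innocent: innocent_terms_complete, // { word: occurrence count }
    bccwj: bccwj_terms,                // assumed variable name; { word: frequency rank }
};

function getFrequencyForCorpus(corpusName, word) {
    const terms = corpusTerms[corpusName];
    return terms ? terms[word] : undefined;
}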

We could unify the HTML into one page as well, but then we'd need radio buttons to choose the corpus, and more importantly we'd need to load the corpus after page load, on user click, so that the user doesn't have to download all corpuses at once (Innocent is ~1.7MB zipped, BCCWJ ~5.8MB). This might take longer for the user, and we'd need to dynamically load the corpus .js, or something along those lines (see the sketch below).
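
Loading a corpus script on demand could look roughly like this (a sketch with assumed file and element names, not existing code):

// Hypothetical on-demand loader: injects a <script> tag for the chosen corpus
// only when the user selects it (e.g. via a radio button).
function loadCorpusScript(corpusName) {
    return new Promise((resolve, reject) => {
        const script = document.createElement('script');
        script.src = 'terms_' + corpusName + '.js'; // e.g. terms_BCCWJ.js (file name assumed)
        script.onload = () => resolve();            // the global terms object is now available
        script.onerror = () => reject(new Error('Could not load corpus: ' + corpusName));
        document.head.appendChild(script);
    });
}

// Usage (radio button id is assumed):
// document.getElementById('corpusRadioBCCWJ').addEventListener('change', () => {
//     loadCorpusScript('BCCWJ').then(() => { /* run the frequency check for BCCWJ */ });
// });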