mortii / anki-morphs

A MorphMan fork rebuilt from the ground up with a focus on simplicity, performance, and a codebase with minimal technical debt.
https://mortii.github.io/anki-morphs/
Mozilla Public License 2.0

Frequency.csv format #108

Closed by HQYang1979 7 months ago

HQYang1979 commented 8 months ago

Do I have to follow exactly the same format?

image

Can it be like this?

image

It's quite difficult to use exactly the same format for all other languages.

I know we can generate our own frequency.csv, but again it is difficult to have a corpus that can be used for generation.

No other language seems to have resources like Japanese does...

Vilhelm-Ian commented 8 months ago

@HQYang1979 it's entirely possible to use spaCy to convert an existing frequency list into the format that anki-morphs requires. Please be patient and wait for the spaCy update.
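
A minimal sketch of what such a conversion could look like, not the add-on's own tooling. It assumes the input is a plain list of words already ordered by frequency, and that the target is a two-column CSV with the header `Morph-base,Morph-inflected` (the header that appears in a generated file further down this thread). The helper names `convert_word_list` and `make_spacy_lemmatizer` are made up for illustration. Lemmatizing single words without context can produce wrong base forms, so the output is worth checking by hand.

```python
import csv
import io
from typing import Callable, Iterable


def convert_word_list(words: Iterable[str], lemmatize: Callable[[str], str]) -> str:
    """Return CSV text with a base-form column and an inflected-form column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Morph-base", "Morph-inflected"])  # assumed header
    seen = set()
    for word in words:
        inflected = word.strip()
        if not inflected:
            continue  # skip blank entries
        row = (lemmatize(inflected), inflected)
        if row in seen:
            continue  # keep only the first (highest-ranked) occurrence
        seen.add(row)
        writer.writerow(row)
    return buffer.getvalue()


def make_spacy_lemmatizer(model_name: str = "en_core_web_sm") -> Callable[[str], str]:
    """Build a lemmatizer backed by spaCy (needs spaCy and the model installed)."""
    import spacy  # imported lazily so the rest of the sketch runs without spaCy

    nlp = spacy.load(model_name)

    def lemmatize(word: str) -> str:
        doc = nlp(word)
        return doc[0].lemma_ if len(doc) else word

    return lemmatize


if __name__ == "__main__":
    # Stand-in lemmatizer so the example runs without spaCy installed.
    toy_lemmas = {"wrote": "write", "written": "write", "writing": "write"}
    words = ["write", "wrote", "written", "writing"]
    print(convert_word_list(words, lambda w: toy_lemmas.get(w, w)))
```

To use spaCy instead of the stand-in dictionary, pass `make_spacy_lemmatizer()` as the second argument.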

mortii commented 8 months ago

I know we can generate our own frequency.csv, but again it is difficult to have a corpus that can be used for generation.

@HQYang1979 Do you mean that it is hard to find a corpus that has .txt files?

HQYang1979 commented 8 months ago

@HQYang1979 it's entirely possible to use spaCy to convert an existing frequency list into the format that anki-morphs requires. Please be patient and wait for the spaCy update.

good to know

HQYang1979 commented 8 months ago

I know we can generate our own frequency.csv, but again it is difficult to have a corpus that can be used for generation.

@HQYang1979 Do you mean that it is hard to find a corpus that has .txt files?

Yes, I cannot find a Spanish or English subtitle corpus like the one you recommend for Japanese.

mortii commented 8 months ago

Yes, I cannot find a Spanish or English subtitle corpus like the one you recommend for Japanese.

Yeah, it's not always very easy. At some point I should probably find good sources and add more frequency files to the guide that can be downloaded.

Hugging Face and other machine learning communities have large collections of text that can be used, e.g. https://huggingface.co/datasets/mc4 has terabytes of text, but it's not very convenient to download and use.

HQYang1979 commented 8 months ago

I guess I have to make my own frequency list based on the Routledge series.

May I ask how the morphemizer reconciles situations like these: image 知る、しる、知ている。。。 Or in English: Write, Wrote, Written, Writing

Can I do something like this:

image

HQYang1979 commented 8 months ago

I deliberately added some words to the top of the frequency list, but these words do not show up first in review. I have no idea why; as far as I can tell, the format is correct. image jp_frequency.csv

And when I use this frequency list for English, I get an error message. image en_frequency.csv

Vilhelm-Ian commented 8 months ago

@HQYang1979 your format is incorrect. The frequency list should contain only two columns: one for the base (lemmatized) form and one for the inflected form. So you should do:
estudiar, estudia
estudiar, estudie

Vilhelm-Ian commented 8 months ago

@HQYang1979 the format for the English frequency list should be correct. Can you share the error you got?

HQYang1979 commented 8 months ago

@HQYang1979 your format is incorrect. The frequency list should contain only two columns: one for the base (lemmatized) form and one for the inflected form. So you should do:
estudiar, estudia
estudiar, estudie

Yes, I wanted to ask if I can have more than two columns.

HQYang1979 commented 8 months ago

@HQYang1979 the format for the English frequency list should be correct. Can you share the error you got?

These are the steps: image

The error happens during "Saving ankimorphs.db...". image

image

Anki 23.12.1 (1a1d4d54) (ao)
Python 3.9.15 Qt 6.6.1 PyQt 6.6.1
Platform: Windows-10-10.0.23606

Traceback (most recent call last):
  File "aqt.taskman", line 142, in _on_closures_pending
  File "aqt.taskman", line 86, in
  File "aqt.taskman", line 106, in wrapped_done
  File "aqt.operations", line 252, in wrapped_done
  File "C:\Users\Administrator\AppData\Roaming\Anki2\addons21\472573498\recalc.py", line 847, in _on_failure
    raise error
  File "concurrent.futures.thread", line 58, in run
  File "aqt.operations", line 242, in wrapped_op
  File "C:\Users\Administrator\AppData\Roaming\Anki2\addons21\472573498\recalc.py", line 78, in _recalc_background_op
    _update_cards_and_notes(am_config)
  File "C:\Users\Administrator\AppData\Roaming\Anki2\addons21\472573498\recalc.py", line 297, in _update_cards_and_notes
    morph_priority: dict[str, int] = _get_morph_priority(am_db, config_filter)
  File "C:\Users\Administrator\AppData\Roaming\Anki2\addons21\472573498\recalc.py", line 512, in _get_morph_priority
    morph_priority = _get_morph_frequency_file_priority(
  File "C:\Users\Administrator\AppData\Roaming\Anki2\addons21\472573498\recalc.py", line 558, in _get_morph_frequency_file_priority
    key = row[0] + row[1]
IndexError: list index out of range

===Add-ons (active)===
(add-on provided name [Add-on folder, installed at, version, is config changed])
AJT Browser Play Button ['182970692', 2023-11-03T10:39, 'None', mod]
AJT Merge Notes ['1425504015', 2023-11-03T10:47, 'None', mod]
Advanced Browser ['874215009', 2023-10-21T22:34, 'None', '']
AnkiConnect ['2055492159', 2023-10-30T01:44, 'None', mod]
Batch Editing ['291119185', 2023-10-26T08:38, 'None', '']
Customize Keyboard Shortcuts ['24411424', 2023-11-01T17:17, 'None', mod]
Edit Field During Review Cloze ['385888438', 2023-11-01T12:53, '6.17', mod]
FSRS4Anki Helper ['759844606', 2023-12-27T15:23, 'None', mod]
Fast Word Query 3 ['1956435337', 2023-11-07T23:48, 'None', '']
Google Translate ['1536291224', 2023-11-02T04:25, 'None', mod]
Incremental Reading v4119 unofficial clone ['999215520', 2023-11-23T01:54, 'None', '']
Migaku Anki Add-on ['1846879528', 2023-12-26T09:08, 'None', mod]
Review Heatmap ['1771074083', 2022-06-30T09:43, 'None', '']
ankimorphs-alpha ['472573498', 2023-12-16T04:35, 'None', mod]

===IDs of active AnkiWeb add-ons=== 1425504015 1536291224 1771074083 182970692 1846879528 1956435337 2055492159 24411424 291119185 385888438 472573498 759844606 874215009 999215520

===Add-ons (inactive)=== (add-on provided name [Add-on folder, installed at, version, is config changed])
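
The traceback above ends at `key = row[0] + row[1]` with an `IndexError`, which suggests that at least one parsed row had fewer than two fields, for example a blank line or a line without a comma. The following is a rough sketch of a pre-check one could run on a frequency file before selecting it. The function name `find_bad_rows` is invented here, and this is not how the add-on itself validates files.

```python
import csv
import io


def find_bad_rows(csv_text: str, expected_columns: int = 2) -> list:
    """Return (line_number, row) pairs for rows without exactly the expected column count."""
    problems = []
    reader = csv.reader(io.StringIO(csv_text))
    for line_number, row in enumerate(reader, start=1):
        # A blank line parses as an empty list, so row[0] would raise IndexError;
        # a line without a comma parses as one field, so row[1] would raise it.
        if len(row) != expected_columns:
            problems.append((line_number, row))
    return problems


if __name__ == "__main__":
    sample = "Morph-base,Morph-inflected\nwrite,wrote\nwritten\n\nwrite,write,extra\n"
    for line_number, row in find_bad_rows(sample):
        print(f"line {line_number}: {row}")
```

Rows with three or more fields would not trigger this particular error, but they are flagged as well since they fall outside the two-column format.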

mortii commented 8 months ago

@HQYang1979 are you using spaCy? If not, then the English frequency file will cause an error, since the content in the two columns should be the same. Only when spaCy is used should the two columns be different.
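
A small sketch of how one might check a file against that rule when no spaCy morphemizer is in use. It assumes the file starts with a header row, and `rows_with_differing_columns` is a hypothetical helper name, not part of anki-morphs.

```python
import csv
import io


def rows_with_differing_columns(csv_text: str, has_header: bool = True) -> list:
    """Return the two-column rows whose base and inflected values are not identical."""
    reader = csv.reader(io.StringIO(csv_text))
    if has_header:
        next(reader, None)  # skip the assumed header row
    return [
        row
        for row in reader
        if len(row) == 2 and row[0].strip() != row[1].strip()
    ]


if __name__ == "__main__":
    sample = "Morph-base,Morph-inflected\nwrite,write\nwrite,wrote\nestudiar,estudia\n"
    print(rows_with_differing_columns(sample))
```

An empty result means every row has identical columns; any rows returned would only make sense with a spaCy morphemizer.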

HQYang1979 commented 8 months ago

@HQYang1979 are you using spaCy? If not, then the English frequency file will cause an error, since the content in the two columns should be the same. It's dumb and unintuitive, I know.

I changed the two columns to be the same, but the same error remains. en_frequency.csv

None of the frequency lists I made for other languages work; I am not sure the Japanese one works either.

I was planning to make my own frequency list by typing the words myself. A list made from one's own corpus is always biased; that's why I tried to base my list on a well-known list that already exists.

mortii commented 8 months ago

@HQYang1979 weird, I'll take a look later.

HQYang1979 commented 8 months ago

@mortii

@HQYang1979 are you using spaCy? If not, then the English frequency file will cause an error, since the content in the two columns should be the same. Only when spaCy is used should the two columns be different.

I am still getting the same error message now using spaCy with two different columns.

mortii commented 8 months ago

  1. I went to this website: https://www.corpusdata.org/formats.asp and downloaded the "Linear text COCA: 8.9 mw" sample corpus.
  2. Using the frequency file generator with the 'spaCy: en_core_web_sm' morphemizer, I generated this file: en-frequency-coca.csv
  3. Selecting that file in the settings for the English deck successfully sorted the cards in the order of that frequency file.

I am still getting the same error message now using spaCy with two different columns.

Are you manually creating these frequency files? If so, then the morphs might be different from what the morphemizer produces and recognizes, and then everything might break.

HQYang1979 commented 8 months ago

Are you manually creating these frequency files? If so, then the morphs might be different from what the morphemizer produces and recognizes, and then everything might break.

Yes, I manually created these frequency files. The reason is to prioritize what I learn; I think it is a more targeted learning method than just generating a list from a corpus.

mortii commented 8 months ago

Yes, I manually created these frequency files. The reason is to prioritize what I learn; I think it is a more targeted learning method than just generating a list from a corpus.

Completely understandable, and it is theoretically possible to do as long as the morphs match what the morphemizer finds and recognizes.

I deliberately added some words to the top of the frequency list, but these words do not show up first in review. I have no idea why; as far as I can tell, the format is correct. image

Reordering should also work fine. Maybe the cards don't show up first because there are other morphs on the cards that give the cards a higher difficulty? If you create cards with just those morphs then they should show up first.

Can I do something like this: image

No, two columns only. You would have to do something like this:

...
write, wrote
write, written
write, write
...

mortii commented 8 months ago

Yes, I manually created these frequency files. The reason is to prioritize what I learn; I think it is a more targeted learning method than just generating a list from a corpus.

@HQYang1979 I thought of a way you could do this relatively easily.

  1. I created a write-conjugations.txt file in a folder called english, that looks like this:
    write
    wrote
    written
    writing
  2. I used the frequency file generator with the morphemizer: spaCy: en_core_web_sm on the english folder and got this frequency file:
    Morph-base,Morph-inflected
    write,writing
    write,write
    write,wrote
    write,written

Now you know the form in which the morphemizer recognizes these words, and you can put them into other frequency files if you want.

You could theoretically do this with all frequency word lists, but that might give you some false positives since the words are not in context, so I'm not sure I recommend it.
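
As a rough illustration of putting generated rows into another frequency file, the sketch below places the rows of one file at the top of another and drops duplicates. It assumes both files share the same header row and that rows nearer the top are treated as higher priority, which is how the files are described in this thread. `prepend_rows` is a made-up helper name.

```python
import csv
import io


def prepend_rows(priority_csv: str, base_csv: str) -> str:
    """Put the priority rows first, then the remaining base rows, without duplicates."""

    def read_rows(text):
        rows = list(csv.reader(io.StringIO(text)))
        return rows[0], rows[1:]  # (header, data rows); assumes a header is present

    header, priority_rows = read_rows(priority_csv)
    _, base_rows = read_rows(base_csv)

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    seen = set()
    for row in priority_rows + base_rows:
        key = tuple(row)
        if not row or key in seen:
            continue  # skip blank lines and rows already written
        seen.add(key)
        writer.writerow(row)
    return buffer.getvalue()


if __name__ == "__main__":
    generated = "Morph-base,Morph-inflected\nwrite,writing\nwrite,wrote\n"
    existing = "Morph-base,Morph-inflected\nthe,the\nwrite,wrote\nbe,is\n"
    print(prepend_rows(generated, existing))
```

Because the rows come from the frequency file generator, they already match what the morphemizer recognizes, which avoids the mismatch problem with hand-typed entries.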

HQYang1979 commented 8 months ago

thank you for your thought, I'll look into it.

mortii commented 7 months ago

@HQYang1979 do frequency files work for you now?

HQYang1979 commented 7 months ago

@HQYang1979 do frequency files work for you now?

I found it inconvenient and stopped using frequency files.

mortii commented 7 months ago

@HQYang1979 Ok, thank you for the feedback 👍
