mortii / anki-morphs

A MorphMan fork rebuilt from the ground up with a focus on simplicity, performance, and a codebase with minimal technical debt.
https://mortii.github.io/anki-morphs/
Mozilla Public License 2.0

Generator error for some of my txt files #232

Closed tanhoaian01 closed 1 month ago

tanhoaian01 commented 1 month ago

Describe the bug

Hello, I am trying to create frequency lists from my Mandarin Companion graded readers, but some of my txt files make the AnkiMorphs generator fail. The failing files were originally epubs that I downloaded online, converted to txt via Calibre, and gave a final formatting pass in Notepad++. The files that work went through the same workflow but were originally PDFs. All of them look fine when I open them in the default Notepad, so I have no idea what causes AnkiMorphs to break. Can you please take a look and point me to a solution? Thanks in advance.

Steps to reproduce the behavior

  1. Download the txt files ⭐here
  2. In Anki, press Ctrl + Shift + G to open the generator window
  3. Select the folder where txt files are stored. Load files.
  4. Select "AnkiMorphs: Chinese" morphemizer
  5. Select .txt format
  6. Select "Ignore names found in name.txt" in preprocess
  7. Click on generate

Expected behavior

The generator should produce a readability report, frequency list, and study plan from all of the txt files.

Screenshots

ankimorh_generator

Debug info

Anki 24.04.1 (ccd9ca1a) (ao)
Python 3.9.18 Qt 6.6.2 PyQt 6.6.1
Platform: Windows-10-10.0.19042

Traceback (most recent call last):
  File "aqt.taskman", line 142, in _on_closures_pending
  File "aqt.taskman", line 86, in
  File "aqt.taskman", line 106, in wrapped_done
  File "aqt.operations", line 252, in wrapped_done
  File "C:\Users\anhph\AppData\Roaming\Anki2\addons21\472573498\generators_window.py", line 266, in _on_failure
    raise error
  File "concurrent.futures.thread", line 58, in run
  File "aqt.operations", line 242, in wrapped_op
  File "C:\Users\anhph\AppData\Roaming\Anki2\addons21\472573498\generators_window.py", line 296, in _background_generate_report
    self._generate_morph_occurrences_by_file()
  File "C:\Users\anhph\AppData\Roaming\Anki2\addons21\472573498\generators_window.py", line 364, in _generate_morph_occurrences_by_file
    generators_text_processing.create_file_morph_occurrences(
  File "C:\Users\anhph\AppData\Roaming\Anki2\addons21\472573498\generators_text_processing.py", line 43, in create_file_morph_occurrences
    for line in file:
  File "codecs", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 0: invalid start byte

===Add-ons (active)===
(add-on provided name [Add-on folder, installed at, version, is config changed])
AJT Browser Play Button ['182970692', 2023-11-03T09:39, 'None', '']
AnkiConnect ['2055492159', 2024-02-27T11:37, 'None', '']
AnkiMorphs ['472573498', 2024-04-16T20:05, 'None', mod]
AnkiRestart - Quick Aniki Rebooter for Customize Develop Created by Shige ['237169833', 2024-02-06T20:21, 'None', mod]
PassFail 2 Remove the Easy and Hard buttons ['876946123', 2023-01-24T08:59, 'None', '']
Review Heatmap ['1771074083', 2022-06-30T08:43, 'None', '']
Yomichan Forvo Server ['580654285', 2023-08-31T03:53, 'None', mod]
add-on dialog searchfilter bar ['561945101', 2023-10-18T23:22, 'None', '']
ankimorphs-chinese-jieba ['1857311956', 2024-03-25T21:52, 'None', '']

===IDs of active AnkiWeb add-ons===
1771074083 182970692 1857311956 2055492159 237169833 472573498 561945101 580654285 876946123

===Add-ons (inactive)===
(add-on provided name [Add-on folder, installed at, version, is config changed])
AJT Merge Notes ['1425504015', 2024-03-18T22:03, 'None', mod]
Advanced Copy Fields Qt6 ['287110490', 2023-11-11T00:52, 'None', '']
Anki Deck Browser Uncapped ['3662229', 2024-03-20T23:38, 'None', '']
Anki Simulator ['817108664', 2023-11-07T00:26, 'None', '']
Anki Word Frequency ['ankiwordfreq', 0, 'None', mod]
Auto Sync ['501542723', 2023-11-20T23:14, 'None', '']
Chinese Prestudy ['882364911', 2024-01-05T12:07, 'None', '']
Chinese Support 3 ['1752008591', 2024-02-25T16:37, 'None', mod]
Color Confirmation ['1084228676', 2024-01-06T16:55, 'None', '']
Deadline2 ['723639202', 2024-02-26T10:38, 'None', mod]
Deadliner Deadline Countdown for Exams ['1560797112', 2024-01-28T22:46, 'None', '']
FSRS4Anki Helper ['759844606', 2024-05-10T21:01, 'None', mod]
Filtered deck from browser selection - study it in browser order ['127393092', 2023-03-20T01:02, 'None', '']
Google Translate ['1536291224', 2024-05-08T21:16, 'None', mod]
Hanzi Stats ['181243826', 2023-12-04T05:34, 'None', mod]
HyperTTS - Add speech to your flashcards ['111623432', 2024-04-16T05:11, 'None', mod]
Image Occlusion Enhanced ['1374772155', 2022-04-09T14:15, 'None', '']
More Decks Stats and Time Left ['1556734708', 2024-04-11T17:35, 'None', '']
Multi Deck Importer Fixed by Shige ['1563006742', 2024-02-17T07:33, 'None', '']
Quizlet to Anki 21 Importer with audio support ['1362209126', 2024-01-16T01:55, 'None', '']
Replay buttons on card ['498789867', 2017-11-20T19:38, 'None', '']
Search Stats Extended ['1613056169', 2024-03-29T06:05, 'None', '']
True Retention ['613684242', 2017-11-20T03:43, 'None', '']
ignoreCase - Insensitive type field ['1371444066', 2023-08-08T03:03, 'None', '']

My setup

xofm31 commented 1 month ago

Looking at the files in the Google Drive, all of them are utf-8 except for the Ransom of the Red Chief, which is reported as ISO-8859. I was able to generate a readability report for all of the other files. Once I downloaded the Red Chief file, I wasn't able to view it as a text file or import it into Pages. I also didn't find a way to read it with Python.
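For anyone who wants to find the affected files without opening each one, here is a minimal sketch that lists the .txt files in a folder that fail strict UTF-8 decoding. The function name `find_non_utf8_files` is made up for illustration and is not part of AnkiMorphs.

```python
from pathlib import Path


def find_non_utf8_files(folder: Path) -> list[Path]:
    """Return the .txt files in `folder` that cannot be decoded as UTF-8.

    Hypothetical helper for illustration; not part of AnkiMorphs.
    """
    bad_files = []
    for path in sorted(folder.glob("*.txt")):
        try:
            # Strict decoding raises UnicodeDecodeError on the first invalid byte.
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad_files.append(path)
    return bad_files
```

Calling `find_non_utf8_files(Path("path/to/your/txt/folder"))` returns the files that would need re-saving before the generator can read them.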

So I think the fix here is to get your files into utf-8, which will mean that AnkiMorphs can read the file, and you'll also be able to open it with a normal editor.

Looking online, it looks like Notepad++ may have an option to "Save As..." and choose "Unicode" under "Encoding". If that doesn't work, here is what worked for me: I made a copy of that file in my Google Drive, right-clicked it, and chose Open with Google Docs, then File > Download > Plain Text (.txt). I was able to view the resulting .txt file, and AnkiMorphs was able to read it and generate the readability report.
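If an editor round trip is inconvenient, the conversion can also be sketched in Python. This is only a sketch under an assumption: the file's real encoding is not known from this thread, and the candidate list below (`gb18030`, `big5`) is a guess based on the text being Chinese. The helper name `convert_to_utf8` is hypothetical.

```python
from pathlib import Path

# Assumed candidates, tried in order. The true source encoding is unknown,
# so this list is a guess for Chinese-language text files.
CANDIDATE_ENCODINGS = ("utf-8", "gb18030", "big5")


def convert_to_utf8(source: Path, target: Path) -> str:
    """Write a UTF-8 copy of `source` to `target`.

    Returns the name of the first candidate encoding that decoded the
    file without errors. Hypothetical helper; not part of AnkiMorphs.
    """
    raw = source.read_bytes()
    for encoding in CANDIDATE_ENCODINGS:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        target.write_text(text, encoding="utf-8")
        return encoding
    raise ValueError(f"None of {CANDIDATE_ENCODINGS} could decode {source}")
```

A decode that succeeds is not proof that the guess was right, so it is worth opening the converted file and checking that the text is readable before loading it into the generator.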

tanhoaian01 commented 1 month ago

Thank you for the help! Making a copy, opening it with Google Docs, and downloading it as plain text fixed the issue for me. Your add-on makes learning Chinese much more interesting. Keep up the great work!

mortii commented 1 month ago

Nice catch @xofm31, the generator assumes utf-8 encoding: https://github.com/mortii/anki-morphs/blob/ff8c10532beb39c07f41c8fb037f712e8c308917/ankimorphs/generators_window.py#L362

I'll add a warning about this in the guide. Thanks for the feedback @tanhoaian01!
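To illustrate what a UTF-8-only reader can do when it meets a file in another encoding, here is a rough sketch that turns the raw `UnicodeDecodeError` into a message naming the file. It is not the actual AnkiMorphs code; the function name `read_utf8_lines` and the message wording are assumptions made for this example.

```python
from pathlib import Path


def read_utf8_lines(path: Path) -> list[str]:
    """Read a text file as UTF-8 and return its lines.

    Hypothetical helper; not the actual AnkiMorphs implementation.
    """
    try:
        with open(path, encoding="utf-8") as file:
            return file.read().splitlines()
    except UnicodeDecodeError as error:
        # Re-raise with the file name so the user knows which file to fix.
        raise ValueError(
            f"{path.name} is not valid UTF-8 ({error.reason}); "
            "re-save it as UTF-8 and try again."
        ) from error
```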

github-actions[bot] commented 1 month ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.