Readability report hangs indefinitely when input is too long

mortii / anki-morphs

A MorphMan fork rebuilt from the ground up with a focus on simplicity, performance, and a codebase with minimal technical debt.

https://mortii.github.io/anki-morphs/

Mozilla Public License 2.0


Readability report hangs indefinitely when input is too long #182

Open saitekii opened 6 months ago

saitekii commented 6 months ago

Describe the bug: The readability report hangs indefinitely when generating a report from custom txt files.

To Reproduce: I used a folder containing many txt files that I created from transcripts. They are all UTF-8 encoded. I also tried a different set of txt files that I created from other transcripts, and that set generates correctly. Are there any special characters that would cause it to hang indefinitely? I am using the Japanese morphemizer for Japanese transcripts.

Desktop (please complete the following information):

OS: Windows 10
Anki Version: 23.12.1
AnkiMorphs Version: 1.3.0

Additional Info: I cannot close the window or Anki without ending the task in the Task Manager.

mortii commented 6 months ago

That sucks.

It might be related to #168: Windows subprocess terminals can sometimes use different encodings, which leads to all sorts of problems.

Could you try to narrow down the problematic file/files and then send those to me for testing?

A binary search approach would be good for this: take half the files and test on those. If it still fails, take half of those again, and repeat until you find out which files are causing the problem.
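The halving procedure described above can be sketched in Python. This is only an illustration, not part of AnkiMorphs: `find_failing_files` and the `fails` callback are hypothetical names, and the sketch assumes a group of files fails exactly when at least one file in it fails on its own. Because the failure here is a hang, a real `fails` check would need some form of timeout.

```python
from typing import Callable, Sequence


def find_failing_files(
    files: Sequence[str],
    fails: Callable[[Sequence[str]], bool],
) -> list[str]:
    """Return the files that trigger the failure, found by repeated halving.

    `fails` is a hypothetical callback: it runs the report on the given
    subset and returns True if the run fails (or times out).
    """
    # An empty or passing group contains no culprit, so skip it entirely.
    if not files or not fails(files):
        return []
    # A failing group of one file is a culprit.
    if len(files) == 1:
        return [files[0]]
    # Otherwise split the failing group in half and search both halves,
    # since more than one file may be problematic.
    middle = len(files) // 2
    return find_failing_files(files[:middle], fails) + find_failing_files(
        files[middle:], fails
    )
```

When only a few files are bad, this needs far fewer test runs than checking every file individually.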

saitekii commented 6 months ago

Here are a couple of files it does not like. Strangely, I can actually scan the first file, but it shows 0 for all of the stats, which it should not. However, when I edit the text and remove the English characters or even delete a newline, I get the hang again. https://drive.google.com/file/d/1LArfsCFo_wqUHkkfeVET5vai7NS-VK9K/view?usp=drive_link https://drive.google.com/file/d/1K5JT2FzajAfA5q1UWj97-aaGabVu4x8P/view?usp=sharing (I'm going to delete these in a bit)

mortii commented 6 months ago

I'm testing it on Ubuntu right now and the 73.txt file works fine for me.

The 39.txt file does show zero for everything, so there is definitely something weird happening.

I'll look into it!

mortii commented 6 months ago

It produces this error in MeCab (the Japanese morphemizer):

line in stdout: b'input-buffer overflow. The line is splitted. use -b #SIZE option.\n'

When you add some line breaks to the text, it works fine.

I think this is actually a reasonable constraint: the buffer has to set a limit somewhere, and the entirety of 39.txt is a single line, which is, indeed, very long.

It should display that error to the user instead of hanging forever though... I'll try to add that.
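One way such a check might look is sketched below. It is not the actual AnkiMorphs fix: `MecabInputTooLongError` and `check_mecab_output_line` are hypothetical names, and the sketch only assumes that MeCab's stdout is read line by line as bytes, as the log line above suggests. It does not address why the hang itself occurs.

```python
# The start of the message MeCab printed in the log above.
MECAB_OVERFLOW_MARKER = b"input-buffer overflow"


class MecabInputTooLongError(Exception):
    """Hypothetical error for a line that exceeded MeCab's input buffer."""


def check_mecab_output_line(line: bytes) -> bytes:
    """Return the line unchanged, or raise if it is MeCab's overflow message."""
    if MECAB_OVERFLOW_MARKER in line:
        raise MecabInputTooLongError(
            "MeCab reported an input-buffer overflow: a line in the input "
            "is too long. Add line breaks to the text and try again."
        )
    return line
```

The caller could catch the exception and show its message in a dialog instead of continuing to wait for output.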

saitekii commented 6 months ago

You are correct. That was the problem.

I made a quick Python script that adds a "\n" after every "。" in all the transcript files in the directory, and now it generates a report without a problem on all of the files. I didn't realize they were all one long line.
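The script itself was not posted in the thread. A minimal sketch of what it might look like follows, assuming UTF-8 encoded `.txt` files in a single folder; the function names are made up for illustration. It overwrites the files in place, so it is safest to run on a copy.

```python
import re
from pathlib import Path


def break_after_full_stops(text: str) -> str:
    """Insert a newline after every "。" that is not already followed by one."""
    return re.sub(r"。(?!\n)", "。\n", text)


def split_transcripts(directory: Path) -> None:
    """Rewrite every .txt file in `directory` with one sentence per line."""
    for path in sorted(directory.glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        path.write_text(break_after_full_stops(text), encoding="utf-8")
```

Calling `split_transcripts(Path("transcripts"))` would process a folder named `transcripts` (a placeholder name). Sentences that end in other punctuation, such as "？" or "！", are left unsplit, matching the description above.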

Thank you! And thanks for making AnkiMorphs!

mortii commented 6 months ago

I'll actually keep this issue open so I don't forget about it. Unsubscribe if you don't want any more notifications related to this :pray: