mortii / anki-morphs

A MorphMan fork rebuilt from the ground up with a focus on simplicity, performance, and a codebase with minimal technical debt.
https://mortii.github.io/anki-morphs/
Mozilla Public License 2.0
58 stars 9 forks source link

Readability Report Generator does not skip a directory with a supported extension #142

Closed aleksejrs closed 8 months ago

aleksejrs commented 8 months ago

Describe the bug A directory with .txt trips RRG up.

To Reproduce Steps to reproduce the behavior:

  1. Create a directory name "directory.txt" in a Readability Analyzer corpus directory.
  2. Run Readability Report Generator on the corpus.
Traceback (most recent call last):
  File "aqt.taskman", line 142, in _on_closures_pending
  File "aqt.taskman", line 86, in <lambda>
  File "aqt.taskman", line 106, in wrapped_done
  File "aqt.operations", line 252, in wrapped_done
  File "…/.local/share/Anki2/addons21/472573498/generator_dialog.py", line 232, in _on_failure
    raise error
  File "concurrent.futures.thread", line 58, in run
  File "aqt.operations", line 242, in wrapped_op
  File "…/.local/share/Anki2/addons21/472573498/readability_report_generator.py", line 147, in _background_generate_report
    with open(input_file, encoding="utf-8") as file:
IsADirectoryError: [Errno 21] Is a directory: '…/fsrs/fsrs4anki.wiki.txt'

Desktop (please complete the following information):

mortii commented 8 months ago

IsADirectoryError: [Errno 21] Is a directory: '…/fsrs/fsrs4anki.wiki.txt'

I haven't considered files that have double extensions like this. I'll look into it.

mortii commented 8 months ago

@aleksejrs I'm not able to reproduce the problem.

With this file structure:

ja-test-report/
    - fsrs/
       -  fsrs4anki.wiki.txt
    - test_ja_1.txt
    - test_ja_2.txt

When I select the ja-test-report directory then I successfully get a report on all three files.

mortii commented 8 months ago

Create a directory name "directory.txt" in a Readability Analyzer corpus directory.

Oh wait, did you give a directory an extension? If that's the case then it's unsupported behavior.

aleksejrs commented 8 months ago

Can't it just check? The directory was a copy of the wiki where I added ".txt" to ".md" files for MorphMan.

mortii commented 8 months ago

Can't it just check?

I mean, in theory it's probably possible. That being said, I have never seen, nor heard of a directory having an extension before, so adding a new check for that would unnecessarily slow down an already slow process imo. Better to just remove the extension.

Btw, that is an amazing QA test, I'll definitely remember it.

aleksejrs commented 8 months ago

It could be something like

try:
    with open(input_file, encoding="utf-8") as file:
        …
except IsADirectoryError:
    continue

So it would not check until the error happens.

mortii commented 8 months ago

@aleksejrs sure, but catching every possible exception that could occur from user errors is not desirable--the code complexity would get unwieldy really fast. In this case I actually think ignoring the error is straight up bad, because here the error would let the user know that they have fundamental problem with their file structure, and then they can fix said structure and prevent future problems that could arise from having malformed directory names.

aleksejrs commented 8 months ago

@mortii If it just skips processing the directory as a file and still processes the files within it, I see no fundamental problem with the file structure. A Windows user would think a question mark is a problem with the file structure (as Anki does when exporting cards). What if my directory was from a Moldovan website and ended in .md?

mortii commented 8 months ago

The directory was a copy of the wiki where I added ".txt" to ".md" files for MorphMan.

@aleksejrs Could you give me the steps you used to copy a wiki that ended up with having a directory with an extension?

aleksejrs commented 8 months ago

@mortii I made a copy of the wiki, renamed the files in it, and renamed the directory to indicate that it contains txt files.

.md or any other TLD in a file name, however, would come from a downloader like httrack or DownThemAll!.

Directories do not have extensions because there is no meaning assigned to an extension in a directory's name, so whatever is at the end of it is just not an extension, but part of a string. Not because there is something wrong with that and it would look like a file (although it would in "ls|less" — that's why nobody uses that to find files to process).

.txt may have been a somewhat silly choice, but if any other dot in a directory file name can trip some program up, am I supposed to never use dots in a directory's name? I use anything possible but a line break or an invisible character or "|" (as long as it is not a tree of configuration files or something). The only programs I wouldn't be surprised to fail to distinguish directories and files are my own quick-and-dirty find commands, not a program that reads files in a directory of arbitrary books and articles.

MorphMan uses os.walk, which yields a 3-tuple (dirpath, dirnames, filenames):

        def accepted_filetype(filename):
            return filename.lower().endswith(('.srt', '.ass', '.txt', '.corpusdb'))

        list_of_files = list()
        for (dirpath, _, filenames) in os.walk(input_path):
            list_of_files += [os.path.join(dirpath, filename) for filename in filenames if accepted_filetype(filename)]

https://docs.python.org/3/library/stdtypes.html#str.endswith

mortii commented 8 months ago

@aleksejrs we only need the file names so os.walk does indeed seem like a better choice than glob. I'll try replacing it.

mortii commented 8 months ago

this now works with file structures like: fsrs.MD/fsrs.txt/fsrs.wiki.txt

next release is probably within a couple of days.

github-actions[bot] commented 7 months ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.