sonjaolkkonen / spell-checker


Peer Review #1

Open TheNushu opened 2 months ago

TheNushu commented 2 months ago

14 June 2024 at 17:50

Wow, the project is impressive, both in quality and quantity. I am not that well versed in web development (basic HTML, CSS, and a little JavaScript), so I hope my feedback won't miss anything important.

I am also learning Finnish, but not yet at an advanced level. I translated the documentation and instructions with Google Translate and ChatGPT 4. I also tested with words I know and with Finnish sentences. When I had problems with the language, I used the two tools mentioned above.

I love the idea of the project and its implementation. Spell checking is functionality we use almost every day without thinking about it that deeply. Diving deeper into the topic reveals the performance challenges involved in parsing such amounts of data.

I remember watching a Veritasium (pop-science YouTube channel) video on the subject and its applications a few months ago, but I couldn't find it for the life of me to share it with you, sorry.

The file structure of the GitHub repository is clear and easy to navigate, and the instructions were easy to follow. It was very convenient to clone and test the project without any issues.

When setting up the project, the following warning appeared: [screenshot]

It did not cause any problems, and from Googling around, it didn't seem to be critical.

There was a bug in spell_checker.py that re-added the suggested words after every run; I suggest a simple fix below that solved most of it. It turned running times from 15s-3min into a fairly consistent average of 10s, with clearer output for the user.

After implementing another change, this time to the trie's get_trie_content function, the computation time for distance-1 words dropped from 10s to an average of 3s. (This change was suggested by ChatGPT 4; it was the only time I asked an LLM for help in this review.)

Manual testing:

In the beginning I was almost exclusively documenting the bug and its behavior across multiple manual tests. You can skip this part and check the bugfix suggestion in the next chapter.

tl;dr: the same suggestion got repeated several times; after n runs, the output had n+1 repetitions of each suggested word.

1st box: Some inputs were taking 3+ min without producing output (also in the 2nd box). But when running it later the same day, the issue did not seem to reproduce, for example for the inputs "koira" or "kodra". I am not sure if the author made some updates in the meantime (the last commits were from the day prior), but great. I spent around 1-2 hours trying to identify the performance issue and reading similar implementations online, but the code itself did not seem to have any obvious performance flaw, which was weird.

It takes a good chunk of time to parse through the whole dictionary of correct words when the tested word doesn't exist there. For example, the word "olemme", which is not in words.txt, took around 20s to produce this output: [screenshot]

(Also, it is great that the output says the word "is not in the dictionary" rather than "is incorrect".)

Words that are in the dictionary but fairly long took a considerable amount of time when the input was at distance 1 from the correct word (2.4 min). "modaalikuitu" is in the dictionary; with input "modaalikuidu": [screenshot]

The same corrected word gets repeated several times (11 times here), which leads me to believe the suggestions accumulate across runs (see the conclusion below).

Input: "mido", intended: "miso" (total time 28s): [screenshot]

Here, each suggestion gets repeated 13 times (26 words / 2 suggestions).

As a conclusion, after each run on a word (not necessarily the same word; the program showed the same behavior when manually testing different words), each suggested word got repeated n+1 times, where n is the number of its appearances in the previous run. Maybe there are some caching issues, or some data structures don't get reinitialized as empty?

The first time we test the code, n=1: each suggested word gets repeated 2 times.

This also keeps counting across uses of the 2nd box. For example, after reaching n = 12, I ran a test on the 2nd box; the 2nd box was unaffected, but when I then did another test on the 1st box, the word got repeated 14 times (following the trend that after every test, no matter the box, n = n + 1).

This test (the one just mentioned) was conducted on the first box only; the 2nd box's own output doesn't seem to be affected by this.

The trend continued unaffected whether the test used a distance-1 insertion or substitution, or whether the 2nd box was used in between tests of box 1.

What is also weird about the performance issue is that the unit tests are fairly fast: they compute the same operations over a significant number of iterations, yet complete in little time.

2nd box: The same performance issue as in box 1 (to be expected, since they use the same logic, and the input consists of more characters overall).

Weirdly, it seems to be more efficient than box 1.

For input "Oho, kokra on velko", intended "Oho, koira on velho", with "kokra" and "velko" each at distance 1, the output was faster but not the desired one. Probably because the program doesn't go through all the possible suggestions for each word? (That would mean a hugely increased time complexity in this case: doing box 1's functionality for each word.) But "on" is not in words.txt (neither is "in"; I wonder why it gets "corrected" then): [screenshot]

Bugfixing and code suggestions

spell_checker.py and words.txt

  1. I have found a "somewhat" decent fix for the bug above, in the suggest function of spell_checker.py. It makes the output consistent, with each word suggested only once per run, and the running time no longer grows with each run.

We should change the following block (the addition is the second line, which checks whether the suggestion is already in the suggestions list):


if distance <= 1:
    suggestions.append(vocabulary_word)
    if one_word:
        return suggestions

Into:

if distance <= 1:
    if vocabulary_word not in suggestions:
        suggestions.append(vocabulary_word)
    if one_word:
        return suggestions
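
As a side note, checking membership in a list is O(n) per lookup; if the suggestions list ever grows large, an auxiliary set would keep it O(1). A minimal sketch of the same fix with a set (the seen variable is my own addition, initialised as an empty set wherever suggestions is created):

if distance <= 1:
    # `seen` is a set kept in sync with `suggestions`
    if vocabulary_word not in seen:
        seen.add(vocabulary_word)
        suggestions.append(vocabulary_word)
    if one_word:
        return suggestions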
  1. In the fix_typos function:

    for word in words:
        if len(word) > 1 and not self.find_word(word.lower()):
            suggestions = self.suggest(word.lower(), True)
            if len(suggestions) == 0:
                corrected_words.append(word)
                able_to_correct = False
            else:
                corrected_words.append(suggestions[0])

    I am guessing that this is for the 2nd box functionality. If you would like to also suggest different text alternatives for multi-word inputs, you could save the suggested versions of each word and present those to the user as well?

If I interpreted it right, suggestions[0] takes the first suggestion from the suggest function, so accessing the others with suggestions[n] shouldn't be too difficult, I would imagine.

Maybe multiple versions of the same corrected text?
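
A rough sketch of that idea, assuming suggest(word, False) returns a ranked list of candidates (the flag usage is taken from the code above; alternative_texts and top_n are hypothetical names of my own):

from itertools import product

def alternative_texts(words, suggest, top_n=2):
    # keep the top few suggestions per word, falling back to the word itself
    options = []
    for word in words:
        candidates = suggest(word.lower(), False)[:top_n] or [word]
        options.append(candidates)
    # every combination of per-word choices, joined back into sentences;
    # the number of combinations grows multiplicatively, so top_n should stay small
    return [" ".join(combo) for combo in product(*options)]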

  1. Some short words such as ei, joo, jo, etc. are not present in words.txt; maybe it would be a good addition to also include a list of common short words, for broader use? Some of them are very common, and, if I am not mistaken, word frequency wasn't a criterion when the source list was compiled. A small sketch of merging such words in is below.
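
A minimal sketch of merging a handful of common short words into words.txt without duplicating existing entries (the word list here is just an example; it assumes the file ends with a newline):

common_short = ["ei", "joo", "jo", "on"]

with open("words.txt", encoding="utf-8") as f:
    existing = {line.strip() for line in f}

with open("words.txt", "a", encoding="utf-8") as f:
    for word in common_short:
        if word not in existing:
            f.write(word + "\n")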

trie.py

  1. The following point was recommended by ChatGPT 4. Since I am not that familiar with the trie data structure, after reading about it on Wikipedia and looking at your code I couldn't think of any way to improve it myself. I asked the LLM whether the implementation is efficient, and it recommended initialising self.content = [] each time get_trie_content is called.

Turning the function into (note the extra line at the beginning):

def get_trie_content(self):
    self.content = []
    self._dfs(self.root, "")
    return self.content

This increased the performance dramatically, and from the few tests I performed afterwards, the outputs seemed to stay the same.

Please bear in mind that I didn't fully test this change and I am not really sure of all its implications, so apply it with care.

The following lines between quotation marks are the "reasoning" that GPT 4 proposed for this change:

"The method get_trie_content() initiates a depth-first search (_dfs) and populates the self.content list. However, this list is stored persistently as part of the Trie object state, which is not reset between calls to get_trie_content(). This can lead to duplicate entries if the method is called multiple times. To fix this, you can initialize self.content at the beginning of get_trie_content() to ensure it only contains the current content of the trie each time it's called."

  1. I am not entirely sure how Flask works, but for load_words it might be a good idea to handle some exceptions, in case the file doesn't exist or there are no permissions to read/write it. A small sketch is below.
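
A minimal sketch of the idea, assuming load_words takes a path and returns a list of words (the real signature may differ):

def load_words(path="words.txt"):
    try:
        with open(path, encoding="utf-8") as f:
            return [line.strip() for line in f if line.strip()]
    except FileNotFoundError:
        print(f"Word list {path} does not exist.")
    except PermissionError:
        print(f"No permission to read {path}.")
    return []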

damerau_levenshtein.py

The distance code seems good, and I couldn't find any issues with it upon examination. I didn't ask ChatGPT 4 to examine this code either, since I didn't want to rely on it too much in this review.
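
For reference, a minimal sketch of the optimal string alignment variant of the Damerau-Levenshtein distance, which I assume is roughly what the file implements (I haven't verified the exact variant):

def dl_distance(a, b):
    # (len(a)+1) x (len(b)+1) matrix of edit distances between prefixes
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost, # substitution
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(dl_distance("kokra", "koira"))  # 1, as in the manual tests above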

Some UI suggestions:

  1. After a word gets its suggestions, the only option in the UI is the "the word was correct? add to list" button.
  1. I like the functionality of adding a word as correct if it is not already in the list. This is useful for making the checker more "tailored" to a specific user.

Some implications I can see:

  1. I wonder if it would be a better idea to keep the boxes on the screen even after running one of the two functionalities; did the author run into any efficiency problems with that approach?

Unit testing

All of the unit tests seem suitable, and I couldn't think of any other tests that would be needed but are missing.

All of the logic is tested and the code passes 100% of the tests.

Coverage also reports a 100% score.

If you want to include the time it takes to compute different functionalities in your unit tests, maybe you can implement something along the lines of this conversation: https://stackoverflow.com/questions/9502516/how-to-know-time-spent-on-each-test-when-using-unittest since unittest doesn't have this natively.

If you use mock objects, I am not sure whether that changes anything, but measuring end time minus start time should still be valid, I would imagine.

This should be useful to ensure that the application doesn't exceed certain running times.
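
A minimal self-contained sketch of that idea using setUp/tearDown hooks (the example test body is a stand-in, not from the project):

import time
import unittest

class TimedTestCase(unittest.TestCase):
    """Prints how long each test takes to run."""

    def setUp(self):
        self._start = time.perf_counter()

    def tearDown(self):
        elapsed = time.perf_counter() - self._start
        print(f"{self.id()}: {elapsed:.3f}s")

class ExampleTest(TimedTestCase):
    def test_placeholder(self):
        # stand-in for a real spell-checker test; an upper bound on the
        # elapsed time could also be asserted inside the test itself
        self.assertEqual(sum(range(1000)), 499500)

if __name__ == "__main__":
    unittest.main()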

Conclusion

Thank you for your project; I have learned a lot through it, and I plan to improve the file structure of my own project to something similar to yours. This was an interesting task, and onnea (good luck)! :D

P.S. I initially formatted this feedback in Word, but some things changed when pasting the content into Markdown, so the formatting of the feedback might not be as good. I apologise for this.

sonjaolkkonen commented 2 months ago

Hi! Thanks for the code review, you had some great points :)

The bug that you found was due to my own error: I had previously initialised self.content = [] in the get_trie_content method, but when I ran pylint it complained that the list was not initialised in the class constructor. After I added it to the constructor, I also, for some reason, deleted it completely from the get_trie_content method :D So that was the reason for the bug, and the code works when you just add the initialisation back, so there's no need to edit the suggest method. Thanks for pointing out the bug!