mdnwvn / librarycard

A dragon-themed discord group reading completion and suggestion tracker with book index summary information retrieval

Simplify Nominations #3

Open mdnwvn opened 6 months ago

mdnwvn commented 6 months ago

New functionality

When a user nominates a book, the bot should respond with a list of similar nominations already found (likely using MongoDB's native full-text search index). The user is then given a choice to:

Current functionality

The onus is on the user to figure out whether a book has already been nominated in the current session, so slight misspellings or different formatting of the title may cause nominations for the same book not to be grouped and counted together.

Grissess commented 4 months ago

Alright, here's the lowdown: doing this is going to be a little challenging :)

As far as my knowledge of Mongo's interface goes, it doesn't appear to have any way of doing this that isn't tantamount to iterating over every entry and computing the edit distance in Python. Pursuant to the migration in #16, I'm also disappointed to note that SQLite's FTS5 extension is only suitable for substring matches, not for edit distance in general.

Now, the recommendation here was to use Mongo's FTS indices, which I imagine are similar to SQLite's FTS. Note that the two are equivalent only up to tokenization. In both cases it's a tool most useful when whole words are absent or, at best, when characters have been deleted from words. It won't help with other kinds of errors (transpositions, accidental insertions), as those ruin the substring contiguity. That's why I'm looking more at something like Levenshtein distance, which is a more natural metric of "minimum number of mistakes required to get here". I figure such a metric would be useful for books whose titles are likely to be misspelled in a particular way. Nevertheless, any attempt to implement this can be defeated by sufficiently motivated users: the space of all strings, even alphabetic ones, is very large.
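To make the edit-distance idea concrete, here is a minimal sketch of the "iterate over every entry" approach. The function names, the `max_distance` threshold, and the case-folding normalization are all assumptions for illustration, not anything in the bot today. Note that plain Levenshtein distance counts a transposition as two edits; a Damerau-style variant would count it as one.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the row we store as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similar_titles(candidate, titles, max_distance=2):
    """Hypothetical helper: linear scan returning (distance, title) pairs.

    Titles are compared case-insensitively; max_distance is an arbitrary
    illustrative cutoff, not a tuned value.
    """
    needle = candidate.casefold().strip()
    scored = ((levenshtein(needle, t.casefold().strip()), t) for t in titles)
    return sorted((d, t) for d, t in scored if d <= max_distance)
```

For a handful of nominations per session, a linear scan like this is probably cheap enough; the cost is proportional to the number of titles times the product of the title lengths.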

Furthermore, the user decision leaves some edge cases. The current nomination system does not allow nominating a book whose title exactly matches one in the library (the intent seemingly being to block nominations of already-read books). The presence of a fuzzy matching metric complicates things: should the library books also be added to the search indices? Should nominations that are "fuzzily close enough" to existing library books also be summarily rejected? If not, is it more acceptable to propose the extant, "close" nomination as a possible correction for an entered title that is, coincidentally, the same as an existing library book?

For now, my proposal is to rely on some operator intervention, made easier by tools designed for the purpose. The most pressing is likely a command to merge nominations, so that operators (who are much better at judging intent) can migrate votes cast for erroneous titles. This does require some engagement, of course, but adjudging such situations takes relatively little effort and is, for the time being, necessary (until algorithms exist which can have a sense of judgement on their own).
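As a rough sketch of what the merge could do, assume nominations are held as a mapping from title to the set of voter IDs. That data model and the name `merge_nominations` are hypothetical; the real storage layer will differ, but the vote migration should look much the same.

```python
def merge_nominations(nominations: dict[str, set[int]],
                      wrong_title: str,
                      right_title: str) -> None:
    """Move every vote from wrong_title onto right_title, in place.

    Assumes one vote per user per title, so a user who voted for both
    spellings is counted once after the merge. Raises KeyError if
    wrong_title has not been nominated.
    """
    if wrong_title == right_title:
        return  # nothing to merge
    voters = nominations.pop(wrong_title)
    nominations.setdefault(right_title, set()).update(voters)


# Usage: an operator folds a misspelled nomination into the intended one.
session = {"The Hobbit": {1, 2}, "The Hobbti": {2, 3}}
merge_nominations(session, "The Hobbti", "The Hobbit")
```

Using sets for voters is a deliberate choice here: it makes the merge idempotent with respect to duplicate votes, which is likely what operators expect when two spellings of one book are combined.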

Any opposed to implementing such a merge functionality as a response to this issue?