Bug: False positives with partial matches

HEmile commented 2 years ago

I'm seeing some false positives in my vault. In this case, 'on' is highlighted with the following replacements:

Here, 'matter' is highlighted

Same with 'zero'

I think this started happening in 1.4.1, and it wasn't like this in 1.4.0.

(btw... apologies for all the issues I'm making in this repo. I think it will be super useful to me and this plugin will be of huge help to many in the community!)

Kolooooo commented 2 years ago

Hi HEmile, I have a question: I downloaded the file from GitHub and added it to the Obsidian plugin library, and when I opened the plugin from Obsidian, it said "Failed to load plugin obsidian-sidekick". I really don't know which file to add to the Obsidian plugin library. Please advise, thank you~

hadynz commented 2 years ago

Thanks for raising this.

> (btw... apologies for all the issues I'm making in this repo. I think it will be super useful to me and this plugin will be of huge help to many in the community!)

No issues at all. You are doing me a great service testing for me as you are.

I've been heads down focusing on supporting the selection of multiple words, single words, and "stemmed" single words in response to #19. This will change the behaviour that you are seeing again. I'll test the scenarios that you've raised to make sure you don't get the false positive examples you shared.

HEmile commented 2 years ago

I see, in that case it might be intentional... But since I have so many notes with long names (paper titles), this gives me a lot of recommendations compared to exact matching (and live preview looks messy because of it).

Ideally, I'd prefer the option to turn this off. Stemming on single-word (or final-word) replacements does sound very useful, though.

hadynz commented 2 years ago

Can you try the latest version that I just released - 1.5.0?

I've made sure that "stop words" are never indexed, so it should resolve the issue that you experienced.

Jinnayah commented 2 years ago

The partial match option is giving me too many matches to be useful in version 1.5.0. Here's an example from a basic note in my vault. The only match that might've had something useful was "principle" (and it didn't, but I can see where it could have). "30" is matching to every entry in my daily notes that happened to be on the 30th of a month.

This might be more useful to me if partial matches were limited. In my vault, for example, if only one or two notes match a partial, that might be useful. If a dozen match, it's just a common word and unlikely to have a meaningful link.

[Image: overactive_sidekick]
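The "limit partial matches" idea above can be sketched roughly as follows. This is only an illustration under assumptions: the names `PartialIndex`, `dropCommonKeywords`, and `maxNotes` are hypothetical and not taken from the plugin, and the index is assumed to map each partial keyword to the titles of the notes it matches.

```typescript
// Hypothetical shape: each partial keyword maps to the titles of the notes it matches.
type PartialIndex = Map<string, string[]>;

// Keep a keyword only if it matches at most `maxNotes` notes.
// A keyword that matches many notes is treated as a common word and dropped.
function dropCommonKeywords(index: PartialIndex, maxNotes: number): PartialIndex {
  const kept: PartialIndex = new Map<string, string[]>();
  index.forEach((notes, keyword) => {
    if (notes.length <= maxNotes) {
      kept.set(keyword, notes);
    }
  });
  return kept;
}
```

With `maxNotes` set to 2, a keyword such as "30" that matches a dozen daily notes would be dropped, while a keyword that matches only one or two notes would still be suggested.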

hadynz commented 2 years ago

Thanks for the feedback @Jinnayah.

It sounds like numbers in the text should never be highlighted and should be treated as stop words. That's something I can do.

Do you have any suggestions for how we can make partial matches more informed? There is feedback that people want it. But how do you think we can identify and surface a relevant partial match?

Alternatively, do you think this becomes less of an issue once #3 is implemented, since you can simply build an ignore list for your vault?

Jinnayah commented 2 years ago

Being able to exclude certain notes wouldn't help me in this case, but being able to add my own stop words would. This might be the easiest and most flexible for most people.
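As a rough sketch of how built-in stop words, numbers, and user-supplied stop words could be filtered before indexing: the names `STOP_WORDS`, `isNumeric`, `indexableTokens`, and `extraStopWords` are hypothetical, the stop-word list is a tiny placeholder, and the ASCII-only tokenizer is a simplification. None of this is taken from the plugin's source.

```typescript
// Tiny placeholder list; a real stop-word list would be much longer.
const STOP_WORDS: ReadonlyArray<string> = ["on", "to", "the", "a", "of", "and"];

// True when a token consists only of digits, e.g. "30".
function isNumeric(token: string): boolean {
  return /^\d+$/.test(token);
}

// Return the tokens of a note title that are worth indexing for partial matches.
// `extraStopWords` stands in for a user-defined, vault-specific list (an assumption).
function indexableTokens(title: string, extraStopWords: string[] = []): string[] {
  const blocked = new Set<string>(STOP_WORDS);
  extraStopWords.forEach((w) => blocked.add(w.toLowerCase()));
  return title
    .toLowerCase()
    .split(/[^a-z0-9]+/) // simplification: splits on anything that is not ASCII alphanumeric
    .filter((t) => t.length > 0 && !blocked.has(t) && !isNumeric(t));
}
```

Passing a word such as "notes" in `extraStopWords` would remove it from every title in the vault without changing the built-in list.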

Another idea might be to have a threshold for matches. For example, here are the matches for 'principle' in the same file. Principle is one word of two for "Purcell Principle" and a journal entry, so there's a good chance those could be a match. It's one word out of 17 on the article about the Copernican Principle and 1 of 14 in the note about W.H.O., so it's less likely to be a match there. A threshold where the partial must match at least X% of the full name would filter out a lot of the false positives. (For my vault, it looks like a threshold around 15% of words would get rid of most false positives while still surfacing the good potential links.)

[Image: Principle_Matches]
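The threshold idea can be worked through with the numbers given above: 1 word of 2 is 50%, 1 of 17 is about 5.9%, and 1 of 14 is about 7.1%, so a 15% cutoff keeps the first and drops the other two. A minimal sketch, assuming titles are split on whitespace; `wordShare`, `passesThreshold`, and `minShare` are hypothetical names, not the plugin's API:

```typescript
// Fraction of the words in `title` that equal `keyword` (case-insensitive).
function wordShare(keyword: string, title: string): number {
  const words = title.toLowerCase().split(/\s+/).filter((w) => w.length > 0);
  if (words.length === 0) {
    return 0;
  }
  const target = keyword.toLowerCase();
  const hits = words.filter((w) => w === target).length;
  return hits / words.length;
}

// A partial match is kept only if the keyword makes up at least `minShare` of the title.
// The 0.15 default mirrors the "around 15%" figure suggested in the comment above.
function passesThreshold(keyword: string, title: string, minShare: number = 0.15): boolean {
  return wordShare(keyword, title) >= minShare;
}
```

Under this sketch, "principle" passes for the two-word title "Purcell Principle" but fails for any title in which it is one word out of 14 or 17.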

BTW, I just noticed that the note itself is being flagged a possible link and probably shouldn't be. This note is named "Cognitive ease principle", and is coming up as an option for that phrase and for "principle".

hadynz commented 2 years ago

> BTW, I just noticed that the note itself is being flagged a possible link and probably shouldn't be. This note is named "Cognitive ease principle", and is coming up as an option for that phrase and for "principle".

Good call. This was a bug. Fixed in 1.5.1.

hadynz commented 2 years ago

> Another idea might be to have a threshold for matches. For example, here are the matches for 'principle' in the same file. Principle is one word of two for "Purcell Principle" and a journal entry, so there's a good chance those could be a match. It's one word out of 17 on the article about the Copernican Principle and 1 of 14 in the note about W.H.O., so it's less likely to be a match there. A threshold where the partial must match at least X% of the full name would filter out a lot of the false positives. (For my vault, it looks like a threshold around 15% of words would get rid of most false positives while still surfacing the good potential links.)

That's not a bad idea. I will implement your suggestion, and we can then test it to see how useful the change is. I'll let you know when the change is made.

HEmile commented 2 years ago

I would honestly be most happy with an option to disable partial matches. My vault really isn't set up in a way that partial matches make sense, since they are mostly (pretty long) paper titles.

HEmile commented 2 years ago

An example: the different replacements for 'generator'. I prefer to have control over this list by explicitly adding aliases, which gives me much more control and far fewer false positives (which are time-consuming to scroll through!).
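The requested opt-out could look something like the sketch below, where exact titles and explicit aliases are always candidates and single words from the title are added only when partial matching is enabled. `MatchSettings`, `enablePartialMatches`, `NoteEntry`, and `candidateKeywords` are hypothetical names chosen for illustration; the plugin's real settings may be structured differently.

```typescript
// Hypothetical settings object with a single toggle for partial matching.
interface MatchSettings {
  enablePartialMatches: boolean;
}

// Hypothetical note record: a title plus the aliases the user added explicitly.
interface NoteEntry {
  title: string;
  aliases: string[];
}

// Build the list of keywords under which a note can be suggested.
function candidateKeywords(note: NoteEntry, settings: MatchSettings): string[] {
  // Exact matching: the full title and every explicit alias.
  const exact = [note.title].concat(note.aliases).map((s) => s.toLowerCase());
  if (!settings.enablePartialMatches) {
    return exact;
  }
  // Partial matching: additionally index each individual word of the title.
  const partial = note.title.toLowerCase().split(/\s+/).filter((w) => w.length > 0);
  // Remove duplicates while preserving order.
  return exact.concat(partial).filter((k, i, all) => all.indexOf(k) === i);
}
```

With the toggle off, a long paper title is only ever suggested for the full title or one of its aliases, which is the behaviour described in the comment above.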

hadynz commented 2 years ago

Damn. That surely drives the point home.

Are any of those suggestions remotely useful for your use case, by any chance?

I've just come across RAKE, which I'm going to trial quickly to see if it is a better solution than what I have at the moment.

HEmile commented 2 years ago

> Are any of those suggestions remotely useful for your use case, by any chance?

Not really. I don't think it shows all the recommendations, since it filled my whole screen. There probably are some relevant recommendations like 'generative models', but I don't see them, probably because the list is ordered alphabetically.

Also, the stemming is rather aggressive: it seems to use 'generalized' for 'generator'.

laurastephsmith commented 2 years ago

Hi, I'll chime in on this conversation rather than start a new one. I've just installed this for the first time, the idea and the way you're approaching it is awesome! The first note I threw at it though gave matches for:

"things"
"really"
"back"
"going"
"to"

They're basically what you might call "filler words" rather than potential keywords. My initial hunch is that I only want to match nouns, or combinations of words that contain nouns. But of course that would mean running the whole thing against a dictionary. Hmm... I'm being the unhelpful person here pointing out a problem without being able to properly define the problem, let alone suggest a workable solution! But I'm here because I think this plugin has SO much potential to be incredible! So I offer my train of thought in that spirit ;)

HEmile commented 2 years ago

@laurastephsmith I created a fork with a setting to disable the rather aggressive stemming. You can install it from here: https://github.com/HEmile/obsidian-sidekick/releases/tag/1.1.0. Hopefully that solves the problem! (It does for me.)

laurastephsmith commented 2 years ago

@HEmile oo thanks, I'll give it a go!

hadynz / obsidian-sidekick

Bug: False positives with partial matches #25