jimmejardine / qiqqa-open-source

The open-sourced version of the award-winning Qiqqa research management tool for Windows
GNU General Public License v3.0
366 stars 60 forks

Expedition producing an "and" theme #334

Open cjpell1 opened 3 years ago

cjpell1 commented 3 years ago

Possibly related to #185

v83.0.7656.6401

I have run Expedition several times on my library of ~2500 documents. Each new run came after adding a couple of new documents, to keep everything up to date, and each run produced themes pretty close to the previous ones, which was to be expected.

At some point (unfortunately I can't narrow it down to an exact document), Expedition started tacking an "and" theme on to nearly every theme, and it also produces an "and" theme all by itself.

I have tried running with and without autotags, and with and without tags, and even blacklisted "and" from autotags just in case (it never showed up as a tag, but I figured it couldn't hurt).

I've also tried adjusting the number of themes to as low as 5 and as high as 50, and "and" still shows up on nearly all the themes. Here are some screenshots, because I realized this might be a bit confusing.

Edit: I just tried running with only autotags and 25 themes, and it did not produce the "and" theme.

[Screenshots: QiqqaCapture2, QiqqaCapture]

GerHobbelt commented 3 years ago

Yes, related to #185.

This one is pretty hairy as Qiqqa does not know about the English language like we do. AFAICT (off the top of my head; haven't reproduced the issue yet), Qiqqa's internal LDA algorithm run can produce "odd" keywords like this one, as the machine doesn't know about "stop words".
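
To make the stop-word point concrete, here is a minimal Python sketch of filtering common English words out of tokenised text before it reaches a topic model. This is not Qiqqa's code (Qiqqa is written in C#); `STOP_WORDS`, `tokenize` and `remove_stop_words` are hypothetical names, and the stop-word list is deliberately tiny.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "of", "to", "and", "a"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop-word set."""
    return [t for t in tokens if t not in stop_words]

# Made-up sample sentence, purely for illustration.
text = "The analysis of topics and the modelling of themes and tags"
raw = tokenize(text)
filtered = remove_stop_words(raw)

# Without filtering, the most frequent tokens are all stop words.
print(Counter(raw).most_common(3))
# After filtering, only content words remain for the topic model to see.
print(Counter(filtered).most_common(3))
```

Without a step like this, words such as "and" are simply the most frequent tokens in almost every document, so a purely statistical model has no reason not to promote them to keywords.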

Note (to self): What certainly IS a BUG (in my opinion), is this one not listening to the blacklist (& whitelist) -- a subtlety there being that Qiqqa doesn't automatically regenerate the autotags when you edit the block/whitelists (for reasons of cost), so any such filter should be applied as a kind of "post/pre-process" to ensure the blacklisted tags don't get discovered via LDA.
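
A rough sketch of what such a post-process filter could look like, in Python rather than Qiqqa's own C#. The function name `filter_theme_keywords` and the sample themes are hypothetical; the assumption is only that each theme can be treated as a list of keyword strings and that the blacklist is matched case-insensitively.

```python
def filter_theme_keywords(keywords, blacklist=()):
    """Return the keywords with blacklisted entries removed (case-insensitive)."""
    blocked = {word.lower() for word in blacklist}
    return [kw for kw in keywords if kw.lower() not in blocked]

# Made-up themes for illustration, each one a list of keywords.
themes = [["and"], ["graphene", "and", "oxide"], ["And", "neural", "network"]]

# Apply the blacklist at display time, so no autotag regeneration is needed.
cleaned = [filter_theme_keywords(theme, blacklist=["and"]) for theme in themes]

# Drop themes that are left with no keywords at all.
cleaned = [theme for theme in cleaned if theme]

print(cleaned)  # [['graphene', 'oxide'], ['neural', 'network']]
```

Filtering at this stage is cheap because it only touches the finished theme lists. The trade-off is that the blacklisted word still took part in the LDA run itself, so a pre-process filter on the input would be the more thorough (and more expensive) variant.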

jpmorr commented 2 years ago

I also have this problem on a library with about 2300 documents. The vast majority of the results are "the", "of", "to", "and", "a". See the attached image. I'm using v83.0.7656.6401.

[Attached image]

The auto-tag results are actually not bad, despite not being able to merge common ones - "3d" and "3D" are the same thing for me (and most people), but I can't merge the tags into a single one.
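
For illustration, merging tags that differ only by case could be sketched like this in Python. This is not an existing Qiqqa feature; `merge_tags_case_insensitive` and the sample data are hypothetical, and keeping the first spelling encountered as the canonical one is an arbitrary choice.

```python
def merge_tags_case_insensitive(doc_tags):
    """Merge tags that differ only by case.

    doc_tags maps a document id to its list of tags. The first spelling
    encountered for a tag becomes the canonical one (an arbitrary choice).
    """
    canonical = {}  # case-folded tag -> canonical spelling
    merged = {}     # document id -> de-duplicated list of canonical tags
    for doc_id, tags in doc_tags.items():
        doc_merged = merged.setdefault(doc_id, [])
        for tag in tags:
            name = canonical.setdefault(tag.casefold(), tag)
            if name not in doc_merged:
                doc_merged.append(name)
    return merged

# Made-up sample data: two documents tagged with differently-cased variants.
docs = {"doc1": ["3d", "printing"], "doc2": ["3D", "Printing", "3d"]}
print(merge_tags_case_insensitive(docs))
# {'doc1': ['3d', 'printing'], 'doc2': ['3d', 'printing']}
```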