jimmejardine / qiqqa-open-source

The open-sourced version of the award-winning Qiqqa research management tool for Windows
GNU General Public License v3.0
380 stars, 64 forks

Bug/feature: The "keywords" should not be based on frequency, but their importance in the narrative. #189

Open raindropsfromsky opened 4 years ago

raindropsfromsky commented 4 years ago

In any document, the entire narrative hinges on a few keywords. However, even though the narrative is about these keywords, they do not necessarily appear in the text at a high frequency.

Therefore, it is wrong to assume that all the words with the highest count are automatically the keywords in the document.

For example, here is the keyword list for the 721-page compendium.

[image: keyword list]

We already know that the entire document is about EC, EIA and Notifications, and that the agencies are called MoEF, SEAC and SEIAA. Listing those words as keywords therefore serves no purpose: they occur throughout the document, and the software adds no value by highlighting every occurrence.

Desired behavior: It is OK to start with a few words with the highest count, but let the user edit the list and specify their own keywords.

Alternatively, display the word cloud separately, and let the user define the keywords manually. (It is difficult to guess the keywords in any article using text analysis.)
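To make the request concrete, here is a minimal Python sketch of the desired behavior: seed the list with the highest-count words, then let the user remove or add entries. This is only an illustration, not Qiqqa's actual code; the function names `candidate_keywords` and `apply_user_edits` and their parameters are hypothetical.

```python
import re
from collections import Counter


def candidate_keywords(text, top_n=10, min_length=2):
    """Return the top_n most frequent words as a starting keyword list.

    Hypothetical helper: plain frequency counting, case-insensitive.
    """
    words = re.findall(r"[A-Za-z][A-Za-z0-9-]*", text)
    counts = Counter(w.lower() for w in words if len(w) >= min_length)
    return [word for word, _ in counts.most_common(top_n)]


def apply_user_edits(candidates, removed=(), added=()):
    """Apply the user's manual edits to the frequency-based candidates.

    Words in `removed` are dropped; words in `added` are appended
    if they are not already in the list.
    """
    removed_set = {w.lower() for w in removed}
    result = [w for w in candidates if w not in removed_set]
    for word in added:
        word = word.lower()
        if word not in result:
            result.append(word)
    return result


if __name__ == "__main__":
    sample = "EIA EIA EIA MoEF MoEF clearance"
    start = candidate_keywords(sample, top_n=2)
    # The user drops a term that occurs everywhere and adds their own keyword.
    print(apply_user_edits(start, removed=["EIA"], added=["clearance"]))
```

The point of the split is that the frequency count only proposes candidates; the final keyword list is whatever the user leaves in it.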

GerHobbelt commented 4 years ago

Filing this for later; I'm getting a bit overwhelmed here, so the primary focus will be the PDF/OCR process. Then there's the big fat trouble with Google Scholar in the sniffer, which, to me, is priority number 2.