Generate auto-tag by analyzing the headings and body-text also (not just the titles)

@raindropsfromsky commented 3 hours ago • (was filed as https://github.com/GerHobbelt/qiqqa-open-source/issues/13)

According to the manual, the autotag feature works by analyzing the titles of the articles.

This works well with a large collection of research papers (where each research paper has only a few pages).

However, please also consider the legal research field, especially in countries where common law is practiced. In these countries, lessons are also drawn from previous court judgments.

Thus, to prepare for a case, the researcher has to go through the following and find potentially relevant material:

1. Scanned documents obtained from the respondents (typically government agencies)
2. A large number of statutes (acts, rules, policies, govt notifications, previous court judgments, etc.)

(The first set is case-related, while the second set is a permanent reference.)

We cannot rely on the titles of the documents alone: we have to dig deep and find matching keywords or key phrases.

In other words, Qiqqa must do deep text mining of the headings and even the body text, and find the phrases and keywords that set #1 and set #2 have in common.
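To illustrate what "common between the two sets" could mean, here is a minimal sketch in Python. It is not Qiqqa code: the function names `ngrams` and `shared_phrases` are hypothetical, the documents are assumed to be already available as plain text (for example after OCR), and a "phrase" is simply taken to be a run of `n` consecutive words.

```python
import re


def ngrams(text, n):
    """Return the set of lowercase word n-grams found in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_phrases(case_docs, reference_docs, n=2):
    """Return the n-word phrases that occur in at least one document of each set."""
    case_phrases = set().union(*(ngrams(doc, n) for doc in case_docs))
    reference_phrases = set().union(*(ngrams(doc, n) for doc in reference_docs))
    return case_phrases & reference_phrases


if __name__ == "__main__":
    # Hypothetical sample texts, for illustration only.
    set_1 = ["Application for environment clearance of the mining project"]
    set_2 = ["Rules on environment clearance", "Motor vehicle insurance act"]
    print(shared_phrases(set_1, set_2))
```

On real documents this would probably also need a stop-word filter, otherwise common pairs such as "of the" would show up as shared phrases.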

For example, take a case related to Environment Clearance. Here, Qiqqa must find all statutes that include the phrase "environment clearance". It should rank the documents by how often the phrase occurs, to find the most relevant references.
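The ranking I have in mind could look roughly like the sketch below. It is only an illustration in Python, not how Qiqqa works today: `normalize` and `rank_by_phrase` are hypothetical names, and the documents are assumed to be a simple mapping from a document name to its extracted text.

```python
import re


def normalize(text):
    """Lowercase the text and collapse punctuation and whitespace to single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def rank_by_phrase(documents, phrase):
    """Return (name, count) pairs for documents containing the phrase, most frequent first."""
    pattern = re.compile(r"\b" + re.escape(normalize(phrase)) + r"\b")
    hits = []
    for name, text in documents.items():
        count = len(pattern.findall(normalize(text)))
        if count > 0:
            hits.append((name, count))
    return sorted(hits, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical sample statutes, for illustration only.
    statutes = {
        "act_a": "Environment clearance is required. The environment clearance lapses after five years.",
        "act_b": "Environment clearance rules.",
        "act_c": "Insurance rules.",
    }
    print(rank_by_phrase(statutes, "environment clearance"))
```

A raw count favours long documents, so dividing by document length (or some other weighting) may give a fairer ranking.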

As a second example, take "insurance for an accident in a DUI case". Here, Qiqqa should search for three separate words: "insurance", "DUI" and "accident" (there is no key phrase here). Again, we need Qiqqa to find the documents best suited to the purpose.
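For the multi-keyword case, a rough sketch in Python follows. The name `rank_by_keywords` is hypothetical, and two choices in it are my assumptions rather than requirements: a document must contain every keyword to be listed, and the score is the total number of keyword occurrences.

```python
import re
from collections import Counter


def rank_by_keywords(documents, keywords):
    """Rank documents that contain every keyword by their total keyword occurrences."""
    wanted = [keyword.lower() for keyword in keywords]
    ranked = []
    for name, text in documents.items():
        counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
        # Assumption: a document qualifies only if all keywords are present.
        if all(counts[keyword] > 0 for keyword in wanted):
            ranked.append((name, sum(counts[keyword] for keyword in wanted)))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical sample documents, for illustration only.
    library = {
        "judgment_1": "The insurance claim arose from an accident. DUI was proven, so the insurance was void.",
        "act_2": "Insurance of motor vehicles.",
        "judgment_3": "A DUI accident; insurance paid.",
    }
    print(rank_by_keywords(library, ["insurance", "DUI", "accident"]))
```

This matches whole words only, so "accidents" would not count as "accident"; stemming or synonym handling would be a further step.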

In both cases, Qiqqa must be able to analyze the body text and headings. (It may be faster if Qiqqa scans only the headings in the first round and displays quick results, then continues to scan the body text in the next round and updates the results.)
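The two-round idea could be sketched like this in Python. Again this is only an illustration: `two_round_scan`, `count_phrase` and `rank` are hypothetical names, and I am assuming that each document's headings and body text have already been separated, which is a problem of its own for scanned PDFs.

```python
import re


def count_phrase(text, phrase):
    """Count case-insensitive occurrences of phrase in text."""
    return len(re.findall(re.escape(phrase), text, flags=re.IGNORECASE))


def rank(scores):
    """Return (name, score) pairs with a non-zero score, highest first."""
    hits = [(name, score) for name, score in scores.items() if score > 0]
    return sorted(hits, key=lambda item: item[1], reverse=True)


def two_round_scan(documents, phrase):
    """Yield a quick ranking from headings only, then a ranking that adds body text.

    documents maps a name to a dict with "headings" (list of str) and "body" (str).
    """
    scores = {
        name: sum(count_phrase(heading, phrase) for heading in doc["headings"])
        for name, doc in documents.items()
    }
    yield rank(scores)  # round 1: headings only, can be shown right away

    for name, doc in documents.items():
        scores[name] += count_phrase(doc["body"], phrase)
    yield rank(scores)  # round 2: headings plus body text


if __name__ == "__main__":
    # Hypothetical sample documents, for illustration only.
    library = {
        "a": {"headings": ["Environment clearance"], "body": "No further mention."},
        "b": {
            "headings": ["General provisions"],
            "body": "Environment clearance is needed. Environment clearance may lapse.",
        },
    }
    for round_number, results in enumerate(two_round_scan(library, "environment clearance"), start=1):
        print(round_number, results)
```

Because the function is a generator, the caller can display the first result list before the slower body-text pass has run.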

jimmejardine / qiqqa-open-source

Generate auto-tag by analyzing the headings and body-text also (not just the titles) #170