Open hinneburg opened 9 years ago
True document frequencies of the term topic combination: ("文化", 54)
select
count(distinct DOCUMENT_TERM_TOPIC.DOCUMENT_ID)
from
DOCUMENT_TERM_TOPIC
where
DOCUMENT_TERM_TOPIC.TOPIC_ID in
(
select
T2.TOPIC_ID
from
TOPIC T1 JOIN TOPIC T2 ON
(
T1.HIERARCHICAL_TOPIC$START <= T2.HIERARCHICAL_TOPIC$START and
T2.HIERARCHICAL_TOPIC$END <= T1.HIERARCHICAL_TOPIC$END and
T2.HIERARCHICAL_TOPIC$START = T2.HIERARCHICAL_TOPIC$END
)
where
T1.TOPIC_ID=54
) and
TERM ="文化"
;
+-------------------------------------------------+
| count(distinct DOCUMENT_TERM_TOPIC.DOCUMENT_ID) |
+-------------------------------------------------+
| 42204 |
+-------------------------------------------------+
Number of documents that contain two tokens that are the same word but with different topics. There documents are counted twice (the error) in the implemented summation.
select
count(distinct DTT1.DOCUMENT_ID)
from
DOCUMENT_TERM_TOPIC DTT1
JOIN DOCUMENT_TERM_TOPIC DTT2
ON
(
DTT1.DOCUMENT_ID = DTT2.DOCUMENT_ID and
DTT1.TERM ="文化" and
DTT2.TERM ="文化" and
DTT1.TOPIC_ID in
(
select
T2.TOPIC_ID
from
TOPIC T1 JOIN TOPIC T2 ON
(
T1.HIERARCHICAL_TOPIC$START <= T2.HIERARCHICAL_TOPIC$START and
T2.HIERARCHICAL_TOPIC$END <= T1.HIERARCHICAL_TOPIC$END and
T2.HIERARCHICAL_TOPIC$START = T2.HIERARCHICAL_TOPIC$END
)
where
T1.TOPIC_ID=32
) and
DTT2.TOPIC_ID in
(
select
T2.TOPIC_ID
from
TOPIC T1 JOIN TOPIC T2 ON
(
T1.HIERARCHICAL_TOPIC$START <= T2.HIERARCHICAL_TOPIC$START and
T2.HIERARCHICAL_TOPIC$END <= T1.HIERARCHICAL_TOPIC$END and
T2.HIERARCHICAL_TOPIC$START = T2.HIERARCHICAL_TOPIC$END
)
where
T1.TOPIC_ID=105
)
)
;
Topic 54 consists of 105 and 32, while 105 in turn consists of 22 and 23. Number below 50 are leafs.
Number of doubly counted documents with "文化" assigned to topics 32 and 105: 1643 Number of doubly counted documents with "文化" assigned to topics 22 and 23 : 2090
Number of documents with "文化" assigned to topic 105: sum minus doubly counted, (39472) - 2090 = 37382
Number of documents with "文化" assigned to topic 54 : sum minus doubly counted recursively, (45937) - 1643 - 2090 = 42204
Idea computing the doubly counted documents bottom up for all terms.
First check, whether the doubly counted documents can be computed for all terms of the children of a topic.
select
DTT1.TERM ,
count(distinct DTT1.DOCUMENT_ID)
from
DOCUMENT_TERM_TOPIC DTT1
JOIN DOCUMENT_TERM_TOPIC DTT2
ON
(
DTT1.DOCUMENT_ID = DTT2.DOCUMENT_ID and
DTT1.TERM = DTT2.TERM and
DTT1.TOPIC_ID in
(
select
T2.TOPIC_ID
from
TOPIC T1 JOIN TOPIC T2 ON
(
T1.HIERARCHICAL_TOPIC$START <= T2.HIERARCHICAL_TOPIC$START and
T2.HIERARCHICAL_TOPIC$END <= T1.HIERARCHICAL_TOPIC$END and
T2.HIERARCHICAL_TOPIC$START = T2.HIERARCHICAL_TOPIC$END
)
where
T1.TOPIC_ID=50
) and
DTT2.TOPIC_ID in
(
select
T2.TOPIC_ID
from
TOPIC T1 JOIN TOPIC T2 ON
(
T1.HIERARCHICAL_TOPIC$START <= T2.HIERARCHICAL_TOPIC$START and
T2.HIERARCHICAL_TOPIC$END <= T1.HIERARCHICAL_TOPIC$END and
T2.HIERARCHICAL_TOPIC$START = T2.HIERARCHICAL_TOPIC$END
)
where
T1.TOPIC_ID=51
)
)
group by
DTT1.TERM
;
9384 rows in set (56.06 sec)
This takes some time but seems still efficient on bunka_kultur
with 76.000 documents.
This is a bottom up ordering of the nodes in the hierarchy
select
TOPIC_ID,
HIERARCHICAL_TOPIC$START,
HIERARCHICAL_TOPIC$END,
HIERARCHICAL_TOPIC$DEPTH
from
TOPIC
order by
HIERARCHICAL_TOPIC$DEPTH desc,
HIERARCHICAL_TOPIC$START,
HIERARCHICAL_TOPIC$END
;
Work all the nodes in this topological ordering (assuming backward edges) skip the leafs for an inner node v that corresponds to a topic pick the two children compute the doubly counted documents update the document_Frequencies of term_topic table for the term, topic combination
when aggregating topics, wrong document frequencies are shown for terms. Specifically, the frequencies exceed the total number of documents.