hinneburg / TopicExplorer

TopicExplorer
GNU Affero General Public License v3.0
11 stars 3 forks source link

webapp_te: wrong document frequency for hierachical topics #230

Open hinneburg opened 9 years ago

hinneburg commented 9 years ago

when aggregating topics, wrong document frequencies are shown for terms. Specifically, the frequencies exceed the total number of documents.

hinneburg commented 9 years ago

True document frequencies of the term topic combination: ("文化", 54)

select
  count(distinct DOCUMENT_TERM_TOPIC.DOCUMENT_ID)
from
  DOCUMENT_TERM_TOPIC
where 
  DOCUMENT_TERM_TOPIC.TOPIC_ID in 
    (
    select 
      T2.TOPIC_ID 
    from 
      TOPIC T1 JOIN TOPIC T2 ON 
        (
        T1.HIERARCHICAL_TOPIC$START <= T2.HIERARCHICAL_TOPIC$START and
        T2.HIERARCHICAL_TOPIC$END  <= T1.HIERARCHICAL_TOPIC$END and
        T2.HIERARCHICAL_TOPIC$START = T2.HIERARCHICAL_TOPIC$END
        )
    where
      T1.TOPIC_ID=54
    ) and
  TERM ="文化"
;

+-------------------------------------------------+
| count(distinct DOCUMENT_TERM_TOPIC.DOCUMENT_ID) |
+-------------------------------------------------+
|                                           42204 |
+-------------------------------------------------+

Number of documents that contain two tokens that are the same word but with different topics. There documents are counted twice (the error) in the implemented summation.

select
  count(distinct DTT1.DOCUMENT_ID)
from
  DOCUMENT_TERM_TOPIC DTT1 
    JOIN DOCUMENT_TERM_TOPIC DTT2 
    ON 
     (
     DTT1.DOCUMENT_ID = DTT2.DOCUMENT_ID and
     DTT1.TERM ="文化" and
     DTT2.TERM ="文化" and
     DTT1.TOPIC_ID in 
       (
       select 
         T2.TOPIC_ID 
       from 
         TOPIC T1 JOIN TOPIC T2 ON 
           (
           T1.HIERARCHICAL_TOPIC$START <= T2.HIERARCHICAL_TOPIC$START and
           T2.HIERARCHICAL_TOPIC$END  <= T1.HIERARCHICAL_TOPIC$END and
           T2.HIERARCHICAL_TOPIC$START = T2.HIERARCHICAL_TOPIC$END
           )
       where
         T1.TOPIC_ID=32
       ) and
     DTT2.TOPIC_ID in 
       (
       select 
         T2.TOPIC_ID 
       from 
         TOPIC T1 JOIN TOPIC T2 ON 
           (
           T1.HIERARCHICAL_TOPIC$START <= T2.HIERARCHICAL_TOPIC$START and
           T2.HIERARCHICAL_TOPIC$END  <= T1.HIERARCHICAL_TOPIC$END and
           T2.HIERARCHICAL_TOPIC$START = T2.HIERARCHICAL_TOPIC$END
           )
       where
         T1.TOPIC_ID=105

       )
     )
;     

Topic 54 consists of 105 and 32, while 105 in turn consists of 22 and 23. Number below 50 are leafs.

Number of doubly counted documents with "文化" assigned to topics 32 and 105: 1643 Number of doubly counted documents with "文化" assigned to topics 22 and 23 : 2090

Number of documents with "文化" assigned to topic 105: sum minus doubly counted, (39472) - 2090 = 37382

Number of documents with "文化" assigned to topic 54 : sum minus doubly counted recursively, (45937) - 1643 - 2090 = 42204

Idea computing the doubly counted documents bottom up for all terms.

First check, whether the doubly counted documents can be computed for all terms of the children of a topic.

select
  DTT1.TERM ,
  count(distinct DTT1.DOCUMENT_ID)
from
  DOCUMENT_TERM_TOPIC DTT1 
    JOIN DOCUMENT_TERM_TOPIC DTT2 
    ON 
     (
     DTT1.DOCUMENT_ID = DTT2.DOCUMENT_ID and
     DTT1.TERM = DTT2.TERM and
     DTT1.TOPIC_ID in 
       (
       select 
         T2.TOPIC_ID 
       from 
         TOPIC T1 JOIN TOPIC T2 ON 
           (
           T1.HIERARCHICAL_TOPIC$START <= T2.HIERARCHICAL_TOPIC$START and
           T2.HIERARCHICAL_TOPIC$END  <= T1.HIERARCHICAL_TOPIC$END and
           T2.HIERARCHICAL_TOPIC$START = T2.HIERARCHICAL_TOPIC$END
           )
       where
         T1.TOPIC_ID=50
       ) and
     DTT2.TOPIC_ID in 
       (
       select 
         T2.TOPIC_ID 
       from 
         TOPIC T1 JOIN TOPIC T2 ON 
           (
           T1.HIERARCHICAL_TOPIC$START <= T2.HIERARCHICAL_TOPIC$START and
           T2.HIERARCHICAL_TOPIC$END  <= T1.HIERARCHICAL_TOPIC$END and
           T2.HIERARCHICAL_TOPIC$START = T2.HIERARCHICAL_TOPIC$END
           )
       where
         T1.TOPIC_ID=51

       )
     )
   group by 
     DTT1.TERM
;     

9384 rows in set (56.06 sec)

This takes some time but seems still efficient on bunka_kultur with 76.000 documents.

This is a bottom up ordering of the nodes in the hierarchy

select 
  TOPIC_ID,
  HIERARCHICAL_TOPIC$START, 
  HIERARCHICAL_TOPIC$END, 
  HIERARCHICAL_TOPIC$DEPTH  
from 
  TOPIC 
order by  
  HIERARCHICAL_TOPIC$DEPTH desc,
  HIERARCHICAL_TOPIC$START,
  HIERARCHICAL_TOPIC$END
  ;

Work all the nodes in this topological ordering (assuming backward edges) skip the leafs for an inner node v that corresponds to a topic pick the two children compute the doubly counted documents update the document_Frequencies of term_topic table for the term, topic combination