hinneburg / TopicExplorer


Core Document.Number_of_Tokens is Number of Characters #240

Open hinneburg opened 8 years ago

hinneburg commented 8 years ago

In table DOCUMENT, the field NUMBER_OF_TOKENS does not hold the number of tokens but the number of characters.

Compare

select DOCUMENT_ID, NUMBER_OF_TOKENS from DOCUMENT where DOCUMENT_ID=361165;
+-------------+------------------+
| DOCUMENT_ID | NUMBER_OF_TOKENS |
+-------------+------------------+
|      361165 |             5595 |
+-------------+------------------+
1 row in set (0.00 sec)

with

select DOCUMENT_ID, COUNT(*) from DOCUMENT_TERM where DOCUMENT_ID=361165;
+-------------+----------+
| DOCUMENT_ID | COUNT(*) |
+-------------+----------+
|      361165 |      743 |
+-------------+----------+
1 row in set (0.00 sec)
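The single-document comparison above can be extended to all documents. The following is a minimal sketch, not part of the project: it assumes that DOCUMENT_TERM holds one row per token occurrence (as the comparison above implies), uses an in-memory SQLite database as a stand-in for the MySQL database, and fills it with toy data. The column TERM is a hypothetical placeholder.

```python
import sqlite3

# In-memory SQLite stands in for the project's MySQL database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DOCUMENT (DOCUMENT_ID INTEGER PRIMARY KEY, NUMBER_OF_TOKENS INTEGER);
    -- TERM is a hypothetical column; only DOCUMENT_ID appears in the issue.
    CREATE TABLE DOCUMENT_TERM (DOCUMENT_ID INTEGER, TERM TEXT);
""")

# Toy data: document 1 stores a character count (11 for "hello world"),
# document 2 stores the correct token count (2).
conn.executemany("INSERT INTO DOCUMENT VALUES (?, ?)", [(1, 11), (2, 2)])
conn.executemany(
    "INSERT INTO DOCUMENT_TERM VALUES (?, ?)",
    [(1, "hello"), (1, "world"), (2, "topic"), (2, "model")],
)

# List every document whose stored NUMBER_OF_TOKENS differs from its
# number of DOCUMENT_TERM rows. LEFT JOIN keeps documents without any rows.
MISMATCH_QUERY = """
    SELECT d.DOCUMENT_ID, d.NUMBER_OF_TOKENS, COUNT(t.DOCUMENT_ID) AS TERM_ROWS
    FROM DOCUMENT d
    LEFT JOIN DOCUMENT_TERM t ON t.DOCUMENT_ID = d.DOCUMENT_ID
    GROUP BY d.DOCUMENT_ID, d.NUMBER_OF_TOKENS
    HAVING d.NUMBER_OF_TOKENS <> COUNT(t.DOCUMENT_ID)
    ORDER BY d.DOCUMENT_ID
"""

mismatches = conn.execute(MISMATCH_QUERY).fetchall()
print(mismatches)
```

With the toy data, only document 1 is reported. The query text itself uses only standard SQL, so it should carry over to MySQL, but it has not been run against the real schema.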

Fix: Change the fill statement for table DOCUMENT so that NUMBER_OF_TOKENS stores the number of tokens instead of the number of characters.
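The fill statement itself is not shown in this issue, so the following only sketches how already-filled rows could be repaired afterwards. It assumes that the number of DOCUMENT_TERM rows per document is the intended token count, and it uses an in-memory SQLite database with toy data as a stand-in for MySQL. The column TERM is a hypothetical placeholder.

```python
import sqlite3

# In-memory SQLite stands in for the project's MySQL database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DOCUMENT (DOCUMENT_ID INTEGER PRIMARY KEY, NUMBER_OF_TOKENS INTEGER);
    -- TERM is a hypothetical column; only DOCUMENT_ID appears in the issue.
    CREATE TABLE DOCUMENT_TERM (DOCUMENT_ID INTEGER, TERM TEXT);
""")

# Toy data: documents 1 and 3 store wrong values, document 3 has no term rows.
conn.executemany("INSERT INTO DOCUMENT VALUES (?, ?)", [(1, 11), (2, 2), (3, 7)])
conn.executemany(
    "INSERT INTO DOCUMENT_TERM VALUES (?, ?)",
    [(1, "hello"), (1, "world"), (2, "topic"), (2, "model")],
)

# Recompute NUMBER_OF_TOKENS for every document from DOCUMENT_TERM.
# The correlated subquery returns 0 for documents without any term rows.
REPAIR_STATEMENT = """
    UPDATE DOCUMENT
    SET NUMBER_OF_TOKENS = (
        SELECT COUNT(*)
        FROM DOCUMENT_TERM t
        WHERE t.DOCUMENT_ID = DOCUMENT.DOCUMENT_ID
    )
"""
conn.execute(REPAIR_STATEMENT)
conn.commit()

repaired = conn.execute(
    "SELECT DOCUMENT_ID, NUMBER_OF_TOKENS FROM DOCUMENT ORDER BY DOCUMENT_ID"
).fetchall()
print(repaired)
```

A repair like this only corrects existing rows; newly imported documents would still get the character count until the fill statement is changed.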