nlesc-sigs / data-sig

Linked data, data & modeling SIG

Aggregation of words/text/corpora using a database #32

Closed romulogoncalves closed 5 years ago

romulogoncalves commented 5 years ago

It has to work outside of main memory. The corpus is really big and we want to do point and range selections.

jvdzwaan commented 5 years ago

Pinging @egpbos

What we want to build is a database that contains words (tokens)/documents/corpora. It should serve as a memory for keeping track of which words you have seen in the past. The immediate use case is the TICCLAT project, which is about TICCL, software that does OCR post-correction and/or spelling correction and/or word normalization. Basically, TICCL uses the corpus to determine what the correct form of a word in the corpus is. It uses all kinds of heuristics to select the best 'correction candidate' for each word that isn't in the dictionary (it is of course more complicated than this; I'm trying to keep the description as simple as possible). The idea is to create a database that serves as an external memory for TICCL. Because spelling changes over time, you need to be able to select documents/corpora based on criteria such as year of publication, region (for dialects), etc.

We started a database design document: https://github.com/TICCLAT/explore/blob/master/database_design.md (The document talks about anagram hashes. TICCL uses a hash function that maps all words with the same characters to the same hash value. So 'dog' and 'god' would have the same hash value. This hash value is used for finding correction candidates.)
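To illustrate the anagram-hash idea described above: TICCL's actual hash is a numeric function (see the linked design document), but any function that is invariant under reordering of a word's characters groups anagrams together in the same way. A minimal sketch, using a sorted-character key as a stand-in for the real hash:

```python
from collections import defaultdict

def anagram_key(word: str) -> str:
    """Map all words with the same characters to the same key.

    Illustrative only: TICCL uses a numeric anagram hash, but the
    grouping behaviour (anagrams collide) is the same.
    """
    return "".join(sorted(word.lower()))

# Group a small vocabulary by anagram key.
vocab = ["dog", "god", "silent", "listen", "cat"]
buckets = defaultdict(list)
for word in vocab:
    buckets[anagram_key(word)].append(word)

print(buckets[anagram_key("dog")])     # ['dog', 'god']
print(buckets[anagram_key("listen")])  # ['silent', 'listen']
```

Looking up a bucket by the key of an out-of-dictionary word then yields its candidate corrections in one step.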

jvdzwaan commented 5 years ago

We want to aggregate over words/documents/corpora, e.g., generating frequency lists.

A solution like Elasticsearch won't work, because it tries to do aggregations over the vocabulary in memory, and this will fail when the corpus grows too large.

Elasticsearch is a full text search engine and: 'Search needs to answer the question "Which documents contain this term?", while sorting and aggregations need to answer a different question: "What is the value of this field for this document?".' (see https://www.elastic.co/guide/en/elasticsearch/reference/current/fielddata.html)

From the same document (https://www.elastic.co/guide/en/elasticsearch/reference/current/fielddata.html): Instead, text fields use a query-time in-memory data structure called fielddata. This data structure is built on demand the first time that a field is used for aggregations, sorting, or in a script. It is built by reading the entire inverted index for each segment from disk, inverting the term ↔︎ document relationship, and storing the result in memory, in the JVM heap. (...) Fielddata can consume a lot of heap space, especially when loading high cardinality text fields.
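The kind of aggregation described above maps naturally onto a relational database, which streams and aggregates on disk instead of building an in-memory fielddata structure. A minimal sketch, using an in-memory SQLite database as a stand-in and a simplified two-table schema that is an assumption here, not the actual TICCLAT design:

```python
import sqlite3

# SQLite stands in for the real database; the schema is illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (doc_id INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE tokens (doc_id INTEGER, wordform TEXT, frequency INTEGER);
""")
conn.executemany("INSERT INTO documents VALUES (?, ?)",
                 [(1, 1800), (2, 1900), (3, 1900)])
conn.executemany("INSERT INTO tokens VALUES (?, ?, ?)",
                 [(1, "huis", 4), (2, "huis", 2), (2, "huys", 1), (3, "huis", 3)])

# Frequency list restricted to documents published from 1850 onward;
# the engine does the grouping, nothing is aggregated in application memory.
rows = conn.execute("""
    SELECT t.wordform, SUM(t.frequency) AS freq
    FROM tokens t JOIN documents d ON t.doc_id = d.doc_id
    WHERE d.year >= 1850
    GROUP BY t.wordform
    ORDER BY freq DESC
""").fetchall()
print(rows)  # [('huis', 5), ('huys', 1)]
```

The same `WHERE` clause pattern supports the point and range selections (year of publication, region, etc.) mentioned earlier in the thread.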

c-martinez commented 5 years ago

Will report back in a couple of months :-)

jvdzwaan commented 5 years ago

So far, our MySQL database works (we made some changes to the database design); we are able to ingest all the data provided to us so far, and specific queries don't take too long.

It seems we don't need further advice from the data SIG. We are focusing on visualizing the data now.