Scaling and indexing - Githubissues

This is in respect to enabling large scale cognitive databases and large rule sets. For now, I am focusing on RAM based databases, but I also want to understand the design choices for secondary storage, which could be measured in terabytes.

Right now chunks and rules are implemented as an open source JavaScript library with the following maps:

From chunk ID to the corresponding chunk, where IDs uniquely identify chunks within a given chunk graph
From chunk type to the set of chunks with that type
From chunk type and non-literal property values to the set of chunks in which they appear

This is just a start and needs improvement. The latter two maps currently use simple arrays for the set of chunks. If you want to get chunks with given values for the type and multiple properties, this involves the need to compute set intersection given multiple arrays where you are looking for the IDs that occur in each of the arrays.

What are good algorithms for that? One idea is to use a JavaScript associative array that maps an ID to an integer. You then iterate through each list and use the integer to count the number of arrays it appears in. When the count equals the number of arrays, you push the ID to an array for the results. The map can then be discarded.

This relies on JavaScript's implementation of maps. Would Bloom filters be better for really large databases? These sometimes give false positives, but never any false negatives. You thus need to verify the results, but the storage required is limited and lends itself to use in RAM even for sets that are too large to hold in RAM.

Another challenge is how to efficiently find which rules match the chunks held in the cognitive module buffers. Charles Forgy’s Rete algorithm is promising. A related idea is to compile rule conditions into a discrimination network. Whenever a module buffer is updated, this triggers the discrimination network to update the set of matching rules.

Forgy’s OPS5 rule language applied to the entire state of a database. For the cognitive rule language, rule conditions operate on module buffers that each hold a single chunk. There are only a few such modules, so in principle, we should be able to scale to much larger databases than OPS5. One complication is the need to update the discrimination network when adding, removing or revising rules.

I think that there are opportunities to improve performance using information about the expected utility of rules. This can be computed using reinforcement learning following the work done by John Anderson and colleagues at CMU.

Any advice you have on indexing algorithms would be much appreciated.

w3c / cogai

Scaling and indexing #38