Problem

Currently, pattern frequency (required for calculating the empirical probability during surprisingness evaluation) is calculated by enumerating all its matches and dividing by the universe count. Such enumeration is costly, especially in RAM. On a real world dataset, such as used in

https://github.com/opencog/miner/tree/master/examples/miner/mozi-ai

https://github.com/ngeiswei/reasoning-bio-as-xp

it easily maxes out 32GB of RAM. This has been improved by subsampling/bootstrapping the dataset based on an estimate of the empirical probability. Such estimate can be very wrong though, leading to under or over subsampling, thus innacurracies or RAM explosions.

Solutions

Improve the subsampling/bootstrapping mechanism, maybe auto-tuned via binary search, etc.
Introduce a dedicated pattern matcher callback that takes less memory, maybe only saving the atom hashes rather than the atoms themselves, or maybe saving nothing at all but still somehow guarantying not to recount matches.

opencog / miner

Optimize pattern frequency calculation #26

Problem

Solutions