opencog / miner

Frequent and surprising subhypergraph pattern miner for the AtomSpace.
Other
8 stars 15 forks source link

Optimize pattern frequency calculation #26

Open ngeiswei opened 4 years ago

ngeiswei commented 4 years ago

Problem

Currently, pattern frequency (required for calculating the empirical probability during surprisingness evaluation) is calculated by enumerating all its matches and dividing by the universe count. Such enumeration is costly, especially in RAM. On a real world dataset, such as used in

https://github.com/opencog/miner/tree/master/examples/miner/mozi-ai

or

https://github.com/ngeiswei/reasoning-bio-as-xp

it easily maxes out 32GB of RAM. This has been improved by subsampling/bootstrapping the dataset based on an estimate of the empirical probability. Such estimate can be very wrong though, leading to under or over subsampling, thus innacurracies or RAM explosions.

Solutions

  1. Improve the subsampling/bootstrapping mechanism, maybe auto-tuned via binary search, etc.
  2. Introduce a dedicated pattern matcher callback that takes less memory, maybe only saving the atom hashes rather than the atoms themselves, or maybe saving nothing at all but still somehow guarantying not to recount matches.
ngeiswei commented 4 years ago

Boostrapping seems to work fairly well, however it's still too slow for large data sets, thus introducing a dedicated pattern matcher callback could be welcome.