We need different solutions for accessing the underlying wikipediaminer data. Only using in-process, in-memory data storage has a number of downsides:
initialising the service / restoring its state takes a long time
it is easy to run out of memory (OOM) on large wpm dumps (such as English :-)
there is no way to do load balancing
The upside of keeping everything in-process, in-memory is speed: even though remote storage can also keep everything in memory, it always adds an extra layer of indirection (i.e. request / response transport).
We should refactor the code to support multiple backends for the wpm dumps. The first refactoring to enable this has already been done in commit ce0d13f1c7ccdd21fab6ebcf12f9b52b7bfd8c25. We currently support:
wpm.wpmdata_inproc.WpmDataInProc - keep the data in-process in-memory
wpm.wpmdata_redis.WpmDataRedis - keep data in redis
All new storage drivers should inherit from wpm.base.Data and implement all of its functions (it's really an interface, but Python doesn't seem to support that in a nice enough way). The instance is created at runtime based on the configuration value wpmdatasource, which should be the fully qualified class name of the implementation (e.g. wpm.wpmdata_inproc.WpmDataInProc).
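As a rough illustration of the mechanism described above, here is a minimal sketch of an interface-like base class and of creating a driver from a dotted class name. It is not the actual wpm code: the Data class below is only a stand-in for wpm.base.Data, and entity_exists, InProcData, load_class and create_data_source are hypothetical names chosen for the example. It assumes Python 3 and uses the standard abc and importlib modules.

```python
import importlib
from abc import ABC, abstractmethod


class Data(ABC):
    """Stand-in for wpm.base.Data (hypothetical; the real method set differs)."""

    @abstractmethod
    def entity_exists(self, entity):
        """Return True if the entity is present in the wpm dump."""


class InProcData(Data):
    """Toy in-process, in-memory driver (hypothetical example class)."""

    def __init__(self, entities=None):
        self._entities = set(entities or [])

    def entity_exists(self, entity):
        return entity in self._entities


def load_class(dotted_name):
    """Resolve a name such as 'package.module.ClassName' to the class object."""
    module_name, _, class_name = dotted_name.rpartition(".")
    if not module_name:
        raise ValueError("expected 'module.ClassName', got %r" % dotted_name)
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


def create_data_source(dotted_name, **kwargs):
    """Instantiate the driver named by a config value like wpmdatasource."""
    cls = load_class(dotted_name)
    # Reject anything that does not implement the Data interface.
    if not (isinstance(cls, type) and issubclass(cls, Data)):
        raise TypeError("%s does not inherit from Data" % dotted_name)
    return cls(**kwargs)
```

With a base class like this, a driver that forgets to implement one of the abstract methods fails when it is instantiated, not later on the first call to the missing method. Driver-specific settings (for example a Redis host) could be passed through the keyword arguments of create_data_source, but how those are read from the configuration is left open here.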
The configuration module still needs to be adjusted to support more flexible loading of different parameters.