sashafrey / topicmod

This project has been moved to https://github.com/bigartm/bigartm

Avoid too large Protobuf messages (TopicModel, ThetaMatrix and DictionaryConfig) #104

Open sashafrey opened 10 years ago

sashafrey commented 10 years ago

Google Protobuf messages should be kept under 64 MB. It is possible to raise this limit, but the general recommendation is to keep all objects small.

Currently we have three potentially huge objects: TopicModel, ThetaMatrix and DictionaryConfig. All of them can be naturally split into smaller pieces (a TopicModel can represent a subset of all words, a ThetaMatrix a subset of items, a DictionaryConfig a subset of entries). We should redesign c_interface so that it allows these objects to be transferred in pieces.
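
To make the "split into smaller pieces" idea concrete, here is a minimal sketch of cutting a topic model into consecutive word subsets. `TokenRow` and `SplitIntoPieces` are hypothetical names used only for illustration; they are not part of c_interface or of the Protobuf messages, and a real implementation would more likely bound each piece by serialized byte size than by row count.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one row of a topic model:
// a token and its per-topic weights.
struct TokenRow {
  std::string token;
  std::vector<float> weights;
};

// Splits the rows into consecutive pieces of at most max_rows rows each.
// Every piece is a valid "topic model over a subset of words".
std::vector<std::vector<TokenRow>> SplitIntoPieces(
    const std::vector<TokenRow>& rows, std::size_t max_rows) {
  assert(max_rows > 0);
  std::vector<std::vector<TokenRow>> pieces;
  for (std::size_t begin = 0; begin < rows.size(); begin += max_rows) {
    const std::size_t end = std::min(begin + max_rows, rows.size());
    pieces.emplace_back(rows.begin() + static_cast<std::ptrdiff_t>(begin),
                        rows.begin() + static_cast<std::ptrdiff_t>(end));
  }
  return pieces;
}
```

The same pattern would apply to ThetaMatrix (pieces of items) and DictionaryConfig (pieces of entries).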

sashafrey commented 10 years ago

A suggested change is as follows:

  1. ArtmCopyRequestResult returns a positive number indicating whether there are more messages to transfer as the result of the previous request. The user must keep retrieving those messages until ArtmCopyRequestResult returns 0. This will solve the issue for ArtmRequestThetaMatrix and ArtmRequestTopicModel.
  2. ArtmOverwriteTopicModel should be renamed to ArtmUpdateTopicModel. It should still accept a TopicModel message as an argument, but instead of overwriting the entire topic model it should update only those tokens that are included in the provided TopicModel. The implementation of ArtmUpdateTopicModel should ask the merger to replace the corresponding piece of the topic model with the provided topic model. This operation must be asynchronous (the merger just places the update as a new increment into the merger queue). This implies that ::artm::core::ModelIncrement needs a new Boolean flag defining whether the increment must be added to the existing values or must fully overwrite them. The user should call ArtmWaitIdle() to wait until the merger has actually pushed all increments to the main topic model.
  3. ArtmCreateDictionary() and ArtmReconfigureDictionary() should be replaced with ArtmUpdateDictionary(). To make the implementation efficient we should probably use something like concurrent_hash_map to store the dictionaries, so that they can be updated without copying them on every update.
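
The add-versus-overwrite flag from item 2 can be sketched as follows. This is a simplified, single-threaded illustration under stated assumptions: `Increment` and `Merger` are hypothetical stand-ins for ::artm::core::ModelIncrement and the real merger, and `Drain()` only models what ArtmWaitIdle() would wait for. The actual merger runs asynchronously on its own thread.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

using Weights = std::vector<float>;
using Model = std::map<std::string, Weights>;

// Hypothetical simplification of a model increment.
struct Increment {
  Model rows;
  bool overwrite = false;  // true: replace existing values; false: add to them
};

class Merger {
 public:
  // An update only enqueues the increment; nothing is applied yet.
  void Enqueue(Increment inc) { queue_.push(std::move(inc)); }

  // Applies all queued increments in order (stand-in for waiting until idle).
  void Drain() {
    while (!queue_.empty()) {
      Apply(queue_.front());
      queue_.pop();
    }
  }

  const Model& model() const { return model_; }

 private:
  void Apply(const Increment& inc) {
    for (const auto& entry : inc.rows) {
      Weights& target = model_[entry.first];
      if (inc.overwrite || target.empty()) {
        target = entry.second;  // new token, or explicit overwrite
        continue;
      }
      assert(target.size() == entry.second.size());
      for (std::size_t i = 0; i < target.size(); ++i) {
        target[i] += entry.second[i];  // additive increment
      }
    }
  }

  std::queue<Increment> queue_;
  Model model_;
};
```

Tokens that do not appear in an increment are left untouched, which is what distinguishes an update from the old full overwrite.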

sashafrey commented 10 years ago

  1. In network mode, limit the size of the ModelIncrement that goes from a node to the master.
  2. ArtmCopyRequestResult should return the size of the buffer for the next chunk (0 if it was the last chunk)
  3. ArtmUpdateTopicModel must support removal of tokens. This must go as a separate UpdateTopicModel operation, in which the TopicModel message carries no token_weights (this is the way to indicate that the listed tokens must be deleted).
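
The chunked retrieval described in item 2 (each copy call returns the buffer size needed for the next chunk, 0 after the last one) could look roughly like the loop below. `PendingResult`, `FirstSize`, `CopyChunk` and `FetchAll` are hypothetical names for illustration only; the real ArtmCopyRequestResult signature is not shown in this thread and is not reproduced here. The sketch also assumes that every chunk is non-empty, since a size of 0 is reserved to mean "no more chunks".

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a request result held on the library side,
// already split into serialized chunks.
class PendingResult {
 public:
  explicit PendingResult(std::deque<std::string> chunks)
      : chunks_(std::move(chunks)) {}

  // Size of the first chunk, as the request call itself might return it.
  std::size_t FirstSize() const {
    return chunks_.empty() ? 0 : chunks_.front().size();
  }

  // Copies the current chunk into the caller's buffer and returns the size
  // of the next chunk, or 0 if the copied chunk was the last one.
  std::size_t CopyChunk(char* buffer, std::size_t capacity) {
    assert(!chunks_.empty());
    const std::string& chunk = chunks_.front();
    assert(capacity >= chunk.size());
    std::copy(chunk.begin(), chunk.end(), buffer);
    chunks_.pop_front();
    return chunks_.empty() ? 0 : chunks_.front().size();
  }

 private:
  std::deque<std::string> chunks_;
};

// Caller-side loop: allocate exactly the announced size, copy, repeat
// until the library reports that no chunk is left.
std::vector<std::string> FetchAll(PendingResult* result) {
  std::vector<std::string> received;
  std::size_t next = result->FirstSize();
  while (next > 0) {
    std::string buffer(next, '\0');
    next = result->CopyChunk(&buffer[0], buffer.size());
    received.push_back(std::move(buffer));
  }
  return received;
}
```

Returning the next size from the copy call saves the caller a separate "how big is the next chunk" round trip, at the cost of reserving 0 as a sentinel.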