Closed (sashafrey closed 9 years ago)
Let's finally fix this issue :)
Please review the suggestion below and let me know if you have any comments.
Basically, the idea is to replace "int topics_count" with a repeated field "string topic_name" in the ModelConfig message.
message ModelConfig { "optional int topics_count" -> "repeated string topic_name;" }
That is, in the model config the user must define not just the number of topics, but the list of their string identifiers. Topic names must be unique within the model config, but there is no requirement to make them globally unique in ARTM. The goal of this change is to
We may also support simple regular expressions to let the user tell us which topics a regularizer should apply to. For example, the user may define topics named "background1, background2, topic1, topic2, topic3". Then topic_to_regularizer might have just one value, "background*", as a topic name, meaning that the regularizer is applied to all background topics.
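The "background*" idea could be sketched roughly as follows. This is only an illustration, assuming that a trailing '*' is the only wildcard supported; the helper names MatchesTopicPattern and SelectTopics are hypothetical and are not part of BigARTM.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not BigARTM API): returns true if `name` matches
// `pattern`. A trailing '*' in the pattern matches any suffix, including
// the empty one; any other pattern must equal the name exactly.
bool MatchesTopicPattern(const std::string& pattern, const std::string& name) {
  if (!pattern.empty() && pattern.back() == '*') {
    const std::string prefix = pattern.substr(0, pattern.size() - 1);
    return name.compare(0, prefix.size(), prefix) == 0;
  }
  return pattern == name;
}

// Hypothetical helper: returns the indices of all topics in `topic_names`
// that match at least one of `patterns`, in topic order.
std::vector<int> SelectTopics(const std::vector<std::string>& topic_names,
                              const std::vector<std::string>& patterns) {
  std::vector<int> selected;
  for (int i = 0; i < static_cast<int>(topic_names.size()); ++i) {
    for (const std::string& pattern : patterns) {
      if (MatchesTopicPattern(pattern, topic_names[i])) {
        selected.push_back(i);
        break;  // one matching pattern is enough for this topic
      }
    }
  }
  return selected;
}
```

With the five topic names from the example and the single pattern "background*", SelectTopics would pick the first two topics, so a regularizer would only need to touch those two columns.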
3. For each word, the class TopicModel will still store nwt and rwt as vectors of length equal to the number of topics in the model. The TopicModel will have a std::vector<std::string> list that identifies the correspondence between topic names and indices in TopicModel.
4. RequestThetaMatrix and RequestTopicModel should accept the list of topics the user wants to retrieve.
5. In internals.proto we must add the list of topic_names to the ModelIncrement message, so that the merger knows exactly how to merge the increment into its current model. This will take care of Reconfigure() during an ongoing collection scan. We may consider not doing it now and postponing it for later.
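To make the name-to-index correspondence and the merge step a bit more concrete, here is a minimal sketch. It assumes that an increment carries its own topic name list alongside per-topic counters for one token, and that topics unknown to the current model are simply skipped. Both are assumptions for illustration, and FindTopicIndex and MergeTokenIncrement are hypothetical names, not existing BigARTM functions.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Returns the index of `name` in `topic_names`, or -1 if it is absent.
int FindTopicIndex(const std::vector<std::string>& topic_names,
                   const std::string& name) {
  auto it = std::find(topic_names.begin(), topic_names.end(), name);
  return it == topic_names.end()
             ? -1
             : static_cast<int>(it - topic_names.begin());
}

// Hypothetical merge step for a single token. `increment` holds counters in
// the order given by `increment_topic_names`; they are added to `model_nwt`,
// which is ordered by `model_topic_names`. Topics that the model does not
// know (for example, after a reconfigure) are skipped; this is an assumption.
void MergeTokenIncrement(const std::vector<std::string>& model_topic_names,
                         const std::vector<std::string>& increment_topic_names,
                         const std::vector<float>& increment,
                         std::vector<float>* model_nwt) {
  const size_t count = std::min(increment_topic_names.size(), increment.size());
  for (size_t i = 0; i < count; ++i) {
    const int index = FindTopicIndex(model_topic_names, increment_topic_names[i]);
    if (index >= 0 && index < static_cast<int>(model_nwt->size())) {
      (*model_nwt)[index] += increment[i];
    }
  }
}
```

A linear search keeps the sketch short; with many topics, a map from name to index built once per increment would likely be a better fit.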
There might be more things to cover here, but this gives a general idea of how we want to identify topics in BigARTM.
This has been fixed in the BigARTM code: https://github.com/bigartm/bigartm/commit/5034091e74a2abae8347dde536df5080d5fbef69
Currently the concept of 'Topic' is barely represented in the code. Basically, there is a number of topics associated with each model, and beyond that a topic is just an index into some arrays/matrices. This makes it difficult to associate extra information with each topic (for example, a flag 'subject topic' vs 'background topic').