machine-intelligence-laboratory / TopicNet

Interface for easier topic modelling.
https://machine-intelligence-laboratory.github.io/TopicNet
MIT License

Extend the functionality of Dataset #57

Open bt2901 opened 4 years ago

bt2901 commented 4 years ago

Something along the lines of "convert between Counter and vowpal_wabbit" would be very helpful.
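
Something like the sketch below could work. Just an illustration, assuming each document is a `{modality_name: Counter}` mapping; the function names and the `@text` modality are made up, not part of the current `Dataset` API:

```python
from collections import Counter


def counter_to_vw(doc_id, modality_counters):
    """Serialize {modality_name: Counter} into one Vowpal Wabbit line."""
    parts = [doc_id]
    for modality, counter in modality_counters.items():
        tokens = ' '.join(f'{word}:{count}' for word, count in counter.items())
        parts.append(f'|{modality} {tokens}')
    return ' '.join(parts)


def vw_to_counter(vw_line):
    """Parse one Vowpal Wabbit line back into (doc_id, {modality_name: Counter})."""
    doc_id, *sections = vw_line.split('|')
    modality_counters = {}
    for section in sections:
        modality, *tokens = section.split()
        counter = Counter()
        for token in tokens:
            word, _, count = token.partition(':')
            # a token without an explicit count means count == 1
            counter[word] += int(count) if count else 1
        modality_counters[modality] = counter
    return doc_id.strip(), modality_counters
```

Round trip: `counter_to_vw('doc_1', {'@text': Counter({'cat': 2, 'dog': 1})})` gives `'doc_1 |@text cat:2 dog:1'`, and `vw_to_counter` parses it back.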

Also, maybe we need to store more metadata (such as main modality and co-occurrences)

Related code: https://github.com/machine-intelligence-laboratory/OptimalNumberOfTopics/blob/master/topnum/scores/arun.py (especially relevant now, since we distribute corpus descriptions obtained with this code)

Evgeny-Egorov-Projects commented 4 years ago

Just an "off" note: I still don't understand the "sacred" meaning of the main modality. Doesn't taking ANY modality as the main one and then recalculating the weights accordingly bring them to "equal ground", so to speak? Or, more mathematically: there exists a hyperplane of "equal regularization effect", and setting the coefficient of one of the modalities to one scales the others accordingly?

Alvant commented 4 years ago

Let me break into the discussion and say a couple of words in defense of the main modality :)

This is not, imho, about equal weights or anything of the sort. Topic modeling is about analyzing texts. So it is reasonable to provide a way to tell the plain text (aka main modality) apart from the other modalities, which are either meta information (such as author or title) or manually created fancy things (bigrams, trigrams, skipgrams, and god knows what else one can come up with). A user may want to build models solely on plain text. Or she may want to use this modality for coherence computation, for example (if the words of the main modality appear in natural order in the VW file, while the other modalities are in bag-of-words form). So, main modality == preprocessed raw text.

Or maybe it would be better to give it some other name (not main modality, but preprocessed_text or plain_text?)
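
For illustration, a VW line could look like this (the modality names here are made up, nothing is fixed by TopicNet):

```
doc_42 |@plain_text the cat sat on the mat |@author a_smith |@bigram the_cat:1 cat_sat:1
```

Here the `@plain_text` tokens keep their natural order (so they can be used for coherence computation), while the other modalities are plain bags of words with counts.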

bt2901 commented 4 years ago

I agree with @Alvant, but I want to add another consideration.

In many models, multiplying every modality weight by the same constant leaves the model unchanged (as a consequence, you could indeed recalculate the weights based on any modality). However, this is not the case when regularizers are involved. If we want to transfer good taus between different domains (and we do want that: that's the whole point of the relative coefficients technique), the regularization coefficients must be "on the same scale", so to speak. I believe that relating them to the main_modality's weight (1 by default) is the most natural choice.
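
To spell out the scale argument (a sketch in ARTM-style notation, which is my own here: the $\lambda_m$ are modality weights, the $\tau_i$ are regularization coefficients):

```latex
% Weighted ARTM objective: modality log-likelihoods plus regularizers
\sum_{m} \lambda_m \sum_{d \in D} \sum_{w \in W^m} n_{dw}
    \ln \sum_{t \in T} \phi_{wt} \theta_{td}
\;+\; \sum_{i} \tau_i R_i(\Phi, \Theta)
\;\longrightarrow\; \max_{\Phi, \Theta}
```

Multiplying every $\lambda_m$ by a constant $c > 0$ scales only the likelihood part, so the maximizer is the same as with the original weights and coefficients $\tau_i / c$. The taus are therefore only meaningful relative to the overall scale of the modality weights, and fixing the main modality's weight at 1 pins that scale down (this is essentially the "hyperplane of equal regularization effect" mentioned above).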