Closed junotk closed 11 years ago
It's an interesting idea.
One way to think about a class is as a particular data structure with methods for working on that structure and with additional information about that structure. A Viewer class typically wraps the output of a model trainer (typically a real matrix or two) and lets you work on it in various ways.
At the moment, I'm not seeing what a word class should contain as data. It has a method or attribute, dist_word_top; does this mean it stores the word-by-topic matrix as well?
I think it's an interesting idea too. One question, however, is whether it provides a developer-programmer advantage or a user-programmer advantage. As a user of the code, the current viewer methods are fairly transparent: the word is explicitly passed as a parameter to a method on the viewer, and it is easy to try a different word. If one has to create a new word object every time one wants to explore the model around it, this seems like more work, and potentially less transparent. Of course, instead of something like r = v.word('realism') one could write realism = v.word('realism'), and then calls like realism.sim_word_word(length=10) would be more transparent, but the onus for that transparency falls on the user-programmer in a way that is likely to be only incompletely adopted as a practice. Also, how would one get the current kind of functionality where a list of words can be passed to the viewer? Would we need arbitrary multi-word objects and viewers associated with them?
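To make the comparison concrete, here is a toy sketch of the two calling styles being discussed. MiniViewer and Word are stand-ins with stubbed ranking, not the actual vsm classes; the point is only that the object-based style delegates back to the same viewer method.

```python
# Hypothetical sketch contrasting the two calling styles discussed.
# MiniViewer and Word are toy stand-ins, not the actual vsm classes.

class Word:
    """Wraps a word (or list of words) plus a reference to the viewer."""
    def __init__(self, viewer, words):
        self.viewer = viewer
        self.words = words if isinstance(words, list) else [words]

    def sim_word_word(self, length=10):
        # Delegate back to the viewer; the object just remembers the words.
        return self.viewer.sim_word_word(self.words, length=length)

class MiniViewer:
    def sim_word_word(self, words, length=10):
        # Stubbed: a real viewer would rank words by cosine similarity.
        return [(w, 1.0) for w in words][:length]

    def word(self, words):
        return Word(self, words)

v = MiniViewer()

# Current style: the word is passed explicitly to each method.
direct = v.sim_word_word(['realism'], length=10)

# Proposed style: bundle the word(s) into an object first.
realism = v.word('realism')
wrapped = realism.sim_word_word(length=10)

assert direct == wrapped
```

Both calls reach the same underlying method, which illustrates the point that the difference is one of user ergonomics rather than functionality.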
Robert, Each item class (word, docs, topics) can inherit both two matrices (doc_top and top_word) from viewer. Also docs may contain context and/or label info. All these attributes should be immutable. My understanding is that it does not end up with actually duplicating large matrices for each instance... is that right?
Colin, I intend this to be purely for user-programmer's sake. The first line of the above example (r = v.word('realism')) is just for illustration. In practice, one may start with
l = sim_word_word('realism', length=10)
which returns a list of 10 words similar to 'realism', which can then be used for further analysis like simmat or plotting. The idea is that returning a word class (with methods) rather than just an array makes further analysis easier. For example, if you want to do an analysis similar to the above (i.e. plot an isomap of the 10 topics most relevant to 'realism') in the current viewer, you need to type
top, p = zip(*sim_word_top('realism')[:10])
simmat = v.simmat_topics(top)
cls = v.clustering(simmat)
v.isomap(simmat, cls=cls)
This looks to me much more complicated and requires knowledge about the internal code (you need to know which functions to call and what they require as arguments).
Finally, a word (or doc/topic) class is defined for a list of words, so it can accommodate multiple words as well.
Ok, looks good to me!
I don't yet understand the proposed refactoring. You might fork, experiment and demo if it's too cumbersome to spell things out here.
I've been trying this in a new branch I'll push shortly.
A tentative refactored version has been developed under the refactoring branch. I also uploaded a refactoring-tutorial demo to the notebook repository to show how that affects analysis procedures. Any feedback is appreciated.
Jun,
This looks great. I can see how this can be easier for users, since the methods take fewer arguments.
One thing I noticed when I was trying to use the plotting function was that I had to run four functions in this order: t.dismat(), t.isomap(), t.KMeans(), t.plot(). This wasn't very intuitive to me, although you have nice exception handling :) Is there a reason behind separating these functions? Could we also have a function that combines all four?
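A convenience method that chains the four steps with default settings might look like the sketch below. The class and the quick_plot method name are hypothetical stand-ins, and each stage is stubbed to record that it ran; only the chaining pattern is the point.

```python
# Hypothetical sketch of a convenience method that chains the four
# steps (dismat -> isomap -> KMeans -> plot) with default settings.
# The class and method names are stand-ins, not the actual vsm API.

class TopicsSketch:
    def __init__(self):
        self._log = []          # records which stage ran, for illustration

    def dismat(self):
        self._log.append('dismat')
        return self

    def isomap(self, dims=2):
        self._log.append('isomap')
        return self

    def KMeans(self, n_clusters=5):
        self._log.append('KMeans')
        return self

    def plot(self):
        self._log.append('plot')
        return self

    def quick_plot(self, dims=2, n_clusters=5):
        """Run the whole pipeline with defaults; each stage can still
        be called individually when finer control is needed."""
        return self.dismat().isomap(dims=dims).KMeans(n_clusters=n_clusters).plot()

t = TopicsSketch()
t.quick_plot()
assert t._log == ['dismat', 'isomap', 'KMeans', 'plot']
```

Returning self from each stage also lets users who want control write the four calls explicitly as a single chain.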
Also, when this refactoring is integrated, can we use the Words and Documents classes for other model viewers that have sim_word_word and sim_doc_doc? Maybe you planned on this, but we should make a separate file for the ViewerUnit classes so ldagibbsviewer doesn't get too cluttered.
Thanks for your feedback, Robert. You are right, some of them could run automatically. I was not sure about KMeans, since there are currently two options, isomap() or mds(), for generating coordinates, and you can also choose the number of dimensions. The same goes for plot(). But maybe I can make them use some "default" settings.
I do plan to make them general and callable from other viewers. Actually, I was thinking of migrating these classes to similarity.py and making all sim_* functions methods of these classes. But that involves large modifications, so I wanted to make sure whether we really want that.
Oh, sorry, I meant Doori. Multiple re: in my gmail folder confuses me as to who was sending the message... so sorry!
I'm trying to understand what problem we're trying to solve at this point. I thought the main issue, the call for refactoring, owed to the fact that the methods which produce 2-d plots are currently monolithic, while the user might reasonably want to intervene at any of their stages.
A minimal coding solution ('refactoring') would be to break these methods up into three methods with the necessary parameters and leave it to the user to pipeline the data through them. So there's an increase in complexity for the user (three steps instead of one) but also an increase in control. This is a scientific app, so the user should have whatever control can be afforded to them, as transparently as possible.
However, users also want reasonable defaults that produce decent results in a lot of situations without managing parameters. The monolithic method provides this convenience. But this convenience can easily be added after refactoring in the form of supplemental methods that wrap the methods specific to each stage.
Your proposed changes to ldagibbsviewer involve more than this kind of refactoring; there would be a lot of additional code. What else are we trying to do then?
Here are some local concerns about the new classes:
Words should, for example, be LdaGibbsWords, since it contains methods specific to CGS output. But then why not subclass LdaGibbsViewer and define multiple kinds of LdaGibbsViewer? On the other hand, it seems that you expect these classes to be generalized wrappers for any kind of viewer. So in some sense, you're adding functionality to viewer classes in general, which would make Words a kind of base class in vsm.viewer. But this clearly isn't what's going on. I'm guessing that neither alternative is the desired one. I'm guessing that you want to store some data that gets produced from a viewer and associate some methods with that data. But we're having the usual struggle in OO programming with encapsulating the dependencies of the desired methods in the new class. Words should not denote a wrapper for a viewer instance and some output from a viewer method; the same goes for Documents.
Naming classes Words and Documents is an obvious thing to want to do when writing this kind of module; data in various shapes that we would refer to as words show up in numerous places in the workflow. But programming languages don't offer the flexible abstraction that natural languages do. For example, one would want to define a Corpus which consists of Documents which in turn consist of Words. And we already have words in vsm: a word is an index into an indexed set (matrix) of vectors, and Corpus.words is an array of such indices. There's no class called Words because it would be superfluous, and it's better that the user stay grounded in numpy. Similarly, we have Documents; only we're also calling them contexts much of the time. But they're arrays of indices into a small-'c' corpus, which is an array of words. This works, I thought, because if the user accepts this operational definition of words, then she will be led to understand crucial aspects of the mathematical models and their implementations.
Ok, thanks for your comments. I agree that what we need is to break up the monolithic plot function, but at the same time I think it would be nice if we could do that without much hassle. Currently the way a user needs to pass a result from one method to another is unintuitive and not sophisticated at all. I thought that by bundling data and functions into a class we could achieve both.
As to your two concerns, I don't think the first point is quite right. I can make them independent of LdaGibbsViewer easily, and add viewer-specific methods (like view() in Topics) at each viewer's level. Nor do I see how the second point causes much of a problem. Naming is arbitrary; I can just change the names.
Well, but if you don't like it, that's ok; let's forget about it. Maybe you can describe the layout you have in mind, and to what extent you want to de-objectify it? Before doing this I migrated the manifold learning functions to a class Manifold, but I guess you didn't like that either. Do you want to put them back in their original place as functions?
It would be helpful at this point if you would write out a specification for the changes you envisage. As I said, I don't know what problem you're trying to solve yet. A specification should have enough detail that any competent programmer familiar with vsm should be able to implement it.
'Currently, the way a user need to pass a result from one method to another is unintuitive and not sophisticated at all'.
This sounds serious, but I'm still just guessing as to what you have in mind.
I can make them independent from LdaGibbsViewer easily, and add viewer specific methods (like view() in Topics) at each viewer's level.
In the code as it is, there's a class Words which has a method sim_word_top. So if I instantiated Words with a TfIdfViewer object, sim_word_top would fail because a TfIdfModel object does not have an attribute top_word:
https://github.com/inpho/vsm/blob/refactoring/vsm/viewer/ldagibbsviewer.py#L616
Likewise, Documents.dismat will get stuck if the model object does not have a doc_top attribute, which most don't:
https://github.com/inpho/vsm/blob/refactoring/vsm/viewer/ldagibbsviewer.py#L699
In a generic version, I imagine you'll have to remove these methods; but I don't see how things will work then. Either you'll have to subclass Words for each viewer type, or you'll add methods to objects after they've been instantiated, which will ensure Words objects have a systematically inconsistent interface.
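To make the worry concrete, here is a toy sketch of the two options. All classes and attribute names below mirror the discussion but are stand-ins, not the actual vsm implementations: a generic Words whose sim_word_top must guard on the model's attributes, versus a subclass that only exists for models known to carry a top_word matrix.

```python
# Toy sketch of the two options discussed; all classes here are
# stand-ins, not the actual vsm implementations.

class FakeLdaModel:
    top_word = [[0.5, 0.5]]     # pretend word-by-topic data

class FakeTfIdfModel:
    pass                        # no top_word attribute

class Words:
    """Generic wrapper: methods must guard on the model's attributes,
    so the interface is only conditionally available."""
    def __init__(self, model):
        self.model = model

    def sim_word_top(self):
        if not hasattr(self.model, 'top_word'):
            raise AttributeError('model has no top_word matrix')
        return self.model.top_word

class LdaGibbsWords(Words):
    """Subclass route: only instantiated for models known to have
    top_word, so sim_word_top is always safe to call."""
    pass

lda_words = LdaGibbsWords(FakeLdaModel())
assert lda_words.sim_word_top() == [[0.5, 0.5]]

tfidf_words = Words(FakeTfIdfModel())
failed = False
try:
    tfidf_words.sim_word_top()
except AttributeError:
    failed = True
assert failed
```

The guard keeps one class but makes its interface conditional on the model; the subclass keeps the interface consistent but multiplies classes per viewer type, which is exactly the dilemma described above.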
If these classes are truly generic, then they should depend only on the base classes of the viewer classes, the corpus classes and the model classes. Again, I'm guessing that for this to work, you'll want to be able to make some assumptions about what data structures any model that has vector representations of word meanings contains. But even among the small selection of models we have, there's little uniformity among the data structures, and typically for good reasons. When I designed these things, I tried to make the base classes as rich as possible. However, I found it to be counterproductive to try to shoehorn every possible model type into conformity wrt its data structures. The current BaseModel doesn't even extend to LDA or LSA; and it contains very little as it is.
I think you are right: making them usable by models other than LDA may be difficult. Actually, I don't know much about those models. I think I'm going to retract my previous proposal and just modify some code in the current LdaGibbsViewer.
I was not happy that you can't plug the result of, say, a sim_*_* function directly into the dismat_* functions. But maybe this can be handled by plain if statements.
Another complication is clustering. Different clusterings require different data: AffinityPropagation and SpectralClustering require a similarity matrix, whereas KMeans needs (preferably) lower-dimensional Euclidean coordinates. So a user must feed the appropriate data to these functions. But if a user can handle that, it is fine.
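One way to spare the user this bookkeeping would be a small dispatcher that routes the right data to each method. In the sketch below, the method names follow the scikit-learn estimators mentioned above, but the function itself is a hypothetical stub that only returns which kind of input it selected.

```python
# Hypothetical sketch: route the right data to each clustering method.
# Only the dispatch logic matters here; no actual clustering is done.

# Which input each method expects (mirroring the scikit-learn
# estimators named in the discussion).
NEEDS_SIMILARITY = {'AffinityPropagation', 'SpectralClustering'}
NEEDS_COORDINATES = {'KMeans'}

def cluster(method, simmat=None, coords=None):
    """Pick the appropriate input for the requested method, raising a
    clear error when the caller supplied the wrong kind of data."""
    if method in NEEDS_SIMILARITY:
        if simmat is None:
            raise ValueError(method + ' needs a similarity matrix')
        return ('similarity', simmat)
    if method in NEEDS_COORDINATES:
        if coords is None:
            raise ValueError(method + ' needs Euclidean coordinates')
        return ('coordinates', coords)
    raise ValueError('unknown method: ' + method)

simmat = [[1.0, 0.2], [0.2, 1.0]]   # toy 2x2 similarity matrix
coords = [[0.0, 0.0], [1.0, 1.0]]   # toy 2-d coordinates

assert cluster('AffinityPropagation', simmat=simmat)[0] == 'similarity'
assert cluster('KMeans', coords=coords)[0] == 'coordinates'
```

A wrapper like this could live on the object that holds the distance matrix, so the user names the clustering method and the object supplies the matching data.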
Given the above, I think most of the "refactoring" is already done in the manifold branch (which is now taken up in develop too). There is no change to the sim_* functions. What's new is:
- dismat_* functions return a Manifold object, which is a dissimilarity matrix with some information (the measure used, label names, cluster info) and methods.
- Clustering and manifold learning are methods of the Manifold object. Manifold learning also plots the result.
I only tested it with LdaGibbsViewer but believe this is not specific to LDA. That is, as long as a model outputs a dissimilarity matrix, this should work. In fact, clustering and manifold learning are all about distance matrices and don't care where they come from. So putting them outside a particular model makes sense to me.
I agree. The Manifold class achieves the goal of modular code. The essential piece of data that it wraps is a distance matrix, which we want to view as coordinate-free data determining a manifold, and it associates various methods with that data.
The one rough spot I can see at the moment is that a user might want to adjust the label formatting while working with Manifold. A solution is to keep it the job of the viewer class to produce labels, as now. But it should be able to do so without having to recompute the distance matrix. If a user wants to change the format of labels she could (a) go back to the viewer and ask for new labels (via a different label function) and (b) plug those into the Manifold object. So there's some refactoring of the viewer to do there. Indeed, for convenience, we might add an update_labels-type method to the viewer class that takes a Manifold object and a label function and returns an updated Manifold object.
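The update_labels idea might look like the following sketch. The class and method names are stand-ins for illustration only; the point is that the viewer regenerates labels while the already-computed distance matrix is reused.

```python
# Hypothetical sketch of the update_labels idea: the viewer swaps in
# new labels without recomputing the distance matrix. Names are
# stand-ins, not the actual vsm API.

class Manifold:
    def __init__(self, dismat, labels):
        self.dismat = dismat    # expensive to compute
        self.labels = labels    # cheap to regenerate

class ViewerSketch:
    def __init__(self, items):
        self.items = items

    def update_labels(self, manifold, label_fn):
        """Return a new Manifold with relabeled points, reusing the
        already-computed distance matrix."""
        return Manifold(manifold.dismat, [label_fn(i) for i in self.items])

v = ViewerSketch(items=[0, 1, 2])
m = Manifold(dismat=[[0, 1, 2], [1, 0, 1], [2, 1, 0]],
             labels=['t0', 't1', 't2'])

m2 = v.update_labels(m, lambda i: 'Topic %d' % i)
assert m2.labels == ['Topic 0', 'Topic 1', 'Topic 2']
assert m2.dismat is m.dismat    # distance matrix reused, not recomputed
```

Returning a fresh Manifold (rather than mutating in place) also keeps the object immutable from the user's point of view, matching the earlier suggestion that these attributes be immutable.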
One point about the implementation. In the current version, Manifold both inherits from IndexedSymmArray and has dismat as an attribute. It shouldn't do both of these things: an IndexedSymmArray is already an ndarray. If we go that route, we would think of a Manifold as being a type of matrix (numpy array) with some additional methods and attributes. That might be pretty slick. The other route is for Manifold to inherit from object and basically be a featureless container for data and methods (cf. any viewer class). This seems to be how it works at present. Subclassing ndarray is fairly involved in general. For now, I would suggest taking the second route.
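For reference, here is a minimal sketch of the two routes. The class names are stand-ins; the ndarray route uses numpy's standard subclassing machinery (view casting in __new__ plus __array_finalize__), which is what makes it more involved, since attributes must be propagated through slices and views by hand.

```python
import numpy as np

# Sketch of the two routes discussed. Class names are stand-ins.

# Route 2 (suggested for now): a featureless container that holds the
# distance matrix as an attribute. Simple and explicit.
class ManifoldContainer:
    def __init__(self, dismat, labels=None):
        self.dismat = np.asarray(dismat)
        self.labels = labels

    def nearest(self, i):
        """Index of the point closest to point i (excluding itself)."""
        row = self.dismat[i].astype(float).copy()
        row[i] = np.inf
        return int(np.argmin(row))

# Route 1 (deferred): subclass ndarray so the Manifold *is* the matrix.
class ManifoldArray(np.ndarray):
    def __new__(cls, input_array, labels=None):
        obj = np.asarray(input_array).view(cls)
        obj.labels = labels
        return obj

    def __array_finalize__(self, obj):
        if obj is None:
            return
        # Propagate attributes through views and slices by hand.
        self.labels = getattr(obj, 'labels', None)

dm = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]

m2 = ManifoldContainer(dm, labels=['a', 'b', 'c'])
assert m2.nearest(0) == 1

m1 = ManifoldArray(dm, labels=['a', 'b', 'c'])
assert m1[0, 1] == 1.0 and m1[:2].labels == ['a', 'b', 'c']
```

The container route keeps slicing and casting semantics entirely in the user's hands (she slices m2.dismat directly), which is why it is the safer default here.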
I made Manifold a subclass of object in the manifold as well as the develop branch. I can anticipate a lot of difficulty with the first route in handling slicing or casting, so let's put that aside for now.
I want to propose a major refactoring of the viewer objects/methods.
Currently, if you use the viewers you are basically dealing with (numpy) arrays and passing these arrays to several functions (sim, simmat, manifold). But all these functions are either (1) a function from a set of entities (words/docs/topics) to another set of entities, or (2) an operation on a set of entities. So I thought it would be much easier if we built our viewer around these entities as classes.
Since our VSM models consist of three things (words, documents, and topics), we can basically have three classes, and viewers can operate based on these objects. Something I have in mind is as follows.
and so on.
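The example code above was elided. As a purely hypothetical reconstruction based on the calls mentioned elsewhere in this thread (v.word('realism'), sim_word_word), the proposed entity-centred usage might look something like this; the actual class and method names in the original example may differ.

```python
# Purely illustrative sketch of the proposed entity-centred API; the
# ranking is stubbed, and names are stand-ins for the elided example.

class Word:
    def __init__(self, viewer, word):
        self.viewer, self.word = viewer, word

    def sim_word_word(self, length=10):
        """Return the `length` words most similar to this one, again
        wrapped as Word objects so further analysis can be chained."""
        ranked = self.viewer.ranked_words(self.word)
        return [Word(self.viewer, w) for w in ranked[:length]]

class ViewerSketch:
    def __init__(self, vocab):
        self.vocab = vocab

    def ranked_words(self, word):
        # Stub ranking: a real viewer would sort by cosine similarity.
        return [w for w in self.vocab if w != word]

    def word(self, word):
        return Word(self, word)

v = ViewerSketch(vocab=['realism', 'idealism', 'empiricism', 'logic'])
realism = v.word('realism')
similar = realism.sim_word_word(length=2)
assert [w.word for w in similar] == ['idealism', 'empiricism']
```

Because each result is itself a Word object, the output of one analysis step can feed directly into the next, which is the convenience argued for in the discussion above.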
We can add other classes on top of these basics. One is for handling a distance matrix and manifold learning.
I think this facilitates the use of the viewer. What do you think?