inpho / vsm

Vector Space Model Framework developed for InPhO
http://inpho.github.io/vsm

Refactoring viewer objects/methods #71

Closed junotk closed 11 years ago

junotk commented 11 years ago

I want to propose a major refactoring of the viewer objects/methods.

Currently, if you use the viewers you are basically dealing with (numpy) arrays and passing them to several functions (sim, simmat, manifold). But all of these functions are either (1) functions from one set of entities (words/docs/topics) to another set of entities, or (2) operations on a set of entities. So I thought it would be much easier if we built the viewer around these entities as classes.

Since our VSM models consist of three kinds of entity (words, documents, and topics), we can basically have three classes, and the viewers can operate on instances of these classes. Something like the following is what I have in mind:

r = v.word('realism')   # create a new word object

r.sim_word_word(length=10)  # gives a new word object containing 10 most similar words to 'realism'

t = r.sim_word_top() # gives a new topic object containing topics sorted by relevance to 'realism'

t.topics()                  # view the inside of t.

and so on.

We can build other functionality on top of these basic classes. One example is handling a distance matrix and manifold learning:

t.dismat(sim_fn=....)      # gives a distance matrix based on sim_fn

t.clusters()    # gives clustering (based on the above dismat)

t.isomap()     # calculate 2d embeddings with isomap 

t.plot()          # plot 2d image based on the above
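The chaining idea could be sketched roughly as follows. All names here (Words, sim_word_word, the toy vocabulary) are hypothetical illustrations, not the actual vsm API; similarity is computed with a plain cosine score over a tiny made-up matrix.

```python
import numpy as np

class Words:
    """Wraps a list of words plus a reference to the model's word matrix."""
    def __init__(self, words, word_mat, vocab):
        self.words = list(words)
        self._word_mat = word_mat      # shared reference, not a copy
        self._vocab = list(vocab)

    def sim_word_word(self, length=10):
        # Rank the vocabulary by cosine similarity to the first word
        # and return a new Words object, so calls can be chained.
        v = self._word_mat[self._vocab.index(self.words[0])]
        norms = np.linalg.norm(self._word_mat, axis=1) * np.linalg.norm(v)
        sims = self._word_mat @ v / norms
        top = np.argsort(sims)[::-1][:length]
        return Words([self._vocab[i] for i in top], self._word_mat, self._vocab)

# Toy model: 4 words embedded in 3 dimensions.
vocab = ['realism', 'idealism', 'nominalism', 'banana']
word_mat = np.array([[1.0, 0.0, 0.0],
                     [0.9, 0.1, 0.0],
                     [0.5, 0.5, 0.0],
                     [0.0, 0.0, 1.0]])
r = Words(['realism'], word_mat, vocab)
print(r.sim_word_word(length=2).words)   # ['realism', 'idealism']
```

Because each method returns another entity object, further analysis is just another method call on the result.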

I think this facilitates the use of the viewer. What do you think?

rrose1 commented 11 years ago

It's an interesting idea.

One way to think about a class is as a particular data structure with methods for working on that structure and with additional information about that structure. A Viewer class typically wraps the output of a model trainer (typically a real matrix or two) and lets you work on it in various ways.

At the moment, I'm not seeing what a word class should contain as data. It has a method or attribute, dist_word_top; does this mean it stores the word-by-topic matrix as well?

colinallen commented 11 years ago

I think it's an interesting idea too. One question, however, is whether it provides a developer-programmer advantage or a user-programmer advantage. As a user of the code, the current viewer methods are fairly transparent: the word is explicitly passed in as a parameter to a method on the viewer, and is easily changed if one wants to try a different word. If one has to create a new word object every time one wants to explore the model around it, that seems like more work, and potentially less transparent. Of course, instead of r = v.word('realism') one could write realism = v.word('realism'), and then realism.sim_word_word(length=10) would be more transparent, but the onus for that transparency falls on the user-programmer in a way that is likely to be only incompletely adopted in practice. Also, how would one get the current kind of functionality, where a list of words can be passed to the viewer? Would we need arbitrary multi-word objects and viewers associated with them?

junotk commented 11 years ago

Robert, each entity class (words, docs, topics) can inherit both matrices (doc_top and top_word) from the viewer. Docs may also contain context and/or label info. All these attributes should be immutable. My understanding is that this does not actually duplicate the large matrices for each instance... is that right?
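On the duplication question, this can be checked directly: storing the same numpy array as an attribute of several objects keeps a reference to one shared buffer, not a copy. A minimal demonstration (the Holder class is just an illustration):

```python
import numpy as np

class Holder:
    def __init__(self, mat):
        self.mat = mat   # stores a reference, not a copy

big = np.zeros((1000, 1000))
a, b = Holder(big), Holder(big)
print(a.mat is b.mat)                 # True: the very same object
print(np.shares_memory(a.mat, big))   # True: same underlying buffer
```

Numpy arrays are not immutable by default, but an array can be made read-only with `big.setflags(write=False)`, which would back up the "all these attributes should be immutable" requirement.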

junotk commented 11 years ago

Colin, I intend this to be purely for the user-programmer's sake. The first line of the example above (r = v.word('realism')) is just for illustration. In practice, one might start with

l = sim_word_word('realism', length=10)

which returns a list of the 10 words most similar to 'realism', which can then be used for further analysis such as simmat or plotting. The idea is that returning a word object (with methods) rather than a bare array makes further analysis easier. For example, to do an analysis like the one above (i.e., plot an isomap of the 10 topics most relevant to 'realism') in the current viewer, you need to type

top, p = zip(*sim_word_top('realism')[:10])

simmat = v.simmat_topics(top)

cls = v.clustering(simmat)

v.isomap(simmat, cls=cls)

This looks much more complicated to me and requires knowledge of the internal code (you need to know which functions exist and what they require as arguments).

Finally, a word (or doc/topic) class is defined over a list of words, so it can easily accommodate multiple words.

colinallen commented 11 years ago

Ok, looks good to me!

rrose1 commented 11 years ago

I don't yet understand the proposed refactoring. You might fork, experiment and demo if it's too cumbersome to spell things out here.

junotk commented 11 years ago

I've been trying this in a new branch I'll push shortly.

junotk commented 11 years ago

A tentative refactored version has been developed in the refactoring branch. I've also uploaded a refactoring-tutorial demo to the notebook repository to show how it affects analysis procedures. Any feedback is appreciated.

doori commented 11 years ago

Jun,

This looks great. I can see how this can be easier for users, since the methods take few arguments.

One thing I noticed when trying the plotting function is that I had to run four functions in this order:

t.dismat()

t.isomap()

t.KMeans()

t.plot()

This wasn't very intuitive to me, although you have nice exception handling :) Is there a reason for separating these functions? Could we also have a function that combines all four?
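The combined convenience call could be sketched like this: each stage fills in its prerequisite with defaults if it hasn't been run, so one call works end-to-end while the individual steps remain callable. All names are hypothetical stand-ins, and the stage bodies are toy placeholders, not the real isomap/KMeans algorithms.

```python
import numpy as np

class Topics:
    def __init__(self, vectors):
        self.vectors = np.asarray(vectors, dtype=float)
        self.dismat_ = self.coords_ = self.clusters_ = None

    def dismat(self):
        # pairwise Euclidean distances between topic vectors
        diff = self.vectors[:, None, :] - self.vectors[None, :, :]
        self.dismat_ = np.sqrt((diff ** 2).sum(-1))
        return self

    def isomap(self, dim=2):
        if self.dismat_ is None:        # run the prerequisite by default
            self.dismat()
        # placeholder embedding: keep the first `dim` coordinates
        self.coords_ = self.vectors[:, :dim]
        return self

    def kmeans(self):
        if self.coords_ is None:
            self.isomap()
        # placeholder clustering: split on the first coordinate's median
        med = np.median(self.coords_[:, 0])
        self.clusters_ = (self.coords_[:, 0] > med).astype(int)
        return self

    def plot_all(self):
        # the combined convenience call: runs any missing stage in order
        return self.dismat().isomap().kmeans()

t = Topics(np.random.rand(6, 5))
t.plot_all()
print(t.clusters_.shape)   # (6,)
```

Because each stage returns self, the user can still intervene at any point, but plot_all() gives the one-call default path Doori asks about.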

Also, when this refactoring is integrated, can we use the Words and Documents classes for other model viewers that have sim_word_word and sim_doc_doc? Maybe you planned for this, but we should make a separate file for the ViewerUnit classes so that ldagibbsviewer doesn't get too cluttered.

junotk commented 11 years ago

Thanks for your feedback, Robert. You are right, some of these steps could be automatic. I was not sure about KMeans, since there are currently two options for generating coordinates, isomap() or mds(), and you can also choose the number of dimensions. The same goes for plot(). But maybe I can give them sensible defaults.

I do plan to make them general and callable from other viewers. Actually, I was thinking of migrating these classes to similarity.py and making all the sim_* functions methods of these classes. But that involves large modifications, so I wanted to make sure we really want that first.

junotk commented 11 years ago

Oh, sorry, I meant Doori. The multiple re: headers in my gmail folder confused me about who sent the message... so sorry!

rrose1 commented 11 years ago

I'm trying to understand what problem we're trying to solve at this point. I thought the main issue, the call for refactoring, was that the methods which produce 2-d plots are currently monolithic, while the user might reasonably want to intervene at any of their stages:

  1. producing a set of vectors in a many-d space;
  2. mapping these vectors into 2-d space;
  3. plotting the results.

So breaking up these methods is needed.

A minimal coding solution ('refactoring') would be to break these methods up into three methods with the necessary parameters and leave it to the user to pipeline the data through them. There's an increase in complexity for the user (3 steps instead of 1), but also an increase in control. This is a scientific app, so the user should have whatever control can be afforded to them, as transparently as possible.

However, users also want reasonable defaults that produce decent results in a lot of situations without managing parameters. The monolithic method provides this convenience. But this convenience can easily be added after refactoring in the form of supplemental methods that wrap the methods specific to each stage.
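The staged pipeline plus convenience wrapper described above could look roughly like this. The function names and stage bodies are illustrative stand-ins (the 2-d embedding is a trivial projection in place of MDS/isomap), not vsm code.

```python
import numpy as np

def make_vectors(n=5, dim=10, seed=0):
    # stage 1: produce a set of vectors in a many-d space
    return np.random.default_rng(seed).random((n, dim))

def embed_2d(vectors):
    # stage 2: map the vectors into 2-d space
    # (trivial projection standing in for MDS/isomap)
    return np.asarray(vectors)[:, :2]

def render(coords):
    # stage 3: "plot" -- return the data a plotting call would consume
    return {'x': coords[:, 0], 'y': coords[:, 1]}

def plot_pipeline(n=5, dim=10):
    # supplemental convenience wrapper: decent defaults,
    # no parameter management required from the user
    return render(embed_2d(make_vectors(n, dim)))

fig = plot_pipeline()
print(len(fig['x']))   # 5
```

A user who wants control calls the three stages directly and intervenes between them; a user who wants convenience calls the wrapper.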

Your proposed changes to ldagibbsviewer involve more than this kind of refactoring; there would be a lot of additional code. What else are we trying to do then?

Here are some local concerns about the new classes:

junotk commented 11 years ago

OK, thanks for your comments. I agree that what we need is to break up the monolithic plot function, but at the same time I think it would be nice if we could do that without much hassle. Currently, the way a user needs to pass results from one method to another is unintuitive and not sophisticated at all. I thought that by bundling data and functions into a class we could achieve both.

As to your two concerns, I don't think the first point is quite right. I can easily make the classes independent of LdaGibbsViewer and add viewer-specific methods (like view() in Topics) at each viewer's level. Nor do I see how the second point causes much trouble. Naming is arbitrary; I can just change names.

Well, if you don't like it, that's OK; let's forget about it. Maybe you can describe the layout you have in mind, and to what extent you want to de-objectify things? Before doing this I migrated the manifold learning functions to a Manifold class, but I guess you didn't like that either. Do you want to put them back in their original place as functions?

rrose1 commented 11 years ago

It would be helpful at this point if you would write out a specification for the changes you envisage. As I said, I don't know what problem you're trying to solve yet. A specification should have enough detail that any competent programmer familiar with vsm should be able to implement it.

'Currently, the way a user needs to pass results from one method to another is unintuitive and not sophisticated at all.'

This sounds serious, but I'm still just guessing as to what you have in mind.

I can make them independent from LdaGibbsViewer easily, and add viewer specific methods (like view() in Topics) at each viewer's level.

In the code as it is, there's a class Words which has a method sim_word_top. So if I instantiated Words with a TfIdfViewer object, sim_word_top would fail, because a TfIdfModel object does not have a top_word attribute:

https://github.com/inpho/vsm/blob/refactoring/vsm/viewer/ldagibbsviewer.py#L616

Likewise, Documents.dismat will get stuck if the model object does not have a doc_top attribute, which most don't:

https://github.com/inpho/vsm/blob/refactoring/vsm/viewer/ldagibbsviewer.py#L699

In a generic version, I imagine you'll have to remove these methods; but then I don't see how things will work. Either you'll have to subclass Words for each viewer type, or you'll add methods to objects after they've been instantiated, which will ensure that Words objects have a systematically inconsistent interface.

If these classes are truly generic, then they should depend only on the base classes of the viewer, corpus, and model classes. Again, I'm guessing that for this to work you'll want to make some assumptions about what data structures any model with vector representations of word meanings contains. But even among the small selection of models we have, there's little uniformity among the data structures, and typically for good reasons. When I designed these things, I tried to make the base classes as rich as possible, but I found it counterproductive to try to shoehorn every possible model type into conformity with respect to its data structures. The current BaseModel doesn't even extend to LDA or LSA, and it contains very little as it is.
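One way to surface this mismatch loudly, rather than deep inside a computation, is to check at call time which matrices the model object actually provides. The attribute names (top_word) follow the discussion above; the classes themselves are a sketch, not vsm code.

```python
class Words:
    """Sketch of a generic entity class that guards its methods."""
    def __init__(self, model):
        self.model = model

    def sim_word_top(self):
        if not hasattr(self.model, 'top_word'):
            raise AttributeError(
                "%s has no top_word matrix; sim_word_top is only "
                "available for topic models" % type(self.model).__name__)
        return "ok"   # placeholder for the real computation

class TfIdfModel:
    """Stand-in for a model without a top_word attribute."""
    pass

w = Words(TfIdfModel())
try:
    w.sim_word_top()
except AttributeError as e:
    print('raised:', e)
```

This doesn't resolve the deeper design objection (the interface is still inconsistent across model types); it only makes the failure mode explicit.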

junotk commented 11 years ago

I think you are right: making these classes usable by models other than LDA may be difficult; actually, I don't know much about the other models. I'm going to retract my previous proposal and just modify some code in the current LdaGibbsViewer.

I was not happy that you can't plug the result of, say, a sim_*_* function directly into dismat_* functions. But maybe this can be handled by plain if statements.

Another complication is clustering. Different clustering methods require different data: AffinityPropagation and SpectralClustering require a similarity matrix, whereas KMeans needs (preferably) lower-dimensional Euclidean coordinates. So a user must feed the appropriate data to these functions.
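The input-shape difference can be made concrete in numpy terms: KMeans-style methods consume an (n_points, n_dims) coordinate array, while precomputed-affinity methods like AffinityPropagation and SpectralClustering consume an (n, n) similarity matrix. A common bridge from a distance matrix to a similarity matrix is a Gaussian kernel; the numbers below are toy data.

```python
import numpy as np

# coordinates: the input a KMeans-style method would take
coords = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])

# pairwise Euclidean distance matrix
diff = coords[:, None, :] - coords[None, :, :]
dismat = np.sqrt((diff ** 2).sum(-1))            # shape (3, 3)

# Gaussian kernel: distance matrix -> similarity matrix, the input
# a precomputed-affinity clustering method would take
simmat = np.exp(-dismat ** 2 / dismat.std())     # shape (3, 3)

print(coords.shape, dismat.shape, simmat.shape)
```

The choice of kernel and scale (here, the std of the distances) is itself a parameter the user might want to control, which is another argument for keeping these stages separate.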

But if a user can handle that themselves, that is fine.

junotk commented 11 years ago

Given the above, I think most of the "refactoring" is already done in the manifold branch (which has now been merged into develop too). There is no change to the sim_* functions. What's new is:

  1. dismat_* functions return a Manifold object, which is a dissimilarity matrix with some extra information (the measure used, label names, cluster info) and methods.
  2. Various clusterings (KMeans, affinity propagation, spectral clustering) as well as manifold learning (MDS, Isomap) can be executed by calling methods of the Manifold object. Manifold learning also plots the result.

I only tested it with LdaGibbsViewer, but I believe it is not specific to LDA: as long as a model outputs a dissimilarity matrix, this should work. In fact, clustering and manifold learning operate purely on distance matrices and don't care where they come from, so putting them outside any particular model makes sense to me.

rrose1 commented 11 years ago

I agree. The Manifold class achieves the goal of modular code. The essential piece of data that it wraps is a distance matrix, which we want to view as coordinate-free data determining a manifold, and it associates various methods with that data.

The one rough spot I can see at the moment is that a user might want to adjust the label formatting while working with a Manifold. A solution is to keep it the job of the viewer class to produce labels, as now, but it should be able to do so without recomputing the distance matrix. If a user wants to change the label format, she could (a) go back to the viewer and ask for new labels (via a different label function) and (b) plug those into the Manifold object. So there's some refactoring of the viewer to do there. Indeed, for convenience, we might add an update_labels-type method to the viewer class that takes a Manifold object and a label function and returns an updated Manifold object.

One point about the implementation. In the current version, Manifold both inherits from IndexedSymmArray and has dismat as an attribute; it shouldn't do both. An IndexedSymmArray is already an ndarray. If we went that route, we would think of a Manifold as a type of matrix (numpy array) with some additional methods and attributes. That might be pretty slick. The other route is for Manifold to inherit from object and basically be a featureless container for data and methods (cf. any viewer class); this seems to be how it works at present. Subclassing ndarray is fairly involved in general, so for now I would suggest taking the second route.
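The second route, Manifold as a plain container holding the distance matrix as an attribute, could be sketched like this. The labels attribute follows the discussion above; the closest_pair method is a made-up example of the kind of method such a container would carry.

```python
import numpy as np

class Manifold:
    """Plain-object container: wraps a distance matrix plus labels."""
    def __init__(self, dismat, labels):
        self.dismat = np.asarray(dismat, dtype=float)   # the wrapped data
        self.labels = list(labels)

    def closest_pair(self):
        # find the smallest off-diagonal entry of the distance matrix
        d = self.dismat.copy()
        np.fill_diagonal(d, np.inf)
        i, j = np.unravel_index(np.argmin(d), d.shape)
        return self.labels[i], self.labels[j]

m = Manifold([[0.0, 0.2, 0.9],
              [0.2, 0.0, 0.8],
              [0.9, 0.8, 0.0]],
             ['realism', 'idealism', 'nominalism'])
print(m.closest_pair())   # ('realism', 'idealism')
```

Since Manifold here is not an ndarray subclass, there are no view-casting or slicing surprises: the array lives in one attribute and is accessed explicitly.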

junotk commented 11 years ago

I made Manifold a subclass of object in both the manifold and develop branches. I can anticipate a lot of difficulty with the first route in handling slicing or casting, so let's put that aside for now.