Document metadata format

junotk commented 10 years ago

Currently, Corpus supports only strings as metadata for each document, usually the document's name. But we sometimes want richer metadata, including author name, publication date, etc. What would be the best way to store them?

One solution, as proposed in our meeting last week, is to cast all the metadata, perhaps stored as JSON, into the current str format. The problem is not only that it's ugly but also that it poses an issue for the sim_*_doc methods, which list documents by their metadata. (So you'll see all your JSON metadata in the resulting list. Not so pretty.)

The second solution would be to allow a dict as document metadata. The downside is that we have to decide which key to use in sim_*_doc results.

So we have two options, each with its issue. Is there an alternative way? If not, I personally prefer the second way. Let me hear your thoughts.

rrose1 commented 10 years ago

Regarding the first approach, a user can pass in a label function that would parse the json. An analogous approach is in the htrc extension. This would spare us from having to define a record format.
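As a rough illustration of the label-function idea, here is a minimal sketch. It assumes each document's metadata is a JSON string with a "title" key; the function name title_from_json and the sample records are hypothetical and not part of vsm.

```python
import json

def title_from_json(metadata):
    """Return a short display label from a JSON-encoded metadata string."""
    record = json.loads(metadata)
    # Fall back to the raw string if the record has no "title" key.
    return record.get("title", metadata)

# Hypothetical metadata strings, one per document.
docs = [
    '{"title": "Doc A", "author": "A. Author", "year": 2001}',
    '{"title": "Doc B", "author": "B. Author", "year": 2003}',
]
labels = [title_from_json(m) for m in docs]
```

With a function like this, the stored metadata stays a plain string, and only the viewing step needs to know how to read it.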

Once we have a working database, I expect that it will be natural that view_metadata would return a json file or a similarly generic interchange format as received from the db. I would be in favor of reproducing this behavior when an external database is not being used but the locally stored record arrays are being used.

A related issue to be worked out concerns storage of redundant metadata. Currently, metadata for a given context is completely encapsulated. For example, if one had a corpus of the individual articles from 20 issues of a journal, the user would have little choice at present but to store a full citation with each article. That said, the redundancy does have its convenience, and it will be possible to work with Corpus objects without loading them fully into volatile memory.

rrose1 commented 10 years ago

I wanted to note as well that, the way Corpus is currently coded, you could use either of these approaches without changing any of the existing code. For each context type, context_data contains a record (a numpy structured array). This record can be extended however you like (other than overwriting the idx field, which contains the index into the underlying corpus). For example, if you want to add a json field with string type to it, you can. If you want to add a dozen different fields of different types with bibliographic info, you can.

http://docs.scipy.org/doc/numpy/user/basics.rec.html
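A minimal sketch of extending such a record, using only plain numpy. The field names other than idx and document_label, the field widths, and the sample values are assumptions made for illustration; how the result is put back into context_data is not shown.

```python
import numpy as np

# Hypothetical document-level record: 'idx' indexes into the underlying
# corpus and 'document_label' holds the display name.
base = np.array(
    [(120, "article_01"), (245, "article_02")],
    dtype=[("idx", "i8"), ("document_label", "U32")],
)

# Build a wider dtype with extra bibliographic fields, then copy the
# existing fields over unchanged.
extended_dtype = base.dtype.descr + [("author", "U64"), ("year", "i4")]
extended = np.zeros(base.shape, dtype=extended_dtype)
for name in base.dtype.names:
    extended[name] = base[name]

# Fill in the new fields (illustrative values).
extended["author"] = ["A. Author", "B. Author"]
extended["year"] = [2001, 2003]
```

The idx field is copied as-is, so the mapping into the underlying corpus is untouched.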

junotk commented 10 years ago

Thanks, that looks like a reasonable way. So basically the idea is to store document titles in a *_label field (so that they will be picked up by def_label_fn), and everything else in other field(s), right?

In the future it may be useful if, when creating a corpus, a user can specify which fields of her metadata go to the document labels and which go elsewhere.

rrose1 commented 10 years ago

Yes, I completely agree. The htrc extension contains in pieces what essentially is a more elaborate and specialized corpus builder function. It would be nice to have a generic version (as well as a label function generator). I wonder if Doori would have any thoughts on this.

doori commented 10 years ago

I agree that users can benefit from a generic label function generator. Jun is right that def_label_fn includes fields that end with 'label'. But with the new generic label function, I think the user can just specify, say in a list parameter, what she wants to display in the viewing step. In other words, I don't think it is necessary to specify what you want to view when you are creating a corpus, since the information you want to view may be different for different tasks.
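One possible shape for such a generator, as a sketch only. The name make_label_fn, its sep parameter, and the sample record are hypothetical; nothing here is an existing vsm function.

```python
import numpy as np

def make_label_fn(fields, sep=", "):
    """Return a function mapping a structured metadata array to display labels.

    `fields` lists the metadata fields to show, in order.
    """
    def label_fn(metadata):
        return [sep.join(str(row[f]) for f in fields) for row in metadata]
    return label_fn

# Hypothetical document metadata.
meta = np.array(
    [(0, "Doc A", "A. Author", 2001), (57, "Doc B", "B. Author", 2003)],
    dtype=[("idx", "i8"), ("document_label", "U16"),
           ("author", "U32"), ("year", "i4")],
)

by_title = make_label_fn(["document_label"])
by_author_year = make_label_fn(["author", "year"])
```

The same corpus can then be viewed with different labels per task, without deciding on them at build time.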

rrose1 commented 10 years ago

The reason for the *_label convention was to facilitate 'rapid' corpus building and later viewing. I.e., a user could feed a list of labels into a corpus builder function like toy_corpus and later run dist_doc_doc and see those labels without ever having to think about metadata fields, etc.

Let's think about a way to achieve this goal and also to have the flexibility desired above. I agree that the *_label convention smacks of "hard-coding"....

junotk commented 10 years ago

Thanks for your thoughts, guys! In any case, I think we want to keep def_label_fn along with the current convention that every field ending with _label goes to the document name. Otherwise we end up rebuilding all existing corpora.
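For reference, the convention described here could be sketched as follows. This is an assumed reading of the behavior, not the actual def_label_fn source; default_label_fn and the sample record are hypothetical.

```python
import numpy as np

def default_label_fn(metadata):
    """Join every field whose name ends in '_label' into one display string."""
    names = [n for n in metadata.dtype.names if n.endswith("_label")]
    return [", ".join(str(row[n]) for n in names) for row in metadata]

# Hypothetical record with one *_label field and one extra field.
meta = np.array(
    [(0, "Doc A", "extra"), (57, "Doc B", "extra")],
    dtype=[("idx", "i8"), ("document_label", "U16"), ("notes", "U16")],
)
labels = default_label_fn(meta)
```

Fields that do not end in _label, such as notes above, are ignored, so extra metadata can be added without changing what existing corpora display.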

I have written a corpus builder function that reads json files for my own purposes (not pushed yet), so for the moment I'm going to revise it to handle more complex fields.

junotk commented 10 years ago

I've pushed a tentative version of json_corpus (in corpusbuilder.py) to the develop branch. It builds a corpus out of a json file. There are two metadata fields: document_label and metadata. You specify which key in the json file is to be used for labels. Everything else is gathered in metadata.
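The label/metadata split could look roughly like this. It is a sketch under assumptions, not the pushed json_corpus code: it assumes the json file holds a list of objects, and split_records, its parameters, and the sample data are hypothetical.

```python
import json

def split_records(json_text, label_key):
    """Split JSON records into labels and JSON-encoded leftover metadata."""
    records = json.loads(json_text)
    labels, metadata = [], []
    for record in records:
        # Everything except the label key is kept together as metadata.
        rest = {k: v for k, v in record.items() if k != label_key}
        labels.append(record[label_key])
        metadata.append(json.dumps(rest, sort_keys=True))
    return labels, metadata

# Hypothetical input: a list of per-document objects.
sample = json.dumps([
    {"title": "Doc A", "author": "A. Author", "year": 2001},
    {"title": "Doc B", "author": "B. Author", "year": 2003},
])
labels, metadata = split_records(sample, label_key="title")
```

Here labels would feed the document_label field and metadata the metadata field, one entry per document.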

inpho / vsm

Document metadata format #81