Closed junotk closed 10 years ago
Regarding the first approach, a user can pass in a label function that would parse the json. An analogous approach is in the htrc extension. This would spare us from having to define a record format.
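To make the first approach concrete, here is a minimal sketch of a label function that parses JSON-encoded metadata strings. The name `make_json_label_fn` and the `title` key are hypothetical, chosen only for illustration; nothing here is taken from the htrc extension.

```python
import json

def make_json_label_fn(key, default=""):
    """Return a label function that extracts `key` from JSON metadata strings.

    Hypothetical helper: assumes each document's metadata is a JSON object
    serialized as a string.
    """
    def label_fn(metadata):
        labels = []
        for raw in metadata:
            record = json.loads(raw)
            # Fall back to `default` when the key is absent.
            labels.append(str(record.get(key, default)))
        return labels
    return label_fn

# Illustrative data only.
docs = [
    json.dumps({"title": "On Sense and Reference", "author": "Frege", "year": 1892}),
    json.dumps({"title": "On Denoting", "author": "Russell", "year": 1905}),
]
title_fn = make_json_label_fn("title")
labels = title_fn(docs)
```

With something like this, the stored metadata stays a plain string and only the viewing step needs to know about the JSON layout.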
Once we have a working database, I expect that it will be natural for view_metadata to return a json file or a similarly generic interchange format, as received from the db. I would be in favor of reproducing this behavior when the locally stored record arrays are used instead of an external database.
A related issue to be worked out concerns storage of redundant metadata. Currently, metadata for a given context is completely encapsulated. For example, if one had a corpus of the individual articles from 20 issues of a journal, the user would have little choice at present but to store a full citation with each article. The redundancy does have its convenience, though, and it will be possible to work with Corpus objects without loading them fully into volatile memory.
I wanted to note as well that, the way Corpus is currently coded, you could use either of these approaches without changing any of the existing code. For each context type, context_data contains a record (a numpy structured array). This record can be extended however you like (other than overwriting the idx field, which contains the index into the underlying corpus). For example, if you want to add a json field with string type to it, you can. If you want to add a dozen different fields of different types with bibliographic info, you can.
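As an illustration of extending such a record, the sketch below appends a string-typed `json` field to a small structured array while leaving `idx` untouched. The helper `add_field` and the sample data are assumptions for this example, not part of the existing code base.

```python
import numpy as np

def add_field(record, name, values, dtype):
    """Return a copy of a structured array with one extra field appended.

    Hypothetical helper. Refuses to touch existing fields, so `idx`
    can never be overwritten by accident.
    """
    if name in record.dtype.names:
        raise ValueError(f"field {name!r} already exists")
    new_dtype = np.dtype(record.dtype.descr + [(name, dtype)])
    extended = np.empty(record.shape, dtype=new_dtype)
    # Copy the existing fields over unchanged.
    for field in record.dtype.names:
        extended[field] = record[field]
    extended[name] = values
    return extended

# A stand-in for one context type's record: idx plus a label field.
context = np.array(
    [(3, "doc_a"), (7, "doc_b")],
    dtype=[("idx", "i8"), ("document_label", "U16")],
)
extended = add_field(
    context, "json", ['{"author": "Frege"}', '{"author": "Russell"}'], "U64"
)
```

The same call could be repeated for each bibliographic field, each with its own dtype.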
Thanks, that looks like a reasonable way. So basically the idea is to store document titles in the *_label field (so that they will be picked up by def_label_fn), and everything else in different field(s), right?
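For reference, the convention could be sketched roughly as follows. This is not the actual def_label_fn, only an approximation of the behavior described in this thread (fields whose names end in `_label` are joined into the document label); the function name and sample fields are hypothetical.

```python
import numpy as np

def label_fn_by_suffix(metadata, suffix="_label"):
    """Join every field whose name ends with `suffix` into one label per row."""
    names = [n for n in metadata.dtype.names if n.endswith(suffix)]
    return np.array(
        [", ".join(str(row[n]) for n in names) for row in metadata]
    )

# Illustrative record: two *_label fields and one field that is ignored.
meta = np.array(
    [(0, "Vol. 1", "Article A", "1892"), (5, "Vol. 1", "Article B", "1905")],
    dtype=[
        ("idx", "i8"),
        ("issue_label", "U16"),
        ("article_label", "U16"),
        ("year", "U4"),
    ],
)
suffix_labels = label_fn_by_suffix(meta)
```

Here `year` is stored alongside the labels but never shows up in them, which is the split between label fields and other fields asked about above.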
In the future it may be useful if, when creating a corpus, a user can specify which field of her metadata goes to the document labels and which go elsewhere.
Yes, I completely agree. The htrc extension contains, in pieces, what is essentially a more elaborate and specialized corpus builder function. It would be nice to have a generic version (as well as a label function generator). I wonder if Doori would have any thoughts on this.
I agree that users can benefit from a generic label function generator. Jun is right that def_label_fn includes fields that end with 'label'. But with the new generic label function, I think the user can just specify, say in a list parameter, what she wants to display in the viewing step. In other words, I don't think it is necessary to specify what you want to view when you are creating a corpus, since the information you want to view may be different for different tasks.
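A label function generator along these lines might look like the sketch below, where the fields to display are chosen at viewing time rather than at corpus-building time. `make_label_fn`, its `sep` parameter, and the sample fields are all hypothetical.

```python
import numpy as np

def make_label_fn(fields, sep=", "):
    """Build a label function that displays the given metadata fields, in order."""
    fields = list(fields)

    def label_fn(metadata):
        missing = [f for f in fields if f not in metadata.dtype.names]
        if missing:
            raise KeyError(f"unknown metadata fields: {missing}")
        return np.array(
            [sep.join(str(row[f]) for f in fields) for row in metadata]
        )

    return label_fn

# Illustrative record with no *_label naming requirement.
records = np.array(
    [(0, "On Sense and Reference", "Frege", 1892),
     (1, "On Denoting", "Russell", 1905)],
    dtype=[("idx", "i8"), ("title", "U32"), ("author", "U16"), ("year", "i4")],
)

# Different tasks can ask for different views of the same corpus.
by_title = make_label_fn(["title"])
by_author_year = make_label_fn(["author", "year"])
title_view = by_title(records)
author_view = by_author_year(records)
```

The corpus itself is unchanged between the two views; only the function passed in at the viewing step differs.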
The reason for the *_label convention was to facilitate 'rapid' corpus building and later viewing. I.e., a user could feed a list of labels into a corpus builder function like toy_corpus and later run dist_doc_doc and see those labels without ever having to think about metadata fields, etc.
Let's think about a way to achieve this goal and also to have the flexibility desired above. I agree that the *_label convention smacks of "hard-coding"....
Thanks for your thoughts, guys! In any case, I think we want to keep def_label_fn along with the current convention that every field ending with _label goes to the document name; otherwise we end up rebuilding all existing corpora.
I have written, for my own purposes, a corpus builder function that reads json files (not pushed yet), so for the moment I'm going to revise it to support more complex fields.
I've pushed a tentative version of json_corpus (in corpusbuilder.py) to the develop branch. It builds a corpus out of a json file. There are two metadata fields: document_label and metadata. You specify which key in the json file is to be used for labels. All the rest are gathered in metadata.
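The label/metadata split described here can be sketched as follows. This is not the json_corpus code itself; the helper `split_records` is hypothetical, and it assumes the json file holds a list of objects, one per document, which may not match the real input layout.

```python
import json

def split_records(json_text, label_key):
    """Split JSON records into document labels and leftover metadata.

    Assumes `json_text` is a JSON list of objects, one per document.
    Everything except `label_key` is re-serialized into a metadata string.
    """
    records = json.loads(json_text)
    labels, metadata = [], []
    for record in records:
        labels.append(str(record[label_key]))
        rest = {k: v for k, v in record.items() if k != label_key}
        metadata.append(json.dumps(rest, sort_keys=True))
    return labels, metadata

# Illustrative input only.
text = json.dumps([
    {"title": "On Sense and Reference", "author": "Frege", "year": 1892},
    {"title": "On Denoting", "author": "Russell", "year": 1905},
])
document_label, metadata = split_records(text, label_key="title")
```

Keeping the leftover metadata as a string means it fits the current str-only constraint while still being recoverable with `json.loads`.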
Currently, Corpus supports only strings as metadata for each document, usually the name of the document. But we sometimes want richer metadata, including author name, publication date, etc. What would be the best way to store it?
One solution, as proposed in our meeting last week, is to cast all the metadata, maybe stored in a json format, into the current str format. The problem with this is not only that it's ugly but also that it poses an issue for the sim_*_doc methods, which list documents by their metadata. (So you'll see all your json metadata in the resulting list. Not so pretty.)
The second solution would be to allow a dict as document metadata. The downside of this is that we have to decide which key to use in sim_*_doc results.
So we have two options, each with its issue. Is there an alternative way? If not, I personally prefer the second. Let me hear your thoughts.