vnadgir / dkpro-core-asl

Automatically exported from code.google.com/p/dkpro-core-asl

Add InvertedIndexWriter #605

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Some (external) software packages, e.g. 
https://code.google.com/p/princeton-statistical-learning/, require inverted 
indexes for handling their data. I propose to implement a writer that writes 
such an index into a file in the simple format:

<token>\t<index>

To my knowledge, no such component exists in DKPro yet, does it?

Original issue reported on code.google.com by carsc...@gmail.com on 31 Mar 2015 at 5:09

GoogleCodeExporter commented 9 years ago

Original comment by schno...@ukp.informatik.tu-darmstadt.de on 31 Mar 2015 at 5:10

GoogleCodeExporter commented 9 years ago
We always used Lucene when we had to build an inverted index. 

I don't understand your simple format. To me, an inverted index is a two-column 
table in which the first column is the token and the second column is a list of 
documents in which this token occurs (for efficiency, best encoded using some 
delta-type encoding).

Original comment by richard.eckart on 1 Apr 2015 at 8:50
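
For readers unfamiliar with the structure described in the comment above, here is a minimal Python sketch of an inverted index with delta-encoded postings. It is only an illustration: the function names `build_inverted_index` and `delta_encode` are made up for this example, tokenization is plain whitespace splitting, and none of it reflects how Lucene or any DKPro component is implemented.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        # Whitespace tokenization is an assumption made for this sketch only.
        for token in text.split():
            postings[token].add(doc_id)
    return {token: sorted(ids) for token, ids in postings.items()}


def delta_encode(doc_ids):
    """Keep the first id as-is, then store each gap to the previous id."""
    return [doc_id - prev for prev, doc_id in zip([0] + doc_ids, doc_ids)]


docs = ["this is a test", "this is another test", "a final test"]
index = build_inverted_index(docs)
encoded = {token: delta_encode(ids) for token, ids in index.items()}
```

Gaps between sorted document ids are usually much smaller than the ids themselves, which is why delta-type encodings tend to compress posting lists well.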

GoogleCodeExporter commented 9 years ago
Right, I suppose my terminology has been unclear (or wrong). I have actually 
had a simple token mapping "token -> index" in mind.
My motivation has been that the above-mentioned software does not operate on 
strings, but on token indexes.

Original comment by schno...@ukp.informatik.tu-darmstadt.de on 1 Apr 2015 at 8:57

GoogleCodeExporter commented 9 years ago
What's a token index then, simply the running number of the token like this?

1 this
2 is
3 a
4 test

?

Original comment by richard.eckart on 1 Apr 2015 at 8:58

GoogleCodeExporter commented 9 years ago
Yes. With distinct tokens only.

Original comment by schno...@ukp.informatik.tu-darmstadt.de on 1 Apr 2015 at 8:59
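
The mapping agreed on here (each distinct token gets a running number) could be sketched in Python as follows. This is a hedged illustration, not a DKPro component: `build_token_index` and `write_token_index` are hypothetical names, numbering from 1 follows the example in the thread, and the output uses the `<token>\t<index>` layout proposed in the issue description.

```python
import io


def build_token_index(tokens, start=1):
    """Assign a running number to each distinct token, in order of first occurrence."""
    mapping = {}
    for token in tokens:
        if token not in mapping:
            mapping[token] = len(mapping) + start
    return mapping


def write_token_index(mapping, out):
    """Write one '<token>\\t<index>' line per distinct token."""
    for token, idx in mapping.items():
        out.write(f"{token}\t{idx}\n")


tokens = "this is a test this is".split()
mapping = build_token_index(tokens)

buffer = io.StringIO()  # stands in for a real output file
write_token_index(mapping, buffer)
result = buffer.getvalue()
```

Repeated tokens keep the index of their first occurrence, so the file contains exactly one line per distinct token.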

GoogleCodeExporter commented 9 years ago

Original comment by schno...@ukp.informatik.tu-darmstadt.de on 1 Apr 2015 at 9:06

GoogleCodeExporter commented 9 years ago
Sounds a bit like an n-gram writer with n set to 1 plus added indexes. The web1t 
module contains such an n-gram writer and supports very large numbers of tokens. 
Probably that can be easily extended to optionally prefix each token with a 
running number (and to omit the token count). The n-gram writer in web1t also 
already supports parameters for indexing other information than the token text, 
e.g. lemma.

Original comment by richard.eckart on 1 Apr 2015 at 9:15

GoogleCodeExporter commented 9 years ago
Yes, sounds right. I was not aware of that and will check whether it actually 
has the functionality I need.

Original comment by schno...@ukp.informatik.tu-darmstadt.de on 1 Apr 2015 at 9:19

GoogleCodeExporter commented 9 years ago
Following the comments, this turns out to be unnecessary as a dedicated module.

Original comment by schno...@ukp.informatik.tu-darmstadt.de on 10 Apr 2015 at 10:32