Closed GoogleCodeExporter closed 9 years ago
Original comment by schno...@ukp.informatik.tu-darmstadt.de
on 31 Mar 2015 at 5:10
We always used Lucene when we had to build an inverted index.
I don't understand your simple format. To me, an inverted index is a two-column
table in which the first column is the token and the second column is a list of
the documents in which this token occurs (for efficiency, best encoded using
some delta-type encoding).
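To make the two-column picture concrete, here is a minimal sketch in Python. It is not how Lucene stores its index; the function names (`build_inverted_index`, `delta_encode`, `delta_decode`) and the whitespace tokenization are assumptions made purely for illustration.

```python
from collections import defaultdict
from itertools import accumulate


def build_inverted_index(docs):
    """Map each token to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        # Assumption: tokens are simply whitespace-separated strings.
        for token in text.split():
            postings[token].add(doc_id)
    return {token: sorted(ids) for token, ids in postings.items()}


def delta_encode(doc_ids):
    """Keep the first id as-is, then store only the gap to the previous id."""
    if not doc_ids:
        return []
    gaps = [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return [doc_ids[0]] + gaps


def delta_decode(deltas):
    """Invert delta_encode by taking the running sum of the gaps."""
    return list(accumulate(deltas))


docs = ["this is a test", "this is another test", "a test"]
index = build_inverted_index(docs)
encoded = {token: delta_encode(ids) for token, ids in index.items()}
```

Because the document ids in a postings list are sorted, the gaps are small non-negative numbers, which is what makes a subsequent variable-length integer encoding compact.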
Original comment by richard.eckart
on 1 Apr 2015 at 8:50
Right, I suppose my terminology has been unclear (or wrong). I have actually
had a simple token mapping "token -> index" in mind.
My motivation has been that the above-mentioned software does not operate on
strings, but on token indexes.
Original comment by schno...@ukp.informatik.tu-darmstadt.de
on 1 Apr 2015 at 8:57
What's a token index then, simply the running number of the token like this?
1 this
2 is
3 a
4 test
?
Original comment by richard.eckart
on 1 Apr 2015 at 8:58
Yes. With distinct tokens only.
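For illustration, a mapping of this kind could be sketched as follows. This is only an assumption about the intended format: the helper name `build_token_index`, the whitespace tokenization, and the 1-based numbering (taken from the example above) are not confirmed anywhere in this thread.

```python
def build_token_index(tokens, start=1):
    """Assign a running number to each distinct token, in order of first occurrence."""
    index = {}
    for token in tokens:
        if token not in index:
            # Repeated tokens keep the number they received the first time.
            index[token] = len(index) + start
    return index


token_index = build_token_index("this is a test this is".split())
for token, number in token_index.items():
    print(number, token)
```

Software that operates on token indexes rather than strings could then replace each token by its number via a plain dictionary lookup.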
Original comment by schno...@ukp.informatik.tu-darmstadt.de
on 1 Apr 2015 at 8:59
Original comment by schno...@ukp.informatik.tu-darmstadt.de
on 1 Apr 2015 at 9:06
Sounds a bit like an n-gram writer with n set to 1 plus added indexes. The web1t
module contains such an n-gram writer and supports very large numbers of tokens.
It could probably be extended easily to optionally prefix each token with a
running number (and to omit the token count). The n-gram writer in web1t also
already supports parameters for indexing information other than the token text,
e.g. the lemma.
Original comment by richard.eckart
on 1 Apr 2015 at 9:15
Yes, sounds right. I was not aware of that and will check whether it actually
has the functionality I need.
Original comment by schno...@ukp.informatik.tu-darmstadt.de
on 1 Apr 2015 at 9:19
Following the comments, this turns out to be unnecessary as a dedicated module.
Original comment by schno...@ukp.informatik.tu-darmstadt.de
on 10 Apr 2015 at 10:32
Original issue reported on code.google.com by
carsc...@gmail.com
on 31 Mar 2015 at 5:09