tballison / lucene-addons

Standalone versions of LUCENE_5205 and other patches: SpanQueryParser, Concordance and Co-occurrence stats
Apache License 2.0

More information on co-occurrence collector #56

Open glorieux-f opened 5 years ago

glorieux-f commented 5 years ago

Hi,

I'm implementing a linguistic tool with Lucene and found your project via Google (http://lucene.472066.n3.nabble.com/Cooccurrence-matrices-td4296809.html). I need some more information to understand your approach.

My project is not finished, but its main lines are designed. I'm very concerned about performance, since I want to dig through big corpora. I've implemented a lemmatizer (for French only) as an analyzer, emitting more than one token per position (lemma, POS).

For a concordancer, I read all the highlighter proposals and found them quite expensive. I've adopted a brute-force approach: I store a binary array of character offsets, by position, for each document. So, when I have found the positions I'm interested in with a query (I use term vectors instead of SpanQueries), I can easily get the offsets of the terms around the pivot and display the original context.
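Something like this, concretely (a sketch only: the field name and the fixed two-ints-per-position encoding are illustrative, not my actual code):

```java
import java.nio.ByteBuffer;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.util.BytesRef;

public class OffsetPacking {

    // Index time: pack (startOffset, endOffset) for every position into one byte[]
    // and store it alongside the document.
    static StoredField packOffsets(String fieldName, int[] starts, int[] ends) {
        ByteBuffer buf = ByteBuffer.allocate(starts.length * 8);
        for (int pos = 0; pos < starts.length; pos++) {
            buf.putInt(starts[pos]).putInt(ends[pos]);
        }
        return new StoredField(fieldName, buf.array());
    }

    // Search time: jump straight to the char offsets of any position,
    // e.g. the positions around the pivot, without re-analyzing anything.
    static int[] offsetsAt(Document doc, String fieldName, int pos) {
        BytesRef bytes = doc.getBinaryValue(fieldName);
        // slice() so absolute indexing starts at the BytesRef's own offset
        ByteBuffer buf = ByteBuffer.wrap(bytes.bytes, bytes.offset, bytes.length).slice();
        return new int[] { buf.getInt(pos * 8), buf.getInt(pos * 8 + 4) };
    }
}
```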

My problem now is to collect co-occurrences in a fast manner. I'm reading your code, but I'm sure some explanations would help me understand it faster.

Thanks in advance for your advice.

glorieux-f commented 5 years ago

I'm not sure I've understood; it seems you re-analyze documents. org.tallison.lucene.search.concordance.charoffsets.ReanalyzingTokenCharOffsetsReader is too expensive for me.

glorieux-f commented 5 years ago

By the way, I found some references to BitSet in your code (RandomAccessCharOffsetContainer, TokenCharOffsetRequests), but I'm not sure you are using them. What did I miss?

tballison commented 5 years ago

Sorry for the delay. The co-occurrence code, as you pointed out, is not optimized for performance. It does perform re-analysis. Even on corpora of a few million documents, Lucene is fast enough for reasonable calculations in the UI for ~20k "windows"... the occurrences and their co-occurrences.
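The core of the re-analysis is, schematically (a simplified sketch, not the actual ReanalyzingTokenCharOffsetsReader code):

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class ReanalysisSketch {

    // Re-run the analyzer over the stored field text and record the char offsets
    // of each token position. This is the cost paid per matching document.
    static Map<Integer, int[]> offsetsByPosition(Analyzer analyzer, String field,
                                                 String storedText) throws IOException {
        Map<Integer, int[]> offsets = new HashMap<>();
        try (TokenStream ts = analyzer.tokenStream(field, storedText)) {
            OffsetAttribute offAtt = ts.addAttribute(OffsetAttribute.class);
            PositionIncrementAttribute posAtt =
                    ts.addAttribute(PositionIncrementAttribute.class);
            ts.reset();
            int position = -1;
            while (ts.incrementToken()) {
                position += posAtt.getPositionIncrement();
                offsets.put(position,
                        new int[] { offAtt.startOffset(), offAtt.endOffset() });
            }
            ts.end();
        }
        return offsets;
    }
}
```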

At some point, I did a timing study of uninverting the index instead of reanalyzing, and reanalyzing was faster on most queries. This was not a robust performance comparison, and the results may be different now. The key underlying issue, as you know, is that Lucene's inverted index is not designed at all for co-occurrence stuff.
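One non-re-analysis route, for comparison, is to read tokens back from stored term vectors, roughly like this (a sketch; it assumes term vectors were indexed with positions and offsets, and it is not the code I benchmarked):

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

public class TermVectorSketch {

    // Read every token of one document back from its term vector,
    // with positions and char offsets, without touching the analyzer.
    static void walkTermVector(IndexReader reader, int docId, String field) throws Exception {
        Terms vector = reader.getTermVector(docId, field);
        if (vector == null) {
            return; // term vectors not stored for this doc/field
        }
        TermsEnum termsEnum = vector.iterator();
        BytesRef term;
        while ((term = termsEnum.next()) != null) {
            PostingsEnum postings = termsEnum.postings(null, PostingsEnum.ALL);
            postings.nextDoc(); // a term vector behaves like a one-document index
            for (int i = 0; i < postings.freq(); i++) {
                int position = postings.nextPosition();
                int start = postings.startOffset();
                int end = postings.endOffset();
                // record (term, position, start, end) for the concordance window
            }
        }
    }
}
```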

I did have a project once where I indexed shingles and created Lucene documents with a target shingle in the "key" field, a field for the word(s) that appeared immediately before it, and a field for the word(s) immediately after it. This allowed for crazily fast computation, of course, but the downside is that the index is huge. On the upside, I could do word2vec-like computations on n-gram co-occurrence similarity 10 years ago.
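Schematically, each shingle occurrence became its own tiny document, something like this (the "key" field is as described; the "before"/"after" names here are just illustrative):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class ShingleContextIndex {

    // One Lucene document per shingle occurrence: the shingle as the key,
    // plus the words immediately to its left and right.
    static void addShingleDoc(IndexWriter writer, String shingle,
                              String before, String after) throws Exception {
        Document doc = new Document();
        doc.add(new StringField("key", shingle, Store.YES));
        doc.add(new TextField("before", before, Store.YES));
        doc.add(new TextField("after", after, Store.YES));
        writer.addDocument(doc);
    }
}
```

A term query on "key" plus term counting over "before"/"after" then gives the co-occurrence distribution almost for free; the price is one document per shingle occurrence, which is why the index gets huge.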

> By the way, I found some references to BitSet in your code (RandomAccessCharOffsetContainer, TokenCharOffsetRequests), but I'm not sure you are using them. What did I miss?

This may be a relic of earlier code. I'll take a look. The IDE for this project when it started was Notepad, and it was originally used with Lucene 2.9, IIRC.

glorieux-f commented 5 years ago

I started Lucene with Notepad too.

> word2vec-like computations on n-gram co-occurrence similarity

You get it, that's my goal; the problem is offering it as a webapp. I already have code doing that, but without good persistence.

> Lucene's inverted index is not designed at all for co-occurrence stuff

Maybe I will have to change my mind and follow your way, but I'm trying another approach, to provide a word cloud of the co-occurrences. Instead of looping over the matched docs and updating frequencies after each doc, I will loop over the global list of terms of a field, in frequency order. The user may perceive it as faster, because each word can be sent to him with its final frequency (a progressive word cloud). The drawback: he will have to wait for specific terms that are low in the global frequency list but high in the co-occurrences.
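In Lucene terms, I imagine the outer loop something like this (a sketch using lucene-misc's HighFreqTerms; the actual counting against the pivot's windows is elided):

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.misc.HighFreqTerms;
import org.apache.lucene.misc.TermStats;

public class ProgressiveCloud {

    // Walk the field's terms from most to least frequent, finish each term's
    // co-occurrence count before moving on, and stream it to the client at once.
    static void buildCloud(IndexReader reader, String field, int maxTerms) throws Exception {
        TermStats[] byFreq = HighFreqTerms.getHighFreqTerms(
                reader, maxTerms, field, new HighFreqTerms.DocFreqComparator());
        for (TermStats stats : byFreq) {
            String term = stats.termtext.utf8ToString();
            // count how often `term` falls inside the pivot's co-occurrence windows,
            // then emit (term, finalCount) immediately: the cloud grows progressively
        }
    }
}
```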

tballison commented 5 years ago

> You get it, that's my goal; the problem is offering it as a webapp. I already have code doing that, but without good persistence.

Should I offer more details of my algorithm for that, or are you good?

glorieux-f commented 5 years ago

> Should I offer more details of my algorithm for that, or are you good?

Thanks, but I'm not yet ready to ingest things.