src-d / wmd-relax

Calculates Word Mover's Distance Insanely Fast

Corpus × corpus distances #47

Open Witiko opened 5 years ago

Witiko commented 5 years ago

Have you considered the speed improvements that might result from computing distances between two corpora (queries × document collection) at once? With cosine similarity, this is simply a dot product of two term-document matrices. With network flows, perhaps one could construct a single large network in which the same word in different documents appears as distinct nodes? Running a for loop over nearest_neighbors is not very fast even with the heuristics you implemented.
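For reference, the cosine-similarity case reduces to one matrix product; a minimal NumPy sketch with random stand-in data (the matrices Q and X here are hypothetical placeholders for the query and collection term-document matrices):

```python
import numpy as np

# Q: |V| x |queries|, X: |V| x |documents|, one document per column.
rng = np.random.default_rng(0)
Q = rng.random((1000, 50))      # stand-in term-document matrix of queries
X = rng.random((1000, 5000))    # stand-in term-document matrix of documents
Q /= np.linalg.norm(Q, axis=0)  # L2-normalize each column
X /= np.linalg.norm(X, axis=0)

similarities = Q.T @ X          # one BLAS call covers all query-document pairs
```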

vmarkovtsev commented 5 years ago

Thanks for an interesting idea! There can be a problem with building a big network, though: the EMD calculation chokes at 1,000 nodes and becomes impractically slow at 2,000; the cubic complexity is to blame. On the other hand, there are tricks to reduce the problem size, and we can also subsample by significance. Are you aware of any papers about this?

Witiko commented 5 years ago

I was thinking of computing the subset graph of all document sets (i.e. BOWs with ones and zeros) in O(N² / log N) (see Yellin and Jutla, 1993), where N is the sum of set cardinalities, i.e. N = |D| · |V|, where |D| is the number of documents and |V| is the size of the dictionary. By Heaps' law, |V| ≈ √|D|, so the time complexity is O(|D|³ / log |D|). However, this is asymptotically slower than the O(|D|² |V| log |V|) ≈ O(|D|^2.5 log |D|) time currently required to compute all pairwise distances (not to mention that the number of words two documents have in common will be less than |V|), so this still needs more thinking through.
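Spelling out the arithmetic behind these two bounds, under the Heaps' law approximation above:

```latex
% Subset-graph construction (Yellin and Jutla, 1993):
N = |D| \cdot |V| \approx |D|^{3/2}
  \implies \frac{N^2}{\log N}
           \approx \frac{|D|^3}{\frac{3}{2} \log |D|}
         = O\!\left(\frac{|D|^3}{\log |D|}\right)
% All pairwise distances, one comparison per document pair:
|D|^2 \cdot |V| \log |V| \approx |D|^{5/2} \cdot \frac{1}{2} \log |D|
         = O\!\left(|D|^{2.5} \log |D|\right)
```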

If we had the subset graph, then for every nearest_neighbors call, we could filter out strict subsets of strict subsets of the query, unless this leaves us with fewer than k documents (a sketch of the pruning rule follows below). This is correct because a missing word cannot decrease the word mover's distance if the cost is a metric. I do not have an estimate of how many documents this filters out, but intuitively, it would be a large portion of D.
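A minimal sketch of that pruning rule, assuming two precomputed boolean arrays (both names are hypothetical; a later comment discusses how one might obtain them): strict_subset[i, j] is true iff collection document i is a strict subset of collection document j, and of_query[j] is true iff collection document j is a strict subset of the query.

```python
import numpy as np

def prune_candidates(strict_subset, of_query, k):
    """Drop documents that are strict subsets of strict subsets of the query,
    unless that would leave fewer than k candidates."""
    # document i is dropped iff it is a strict subset of some document j
    # that is itself a strict subset of the query
    dropped = strict_subset[:, of_query].any(axis=1)
    if len(of_query) - dropped.sum() < k:
        return np.arange(len(of_query))  # too aggressive: keep everything
    return np.flatnonzero(~dropped)
```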

Witiko commented 5 years ago

Although an explicit subset-graph construction seems infeasible, there is a linear-algebra procedure that allows us to filter out strict subsets of strict subsets of the query (see paragraph 2 of the above post):

Let A and B be term-document matrices with terms in rows and documents in columns:

1. Take S = sign(A)ᵀ · sign(B), so that Sᵢⱼ counts the distinct words shared by document i of A and document j of B.
2. Divide each column of S elementwise by the vector of column sums of sign(A), i.e. by the numbers of distinct words in the documents of A; assign 1 to cells where division by zero (0 / 0) takes place; floor the resulting matrix, and let the result be M.
3. Divide each row of S elementwise by the vector of column sums of sign(B), i.e. by the numbers of distinct words in the documents of B; assign 1 to the 0 / 0 cells; floor the resulting matrix, and let the result be N.

Then a document j in B is a subset of a document i in A iff Nᵢⱼ = 1, and a strict subset iff Mᵢⱼ = 0 and Nᵢⱼ = 1.
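A minimal NumPy sketch of this procedure, assuming dense |V| × |D| integer term-document matrices (a sparse implementation would follow the same steps):

```python
import numpy as np

def subset_matrices(A, B):
    """Return (M, N) such that, for document i of A and document j of B,
    M[i, j] == 1 iff doc i of A is a subset of doc j of B, and
    N[i, j] == 1 iff doc j of B is a subset of doc i of A."""
    sA = np.sign(A)           # binary term-presence matrix of A
    sB = np.sign(B)           # binary term-presence matrix of B
    S = sA.T @ sB             # S[i, j] = number of distinct shared words
    sizes_A = sA.sum(axis=0)  # distinct words per document of A
    sizes_B = sB.sum(axis=0)  # distinct words per document of B
    with np.errstate(divide="ignore", invalid="ignore"):
        M = np.floor(S / sizes_A[:, np.newaxis])  # S[i, j] / |doc i of A|
        N = np.floor(S / sizes_B[np.newaxis, :])  # S[i, j] / |doc j of B|
    # 0 / 0 yields NaN for empty documents, which are subsets of everything
    M[np.isnan(M)] = 1
    N[np.isnan(N)] = 1
    return M, N
```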

Let Q and C be the term-document matrices of the queries and of the collection documents, respectively. Computing M and N twice, once for A = Q, B = C and again for A = C, B = C, has the worst-case time complexity O(|D|² |V|) ≈ O(|D|^2.5). In theory, this only shaves the log factor off the O(|D|² |V| log |V|) ≈ O(|D|^2.5 log |D|) cost of computing all pairwise distances, but an optimized implementation of the matrix dot product can well make this worthwhile in practice.
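For illustration, hypothetical glue code tying together the two sketches above (subset_matrices and prune_candidates) for a single query, column q of Q:

```python
M_qc, N_qc = subset_matrices(Q, C)  # queries vs. collection documents
M_cc, N_cc = subset_matrices(C, C)  # collection vs. collection

q = 0  # index of one query column of Q
# collection document j is a strict subset of query q:
of_query = (N_qc[q] == 1) & (M_qc[q] == 0)
# collection document i is a strict subset of collection document j:
strict_subset = (M_cc == 1) & (N_cc == 0)

candidates = prune_candidates(strict_subset, of_query, k=10)
```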

I would like to benchmark this in the future.