damian0604 closed 5 years ago
The bottleneck indeed seems to be the estimation of the matrix within the for-loop.
Possible approaches:
Actually estimate the whole similarity matrix for all possible comparisons, even though we do not need all entries. This would move the estimation of the matrix out of the loop and leave only the query in. This may be less stupid than it sounds: according to https://radimrehurek.com/gensim/similarities/docsim.html#gensim.similarities.docsim.Similarity, 1 million documents take only 1 GB of RAM, and one can specify a tmpfile for outsourcing to disk.
The same documentation also says that it is possible to add and remove documents. So maybe an option is a sliding (time) window that adds and removes documents? In other words, instead of re-estimating the matrix, add new entries and remove earlier ones?
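To make the first approach concrete, here is a minimal sketch of "compute everything once, filter by date afterwards". It is not the code from our branch: the helper names (`all_pairs`, `within_window`), the document fields and the toy articles are made up for illustration, and plain bag-of-words cosine stands in for the soft cosine measure so that the example runs without gensim.

```python
from collections import Counter
from datetime import date
from math import sqrt


def cosine(a, b):
    """Plain cosine similarity between two bag-of-words Counters (stand-in for soft cosine)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def all_pairs(docs):
    """Compute every source/target similarity once, outside any per-source loop."""
    bows = [Counter(d["text"].lower().split()) for d in docs]
    return [[cosine(s, t) for t in bows] for s in bows]


def within_window(docs, sims, days_before, days_after):
    """Keep only the pairs whose target falls inside the source's date window."""
    kept = []
    for i, src in enumerate(docs):
        for j, tgt in enumerate(docs):
            if i == j:
                continue  # skip comparing a document with itself
            delta = (tgt["date"] - src["date"]).days
            if -days_before <= delta <= days_after:
                kept.append((src["id"], tgt["id"], sims[i][j]))
    return kept


# Hypothetical toy data: two articles on one day, one article eight days later.
docs = [
    {"id": "a", "date": date(2018, 10, 18), "text": "council votes on budget"},
    {"id": "b", "date": date(2018, 10, 18), "text": "budget vote in council"},
    {"id": "c", "date": date(2018, 10, 26), "text": "council votes on budget again"},
]

sims = all_pairs(docs)  # estimated once
pairs = within_window(docs, sims, days_before=2, days_after=2)  # cheap filter
```

With this toy data only the a/b and b/a pairs survive the filter; document c has no target inside its window and therefore contributes no pairs at all.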
I'll think about it a bit more...
Addressed in PR #453
In the softcosine analysis, it is possible to limit the comparisons to be made with the days_before/days_after parameters. I found that this didn't work in the branch we pushed, so I changed some things around (in the branch similarities2), and now it does work (as does supplying multiple doctypes). However, I found that this method takes MUCH longer than calculating all the comparisons (see below) and then limiting them afterwards. The example below is done with 20 nu.nl articles, of which 10 were published on the 18th of October and the other 10 on the 26th, which means that in theory the restricted method should be faster.
I think it has to do with creating the similarity matrix, which takes a considerable amount of time. In the days_before/days_after method we create a list of relevant targets for each source, and the similarity matrix then has to be created separately for each of these lists. Perhaps starting this several times for shorter lists takes longer than starting it once for a longer list? I cannot think of another way to do this. Any ideas?
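To illustrate the trade-off, here is a small sketch that only counts work and does not time anything. It mimics the example (10 articles on the 18th, 10 on the 26th), but the year, the ids and the function name `targets_per_source` are assumptions made for illustration; this is not the code in similarities2.

```python
from datetime import date, timedelta


def targets_per_source(docs, days_before, days_after):
    """Map each source id to the ids of the targets inside its date window."""
    result = {}
    for src in docs:
        lo = src["date"] - timedelta(days=days_before)
        hi = src["date"] + timedelta(days=days_after)
        result[src["id"]] = [
            t["id"] for t in docs if t is not src and lo <= t["date"] <= hi
        ]
    return result


# Hypothetical stand-in for the 20 articles: 10 per publication date.
docs = [
    {"id": f"d{i}", "date": date(2018, 10, 18 if i < 10 else 26)}
    for i in range(20)
]

lists = targets_per_source(docs, days_before=2, days_after=2)

index_builds_windowed = len(lists)                          # one matrix per source
comparisons_windowed = sum(len(v) for v in lists.values())  # 20 sources * 9 targets
comparisons_full = len(docs) * (len(docs) - 1)              # one matrix, all pairs
```

In this count the windowed method makes 180 comparisons instead of 380, but it sets up 20 matrices instead of one. If the setup cost dominates, that alone could make the windowed method slower.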
Also, I found that in the output I have several sources without a target or similarity score, and I can't figure out where this is coming from...
Marieke