Softcosine: performance issues

damian0604 commented 6 years ago

In the softcosine analysis, the comparisons can be limited with the days_before/days_after parameters. This didn't work in the branch we pushed, so I changed some things around (in the branch similarities2); it now works, as does supplying multiple doctypes. However, I found that this method takes MUCH longer than calculating all the comparisons and limiting them afterwards (see below). The example below uses 20 nu.nl articles, 10 published on the 18th of October and the other 10 on the 26th, so in theory the restricted run should be faster, not slower.

I think it has to do with creating the similarity matrix, which takes a considerable amount of time. In the days_before/days_after method we create a list of relevant targets for each source, and the similarity matrix then has to be created for each of those lists. Perhaps starting this several times for shorter lists takes longer than starting it once for a longer list? I cannot think of another way to do this, though. Any ideas?
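
To make the per-source target lists concrete, here is a minimal sketch of the windowing step only. It is not the INCA code: the function name, the (doc_id, publication_date) tuple layout and the choice to skip self-comparisons are all assumptions for illustration.

```python
from datetime import date, timedelta


def targets_in_window(sources, targets, days_before, days_after):
    """Map each source id to the target ids published inside its date window.

    `sources` and `targets` are lists of (doc_id, publication_date) tuples.
    This layout is an assumption for illustration, not INCA's data model.
    """
    selected = {}
    for source_id, source_date in sources:
        earliest = source_date - timedelta(days=days_before)
        latest = source_date + timedelta(days=days_after)
        selected[source_id] = [
            target_id
            for target_id, target_date in targets
            # Assumption: a document is never compared with itself.
            if earliest <= target_date <= latest and target_id != source_id
        ]
    return selected


# Toy data mirroring the example: 10 articles on 18 Oct, 10 on 26 Oct 2018.
docs = [(f"a{i}", date(2018, 10, 18)) for i in range(10)]
docs += [(f"b{i}", date(2018, 10, 26)) for i in range(10)]

windows = targets_in_window(docs, docs, days_before=1, days_after=1)
pairs_windowed = sum(len(t) for t in windows.values())  # 20 * 9 = 180
pairs_all = len(docs) * (len(docs) - 1)                 # 20 * 19 = 380
```

Under these assumptions the windowed run has fewer than half as many pairs to score, so if it is slower, the extra time has to come from per-list setup work rather than from the comparisons themselves.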

Also, I found that in the output I have several sources without a target or similarity score, and I can't figure out where this is coming from...

Marieke

In [3]: myinca.analysis.softcosine_similarity.fit('/media/sf_virtualbox_folder/mymodel', 'nu', 'nu', days_before=1, days_after=1, to_csv=True)
INFO:INCA:The results of the similarity analysis could be inflated when not using the recommended text processing steps (stopword removal, punctuation removal, stemming) beforehand
100%|███████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 100.21it/s]
100%|███████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 262.44it/s]
2018-11-06 16:20:57.917742
1 out of 19
2 out of 19
3 out of 19
4 out of 19
5 out of 19
6 out of 19
7 out of 19
8 out of 19
9 out of 19
10 out of 19
11 out of 19
12 out of 19
13 out of 19
14 out of 19
15 out of 19
16 out of 19
17 out of 19
18 out of 19
19 out of 19
2018-11-06 16:24:55.709522

In [4]: myinca.analysis.softcosine_similarity.fit('/media/sf_virtualbox_folder/mymodel', 'nu', 'nu', to_csv=True)
INFO:INCA:The results of the similarity analysis could be inflated when not using the recommended text processing steps (stopword removal, punctuation removal, stemming) beforehand
100%|███████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 266.76it/s]
100%|███████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 384.62it/s]
2018-11-06 16:28:50.159596
2018-11-06 16:29:11.789980
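
For reference, the elapsed times implied by the start and end timestamps printed above can be worked out with a few lines (a small sketch; the helper name is illustrative). It gives roughly 238 seconds for the days_before/days_after run against roughly 22 seconds for the full run, a factor of about 11.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed_seconds(start, end):
    """Seconds between two timestamps in the format printed in the logs."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()


# Timestamps copied from the two runs above.
windowed = elapsed_seconds("2018-11-06 16:20:57.917742", "2018-11-06 16:24:55.709522")
full = elapsed_seconds("2018-11-06 16:28:50.159596", "2018-11-06 16:29:11.789980")

print(f"windowed: {windowed:.1f} s, full: {full:.1f} s, ratio: {windowed / full:.1f}x")
```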

damian0604 commented 6 years ago

The bottleneck indeed seems to be the estimation of the matrix within the for-loop.

Possible approaches:

Actually estimate the whole similarity matrix for all possible comparisons, even though we do not need all entries. This would move the estimation of the matrix out of the loop and leave only the query in. This may be less stupid than it sounds: according to https://radimrehurek.com/gensim/similarities/docsim.html#gensim.similarities.docsim.Similarity, 1 million documents take only 1 GB of RAM, and one can specify a tmpfile for outsourcing to disk.
The same documentation page also specifies that it is possible to add and remove documents. So maybe a sliding (time) window is an option: instead of re-estimating the matrix, add new entries and remove earlier ones?
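
A rough sketch of the first approach, with the expensive structure built once outside the loop and only the lookup left inside it. This deliberately does not use gensim: plain cosine similarity over token counts stands in for the soft cosine measure, and the function names and the (doc_id, publication_date, text) layout are assumptions for illustration.

```python
import math
from collections import Counter
from datetime import date, timedelta


def cosine(a, b):
    """Plain cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarities_in_window(docs, days_before, days_after):
    """Build the shared vectors once, then only query inside the loop.

    `docs` is a list of (doc_id, publication_date, text) tuples
    (a hypothetical layout, not INCA's data model).
    """
    # Stand-in for the expensive step: done exactly once, outside the loop.
    vectors = {doc_id: Counter(text.lower().split()) for doc_id, _, text in docs}

    results = []
    for source_id, source_date, _ in docs:
        earliest = source_date - timedelta(days=days_before)
        latest = source_date + timedelta(days=days_after)
        for target_id, target_date, _ in docs:
            # Skip self-comparisons and targets outside the date window.
            if target_id == source_id or not (earliest <= target_date <= latest):
                continue
            score = cosine(vectors[source_id], vectors[target_id])
            results.append((source_id, target_id, score))
    return results


docs = [
    ("a", date(2018, 10, 18), "storm hits coast"),
    ("b", date(2018, 10, 18), "storm hits coast hard"),
    ("c", date(2018, 10, 26), "storm hits coast"),
]
results = similarities_in_window(docs, days_before=1, days_after=1)
```

With this toy data only the pairs (a, b) and (b, a) survive the date filter; c has no target inside its window even though its text matches a exactly.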

I'll think about it a bit more...
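
The sliding-window idea could look roughly like the sketch below. It only tracks which documents are inside the window at each step; a plain deque stands in for whatever index would actually hold them, and no gensim calls are assumed. The function name and the (doc_id, publication_date) layout are hypothetical.

```python
from collections import deque
from datetime import date, timedelta


def sliding_window_pairs(docs, days_before, days_after):
    """Yield (source_id, [target_ids]) using one window that slides over time.

    `docs` is a list of (doc_id, publication_date) tuples (hypothetical layout).
    """
    ordered = sorted(docs, key=lambda d: d[1])
    window = deque()  # documents currently "in the index"
    next_to_add = 0
    for source_id, source_date in ordered:
        earliest = source_date - timedelta(days=days_before)
        latest = source_date + timedelta(days=days_after)
        # Add documents that have entered the window.
        while next_to_add < len(ordered) and ordered[next_to_add][1] <= latest:
            window.append(ordered[next_to_add])
            next_to_add += 1
        # Remove documents that have fallen out of it.
        while window and window[0][1] < earliest:
            window.popleft()
        yield source_id, [doc_id for doc_id, _ in window if doc_id != source_id]


docs = [
    ("a1", date(2018, 10, 18)),
    ("a2", date(2018, 10, 18)),
    ("b1", date(2018, 10, 19)),
    ("c1", date(2018, 10, 26)),
]
pairs = dict(sliding_window_pairs(docs, days_before=1, days_after=1))
```

Because the sources are visited in date order, every document is added once and removed once, so nothing is rebuilt from scratch per source.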

mariekevh commented 5 years ago

Addressed in PR #453

uvacw / inca

Softcosine: performance issues #450