uvacw / inca


Softcosine: performance issues #450

Closed damian0604 closed 5 years ago

damian0604 commented 6 years ago

In the softcosine analysis, the comparisons can be limited with the days_before/days_after parameters. This didn't work in the branch we pushed, so I changed some things around (in the branch similarities2); it now works, as does supplying multiple doctypes. However, this method takes MUCH longer than calculating all the comparisons and limiting them afterwards (see below). The example below uses 20 nu.nl articles, 10 published on the 18th of October and the other 10 on the 26th, so with a one-day window there are fewer comparisons to make and in theory it should be faster.

I think it has to do with creating the similarity matrix, which takes a considerable amount of time. In the days_before/days_after method we create a list of relevant targets for each source, and the similarity matrix then has to be created for each of those lists. Perhaps starting this several times for shorter lists takes longer than starting it once for a longer list? I cannot think of another way to do this, though. Any ideas?
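To make the "calculate everything once, limit afterwards" alternative concrete, here is a minimal sketch. It is not the INCA implementation: the helper names `within_window` and `filter_pairs`, the dictionary layout, and the placeholder scores are all assumptions for illustration, and I am also assuming the window is measured relative to the source's publication date.

```python
from datetime import date, timedelta
from itertools import product


def within_window(source_date, target_date, days_before=1, days_after=1):
    """True if target_date lies in [source_date - days_before, source_date + days_after]."""
    lower = source_date - timedelta(days=days_before)
    upper = source_date + timedelta(days=days_after)
    return lower <= target_date <= upper


def filter_pairs(similarities, dates, days_before=1, days_after=1):
    """Keep only (source, target) pairs whose dates fall inside the window.

    similarities: dict mapping (source_id, target_id) -> score, computed once for all pairs.
    dates: dict mapping document id -> datetime.date.
    """
    return {
        (s, t): score
        for (s, t), score in similarities.items()
        if within_window(dates[s], dates[t], days_before, days_after)
    }


# Toy data mirroring the example: two publication days, eight days apart.
dates = {"a": date(2018, 10, 18), "b": date(2018, 10, 18), "c": date(2018, 10, 26)}

# Placeholder scores (hypothetical); in practice these would come from a single
# softcosine pass over all documents.
similarities = {(s, t): 0.5 for s, t in product(dates, repeat=2) if s != t}

kept = filter_pairs(similarities, dates)
print(sorted(kept))  # [('a', 'b'), ('b', 'a')]
```

The expensive step runs once over all documents, and the date restriction becomes a cheap filter over the resulting pairs, so nothing has to be rebuilt per source.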

Also, the output contains several sources without a target or similarity score, and I can't figure out where this comes from...

Marieke

In [3]: myinca.analysis.softcosine_similarity.fit('/media/sf_virtualbox_folder/mymodel', 'nu', 'nu', days_before=1, days_after=1, to_csv=True)
INFO:INCA:The results of the similarity analysis could be inflated when not using the recommended text processing steps (stopword removal, punctuation removal, stemming) beforehand
100%|███████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 100.21it/s]
100%|███████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 262.44it/s]
2018-11-06 16:20:57.917742
1 out of 19
2 out of 19
3 out of 19
4 out of 19
5 out of 19
6 out of 19
7 out of 19
8 out of 19
9 out of 19
10 out of 19
11 out of 19
12 out of 19
13 out of 19
14 out of 19
15 out of 19
16 out of 19
17 out of 19
18 out of 19
19 out of 19
2018-11-06 16:24:55.709522

In [4]: myinca.analysis.softcosine_similarity.fit('/media/sf_virtualbox_folder/mymodel', 'nu', 'nu', to_csv=True)
INFO:INCA:The results of the similarity analysis could be inflated when not using the recommended text processing steps (stopword removal, punctuation removal, stemming) beforehand
100%|███████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 266.76it/s]
100%|███████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 384.62it/s]
2018-11-06 16:28:50.159596
2018-11-06 16:29:11.789980
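
For reference, the two runs can be compared by subtracting the printed timestamps. A small sketch, assuming the first and last timestamp of each run mark the start and end of the comparison step (the helper name `elapsed_seconds` is only for illustration):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed_seconds(start, end):
    """Seconds between two timestamps formatted like the log lines above."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()


# Run with days_before=1, days_after=1
windowed = elapsed_seconds("2018-11-06 16:20:57.917742", "2018-11-06 16:24:55.709522")
# Run over all comparisons
full = elapsed_seconds("2018-11-06 16:28:50.159596", "2018-11-06 16:29:11.789980")

print(f"windowed: {windowed:.1f} s, full: {full:.1f} s, ratio: {windowed / full:.1f}x")
```

That works out to roughly 238 seconds for the windowed run against roughly 22 seconds for the full run, so the windowed run is about 11 times slower on these 20 articles.
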
damian0604 commented 6 years ago

The bottleneck indeed seems to be the estimation of the matrix within the for-loop.

Possible approaches:

I'll think about it a bit more...

mariekevh commented 5 years ago

Addressed in PR #453