semanticize / semanticizer

Entity Linking for the masses

http://semanticize.uva.nl/

GNU General Public License v3.0

get_articles is slow (and fetching inlinks/outlinks seem the ones to blame) #43

Open graus opened 10 years ago

graus commented 10 years ago

Retrieving inlinks & outlinks of Wikipedia pages is very slow: a single query with a couple of hundred IDs can easily exceed 20 seconds (I'm timing everything inside get_articles(self, *pids)).

I don't understand why: everything is in Redis, requests seem quick from redis-client, and the requests are similar to the ones that fetch the IDs' labels (and the label requests are fast, < 0.3 seconds for the same number of IDs). How could this be?

larsmans commented 10 years ago

Did you try line_profiler on the function? That should show the problem, if it is CPU-bound.
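Before profiling line by line, a rough check of whether the call is CPU-bound at all could compare wall-clock time with CPU time. This is only a sketch using the standard library (not line_profiler), and `time_call` is a hypothetical helper name, not something in the semanticizer code base:

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (wall_seconds, cpu_seconds, result).

    wall_seconds includes time spent waiting on I/O (e.g. Redis replies);
    cpu_seconds only counts CPU time used by this Python process.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = func(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu, result


# Hypothetical usage, assuming `model` is the object that owns get_articles:
#   wall, cpu, articles = time_call(model.get_articles, *pids)
#   print(f"wall={wall:.2f}s cpu={cpu:.2f}s")
```

If the CPU time is close to the wall time, the work is mostly in Python and a line profiler should pinpoint it. If the wall time is much larger, most of the time is spent waiting, for example on Redis round trips.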

graus commented 10 years ago

I haven't yet, will take a look, thanks! I did experiment with setting a hard limit on the number of links to retrieve, and that does make a difference (i.e., with a limit of 500 items, I get a speed increase of around a factor of two).

graus commented 10 years ago

(which leads me to believe redis doesn't handle the large values well)

dodijk commented 10 years ago

@bartsidee got some nice speed improvement with Redis pipelines:

By default, pipelines in redis python are atomic, i.e. a blocking request. I set the pipelines to non-atomic mode, pipeline(transaction=False), and this helped improve performance on Redis quite a lot (in cases where multiple requests are fired off).
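For reference, a minimal sketch of what a non-transactional pipeline could look like for fetching links in bulk. The key layout (`inlinks:<pid>`), the assumption that links are stored as Redis lists, and the `fetch_links` helper are all hypothetical and not taken from the semanticizer code; only `pipeline(transaction=False)`, `lrange` and `execute` are redis-py calls. The optional `limit` mirrors the hard cap on links tried earlier in this thread.

```python
def fetch_links(client, pids, prefix="inlinks", limit=None):
    """Fetch link lists for many page ids with one pipelined batch.

    client: a redis-py client (e.g. redis.Redis()), passed in by the caller.
    prefix: hypothetical key prefix; keys are assumed to be "<prefix>:<pid>".
    limit:  optional cap (>= 1) on the number of links returned per page.
    """
    if limit is not None and limit < 1:
        raise ValueError("limit must be >= 1 or None")
    pids = list(pids)
    # transaction=False sends the commands as a plain batch,
    # without wrapping them in MULTI/EXEC.
    pipe = client.pipeline(transaction=False)
    stop = -1 if limit is None else limit - 1  # LRANGE stop index is inclusive
    for pid in pids:
        pipe.lrange(f"{prefix}:{pid}", 0, stop)
    results = pipe.execute()  # one reply per queued command, in order
    return dict(zip(pids, results))


# Hypothetical usage:
#   import redis
#   client = redis.Redis()
#   inlinks = fetch_links(client, pids, prefix="inlinks", limit=500)
```

The point of the pipeline is that all the `LRANGE` commands go out together instead of paying one network round trip per page id. Whether this helps as much here depends on how the links are actually stored.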