osome-iu / hoaxy-backend

Backend component for Hoaxy, a tool to visualize the spread of claims and fact checking
http://hoaxy.iuni.iu.edu/
GNU General Public License v3.0

Speed of query retrieval #10

Closed filmenczer closed 6 years ago

filmenczer commented 6 years ago

We noticed that the search from the article search engine (Lucene) is often very slow. Can you please do some tests with random queries? We should find the bottleneck: is it Lucene? the database? the API? the network?

shaochengcheng commented 6 years ago

Hi Fil,

I tested the API flow with random words selected from article titles. Here is the time elapsed in each stage of the API flow (in seconds, 100 rounds):

| stage | description | time (s) |
| --- | --- | --- |
| t0_lucene_query | query articles from Lucene | 0.554636 |
| t1_article_filtering | filter out disabled sites | 0.004863 |
| t2_article_sharing | query Twitter sharings of the article | 11.400047 |
| t3_network_building_old | build network with the old API | 21.456027 |
| t4_network_building_new | build network with the new API | 15.631910 |
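A minimal sketch of how per-stage timings like these could be collected; the stage functions below are hypothetical stand-ins, not the real Lucene/database calls:

```python
import time

def time_stages(stages):
    """Run each (name, fn) pair and record its wall-clock duration in seconds."""
    timings = {}
    for name, fn in stages:
        start = time.perf_counter()
        fn()
        timings[name] = time.perf_counter() - start
    return timings

# Hypothetical stand-ins; in the real test these would be the Lucene query,
# the article filter, the sharing query, and the two network builders.
stages = [
    ("t0_lucene_query", lambda: time.sleep(0.01)),
    ("t1_article_filtering", lambda: None),
]
print(time_stages(stages))
```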

Please note that this test was run on the server directly, without the Mashape middleware, and covers only the backend data flow, not the front-end.

The sum of the first three items corresponds to the first step in the front-end, and t3/t4 correspond to the second step. As you can see, Lucene itself is really fast; the problem is that the database query takes tens of seconds. Note also that the new network API does perform better.

A possible solution could be indexing and partitioning the database. However, I am not a database expert, so unfortunately I cannot make much progress on the performance side.

Thanks Chengcheng

filmenczer commented 6 years ago

Thank you @shaochengcheng, that explains it clearly --- I had attributed the delay of the first phase to Lucene, when in fact it is the retrieval of the tweets.

I wonder if we could speed up tweet retrieval by better indexing. @glciampaglia can we discuss this?

Also, I understand that the network API is faster now. Thank you for that too! I expected a larger speedup, though, because I thought the network API now uses the edge table (per issue #4). Is that a separate issue still being worked on?

glciampaglia commented 6 years ago

Thank you @shaochengcheng for running this analysis. This explains the bottleneck perfectly. I think that adding an index for the article_sharing query could speed things up significantly, as happened with the Botometer database. I can work on it. Could you please point me to the source code of the article_sharing query? What about the new API? Is it also an SQL query, or are you still parsing things in Python? Perhaps we could add indexes there too.

@filmenczer let's talk about this on Monday, if you are around.

filmenczer commented 6 years ago

@shaochengcheng -- the table ass_tweet_url had an index on (tweet_id, url_id). When querying by url_id, that index could not be used, so the query was slow. Giovanni and I created a new index on (url_id, tweet_id). This way, a query by url_id is executed as an index scan and is MUCH faster!!!

Please update the code that creates the table to add this new index, then you can close this issue. Thanks!

shaochengcheng commented 6 years ago

Hi @filmenczer and @glciampaglia

I am not sure whether we need an extra index on table ass_tweet_url, because a unique constraint is already placed on it at creation time. Let us look at the table info:

hoaxy=> \d+ ass_tweet_url
                                      Table "public.ass_tweet_url"
  Column  |  Type   |                         Modifiers                          | Storage | Description
----------+---------+------------------------------------------------------------+---------+-------------
 id       | integer | not null default nextval('ass_tweet_url_id_seq'::regclass) | plain   |
 tweet_id | integer |                                                            | plain   |
 url_id   | integer |                                                            | plain   |
Indexes:
    "ass_tweet_url_pkey" PRIMARY KEY, btree (id)
    "tweet_url_uq" UNIQUE, btree (tweet_id, url_id)
    "url_tweet" btree (url_id, tweet_id)
Foreign-key constraints:
    "ass_tweet_url_tweet_id_fkey" FOREIGN KEY (tweet_id) REFERENCES tweet(id) ON UPDATE CASCADE ON DELETE CASCADE
    "ass_tweet_url_url_id_fkey" FOREIGN KEY (url_id) REFERENCES url(id) ON UPDATE CASCADE ON DELETE CASCADE
Has OIDs: no

As you can see, an index is already there: "tweet_url_uq" UNIQUE, btree (tweet_id, url_id). And according to the PostgreSQL docs:

One should, however, be aware that there's no need to manually create indexes on unique columns; doing so would just duplicate the automatically-created index.

Thus I think table ass_tweet_url does not need a manual index.

Am I right?

Thanks Chengcheng

filmenczer commented 6 years ago

Giovanni will answer more definitively, but as I recall, the unique index on (tweet_id, url_id) could not be used for lookups by url_id alone.

So I think that the index was needed.

glciampaglia commented 6 years ago

The index is composite, so when you look up a row by URL ID alone you are doing a partial lookup. However, with b-tree indexes (like the one in that table) a partial lookup only works if it uses the leftmost part of the index. In other words, the index was being used when the lookup was by tweet ID, but not the other way round. Adding another index on (url_id, tweet_id) does the trick.
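The leftmost-prefix rule can be demonstrated in a self-contained way with SQLite (an assumption for portability; PostgreSQL b-tree indexes behave the same way for this case), using the same table and index names as ass_tweet_url:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE ass_tweet_url ("
    " id INTEGER PRIMARY KEY, tweet_id INTEGER, url_id INTEGER)"
)
# The unique constraint from the real schema: a b-tree on (tweet_id, url_id).
cur.execute("CREATE UNIQUE INDEX tweet_url_uq ON ass_tweet_url (tweet_id, url_id)")

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail).
    return cur.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[3]

# Leftmost column of the composite index: the planner can seek (SEARCH).
plan_by_tweet = plan("SELECT * FROM ass_tweet_url WHERE tweet_id = 1")
# Non-leftmost column: the planner falls back to a full SCAN.
plan_by_url = plan("SELECT * FROM ass_tweet_url WHERE url_id = 1")

# The fix from this thread: a reversed composite index.
cur.execute("CREATE INDEX url_tweet ON ass_tweet_url (url_id, tweet_id)")
plan_by_url_fixed = plan("SELECT * FROM ass_tweet_url WHERE url_id = 1")

print(plan_by_tweet)
print(plan_by_url)
print(plan_by_url_fixed)
```

Lookups by tweet_id show a SEARCH, lookups by url_id show a SCAN until url_tweet is created, after which they become a SEARCH as well.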

glciampaglia commented 6 years ago

Btw, Clayton pointed out that hash indexes would be even faster than b-tree indexes for these equality lookups. I am not sure we need the extra speed at the moment, though.

glciampaglia commented 6 years ago

Closed via 3dea3217e94bef6f3227638d555757a3155ca54b