Quality improvement on the document search

kazunori279 commented 7 years ago

The quality of search result isn't great. Can you please fix these?

The description says "from 30 million documents" but the search result says "from 3 million". Which is correct?
Show results only for the top 10K posts? (the posts that has the highest views). We keep the row number of source table to 30M or 3M.
Can you try my TF-IDF sample and compare the results? It looks my sample has better quality. https://github.com/kazunori279/cloudml-demo/blob/master/Similarity%2Bsearch%2Bon%2BStack%2BOverflow.ipynb
Can you move the "Finished. Processed..." to the search result page? It's not a kind of info that should be shown as a dialog, and yet it's important to show.
Can you change scaling of the score? It's too low for all results.

nagachika commented 7 years ago

"from 3 million" is correct. Currently document search query over the table queryit-smart:stackoverflow.top3M_posts which contains 3 million highest score posts. And it's using TF-IDF to calculate similarity. May I reduce the posts on the table to eliminate low scored posts?

kazunori279 commented 7 years ago

We want to show the performance on 3M rows, but want to improve the quality of posts as well :) So, please keep using 3M posts table and just select high quality posts (that is in the top 100K posts) by specifying a threshold on view count.

nagachika commented 7 years ago

OK I understand it. I will prepare some BigQuery tables and tweak on the query.

nagachika commented 7 years ago

I used score to filter the posts. The view_counts seems better? Anyway, I'm going to prepare the tables based on each of them. We will try which seems better.

kazunori279 commented 7 years ago

I didn't try score, and view_counts worked for me.

nagachika commented 7 years ago

Done by #46 (Show only top 100K view count posts). I deployed the app to demo environment. The query tend to take longer, about 70 sec.

nagachika commented 7 years ago

My fault. It took about 40 sec now. The BigQuery performance may be depressed temporary at that time. @kazunori279 Is it OK? or reduce records to calculate similarites to speed up?

kazunori279 commented 7 years ago

I just tried running the following query (the original query I wrote for gcp-samples2 project) and getting the result in around 20sec from 10M rows every time. What's the difference with your query...?

CREATE TEMPORARY FUNCTION calc_similarity(tf_idf_json_0 STRING, tf_idf_json_1 STRING) RETURNS FLOAT64 LANGUAGE js AS """

// parse JSON to extract tf_idf var tf_idf_0 = JSON.parse(tf_idf_json_0); var tf_idf_1 = JSON.parse(tf_idf_json_1);

// calculate cosine similarity var similarity = 0; for (word in tf_idf_0) { var t0 = tf_idf_0[word] ? Number(tf_idf_0[word]) : 0; var t1 = tf_idf_1[word] ? Number(tf_idf_1[word]) : 0; similarity += t0 * t1; }

return similarity; """;

SELECT title, body, tags, similarity FROM ( SELECT t1.id, calc_similarity(tf_idf_0, t1.tf_idf) AS similarity FROM ( SELECT tf_idf AS tf_idf_0 FROM gcp-samples2.stackoverflow_demo.top10M_posts_tf_idf AS t0 WHERE id = 92082 ) CROSS JOIN gcp-samples2.stackoverflow_demo.top10M_posts_tf_idf AS t1 ORDER BY similarity DESC LIMIT 10 ) JOIN gcp-samples2.stackoverflow_demo.top10M_posts AS t2 USING (id)
ORDER BY similarity DESC

nagachika commented 7 years ago

It's interesting. Just replacing the first two tables to refer the table in queryit-smart project (gcp-samples2.stackoverflow_demo.top10M_posts_tf_idf -> queryit-smart.stackoverflow.top3M_posts_tf_idf) slow down the query to take around 60sec. Weired.

CREATE TEMPORARY FUNCTION calc_similarity(tf_idf_json_0 STRING, tf_idf_json_1 STRING)
RETURNS FLOAT64
LANGUAGE js AS """

// parse JSON to extract tf_idf
var tf_idf_0 = JSON.parse(tf_idf_json_0);
var tf_idf_1 = JSON.parse(tf_idf_json_1);

// calculate cosine similarity
var similarity = 0;
for (word in tf_idf_0) {
var t0 = tf_idf_0[word] ? Number(tf_idf_0[word]) : 0;
var t1 = tf_idf_1[word] ? Number(tf_idf_1[word]) : 0;
similarity += t0 * t1;
}

return similarity;
""";

SELECT
title,
body,
tags,
similarity
FROM
(
SELECT
t1.id,
calc_similarity(tf_idf_0, t1.tf_idf) AS similarity
FROM
(
SELECT tf_idf AS tf_idf_0
FROM `queryit-smart.stackoverflow.top3M_posts_tf_idf` AS t0
WHERE id = 92082
)
CROSS JOIN
`queryit-smart.stackoverflow.top3M_posts_tf_idf` AS t1
ORDER BY
similarity DESC
LIMIT
10
)
JOIN
`gcp-samples2.stackoverflow_demo.top10M_posts` AS t2
USING (id)
ORDER BY
similarity DESC

nagachika commented 7 years ago

Anyway using gcp-samples2.stackoverflow_demo.top10M_posts_tf_idf for StackOverflow search tend to archive good performance. I will investigate deeper for a while.

kazunori279 commented 7 years ago

great, thanks!!

nagachika commented 7 years ago

Change to search over 10M posts at https://github.com/groovenauts/QueryItSmart/commit/1818da079995e4b38768eb35aba2582605769fab.

I think it's depend on table volume. BigQuery seems spin up workers based on data volume. I also changed the record numbers in the descriptions.

kazunori279 commented 7 years ago

Now the search performance and result quality looks much better. Thanks so much!!

groovenauts / QueryItSmart

Quality improvement on the document search #38