groovenauts / QueryItSmart

QueryIt Smart is the demonstration application for BigQuery & Cloud Machine Learning.
MIT License
86 stars 16 forks source link

Quality improvement on the document search #38

Closed kazunori279 closed 7 years ago

kazunori279 commented 7 years ago

The quality of search result isn't great. Can you please fix these?

nagachika commented 7 years ago

"from 3 million" is correct. Currently document search query over the table queryit-smart:stackoverflow.top3M_posts which contains 3 million highest score posts. And it's using TF-IDF to calculate similarity. May I reduce the posts on the table to eliminate low scored posts?

kazunori279 commented 7 years ago

We want to show the performance on 3M rows, but want to improve the quality of posts as well :) So, please keep using 3M posts table and just select high quality posts (that is in the top 100K posts) by specifying a threshold on view count.

nagachika commented 7 years ago

OK I understand it. I will prepare some BigQuery tables and tweak on the query.

nagachika commented 7 years ago

I used score to filter the posts. The view_counts seems better? Anyway, I'm going to prepare the tables based on each of them. We will try which seems better.

kazunori279 commented 7 years ago

I didn't try score, and view_counts worked for me.

nagachika commented 7 years ago

Done by #46 (Show only top 100K view count posts). I deployed the app to demo environment. The query tend to take longer, about 70 sec.

nagachika commented 7 years ago

My fault. It took about 40 sec now. The BigQuery performance may be depressed temporary at that time. @kazunori279 Is it OK? or reduce records to calculate similarites to speed up?

kazunori279 commented 7 years ago

I just tried running the following query (the original query I wrote for gcp-samples2 project) and getting the result in around 20sec from 10M rows every time. What's the difference with your query...?

CREATE TEMPORARY FUNCTION calc_similarity(tf_idf_json_0 STRING, tf_idf_json_1 STRING) RETURNS FLOAT64 LANGUAGE js AS """

// parse JSON to extract tf_idf var tf_idf_0 = JSON.parse(tf_idf_json_0); var tf_idf_1 = JSON.parse(tf_idf_json_1);

// calculate cosine similarity var similarity = 0; for (word in tf_idf_0) { var t0 = tf_idf_0[word] ? Number(tf_idf_0[word]) : 0; var t1 = tf_idf_1[word] ? Number(tf_idf_1[word]) : 0; similarity += t0 * t1; }

return similarity; """;

SELECT title, body, tags, similarity FROM ( SELECT t1.id, calc_similarity(tf_idf_0, t1.tf_idf) AS similarity FROM ( SELECT tf_idf AS tf_idf_0 FROM gcp-samples2.stackoverflow_demo.top10M_posts_tf_idf AS t0 WHERE id = 92082 ) CROSS JOIN gcp-samples2.stackoverflow_demo.top10M_posts_tf_idf AS t1 ORDER BY similarity DESC LIMIT 10 ) JOIN gcp-samples2.stackoverflow_demo.top10M_posts AS t2 USING (id)
ORDER BY similarity DESC

nagachika commented 7 years ago

It's interesting. Just replacing the first two tables to refer the table in queryit-smart project (gcp-samples2.stackoverflow_demo.top10M_posts_tf_idf -> queryit-smart.stackoverflow.top3M_posts_tf_idf) slow down the query to take around 60sec. Weired.

CREATE TEMPORARY FUNCTION calc_similarity(tf_idf_json_0 STRING, tf_idf_json_1 STRING)
RETURNS FLOAT64
LANGUAGE js AS """

// parse JSON to extract tf_idf
var tf_idf_0 = JSON.parse(tf_idf_json_0);
var tf_idf_1 = JSON.parse(tf_idf_json_1);

// calculate cosine similarity
var similarity = 0;
for (word in tf_idf_0) {
var t0 = tf_idf_0[word] ? Number(tf_idf_0[word]) : 0;
var t1 = tf_idf_1[word] ? Number(tf_idf_1[word]) : 0;
similarity += t0 * t1;
}

return similarity;
""";

SELECT
title,
body,
tags,
similarity
FROM
(
SELECT
t1.id,
calc_similarity(tf_idf_0, t1.tf_idf) AS similarity
FROM
(
SELECT tf_idf AS tf_idf_0
FROM `queryit-smart.stackoverflow.top3M_posts_tf_idf` AS t0
WHERE id = 92082
)
CROSS JOIN
`queryit-smart.stackoverflow.top3M_posts_tf_idf` AS t1
ORDER BY
similarity DESC
LIMIT
10
)
JOIN
`gcp-samples2.stackoverflow_demo.top10M_posts` AS t2
USING (id)
ORDER BY
similarity DESC
nagachika commented 7 years ago

Anyway using gcp-samples2.stackoverflow_demo.top10M_posts_tf_idf for StackOverflow search tend to archive good performance. I will investigate deeper for a while.

kazunori279 commented 7 years ago

great, thanks!!

nagachika commented 7 years ago

Change to search over 10M posts at https://github.com/groovenauts/QueryItSmart/commit/1818da079995e4b38768eb35aba2582605769fab.

I think it's depend on table volume. BigQuery seems spin up workers based on data volume. I also changed the record numbers in the descriptions.

kazunori279 commented 7 years ago

Now the search performance and result quality looks much better. Thanks so much!!