Closed kazunori279 closed 7 years ago
"from 3 million" is correct. Currently document search query over the table queryit-smart:stackoverflow.top3M_posts which contains 3 million highest score posts. And it's using TF-IDF to calculate similarity. May I reduce the posts on the table to eliminate low scored posts?
We want to show the performance on 3M rows, but want to improve the quality of posts as well :) So, please keep using 3M posts table and just select high quality posts (that is in the top 100K posts) by specifying a threshold on view count.
OK I understand it. I will prepare some BigQuery tables and tweak on the query.
I used score
to filter the posts. The view_counts
seems better?
Anyway, I'm going to prepare the tables based on each of them. We will try which seems better.
I didn't try score
, and view_counts
worked for me.
Done by #46 (Show only top 100K view count posts). I deployed the app to demo environment. The query tend to take longer, about 70 sec.
My fault. It took about 40 sec now. The BigQuery performance may be depressed temporary at that time. @kazunori279 Is it OK? or reduce records to calculate similarites to speed up?
I just tried running the following query (the original query I wrote for gcp-samples2 project) and getting the result in around 20sec from 10M rows every time. What's the difference with your query...?
CREATE TEMPORARY FUNCTION calc_similarity(tf_idf_json_0 STRING, tf_idf_json_1 STRING) RETURNS FLOAT64 LANGUAGE js AS """
// parse JSON to extract tf_idf var tf_idf_0 = JSON.parse(tf_idf_json_0); var tf_idf_1 = JSON.parse(tf_idf_json_1);
// calculate cosine similarity var similarity = 0; for (word in tf_idf_0) { var t0 = tf_idf_0[word] ? Number(tf_idf_0[word]) : 0; var t1 = tf_idf_1[word] ? Number(tf_idf_1[word]) : 0; similarity += t0 * t1; }
return similarity; """;
SELECT
title,
body,
tags,
similarity
FROM
(
SELECT
t1.id,
calc_similarity(tf_idf_0, t1.tf_idf) AS similarity
FROM
(
SELECT tf_idf AS tf_idf_0
FROM gcp-samples2.stackoverflow_demo.top10M_posts_tf_idf
AS t0
WHERE id = 92082
)
CROSS JOIN
gcp-samples2.stackoverflow_demo.top10M_posts_tf_idf
AS t1
ORDER BY
similarity DESC
LIMIT
10
)
JOIN
gcp-samples2.stackoverflow_demo.top10M_posts
AS t2
USING (id)
ORDER BY
similarity DESC
It's interesting. Just replacing the first two tables to refer the table in queryit-smart project (gcp-samples2.stackoverflow_demo.top10M_posts_tf_idf
-> queryit-smart.stackoverflow.top3M_posts_tf_idf
) slow down the query to take around 60sec. Weired.
CREATE TEMPORARY FUNCTION calc_similarity(tf_idf_json_0 STRING, tf_idf_json_1 STRING)
RETURNS FLOAT64
LANGUAGE js AS """
// parse JSON to extract tf_idf
var tf_idf_0 = JSON.parse(tf_idf_json_0);
var tf_idf_1 = JSON.parse(tf_idf_json_1);
// calculate cosine similarity
var similarity = 0;
for (word in tf_idf_0) {
var t0 = tf_idf_0[word] ? Number(tf_idf_0[word]) : 0;
var t1 = tf_idf_1[word] ? Number(tf_idf_1[word]) : 0;
similarity += t0 * t1;
}
return similarity;
""";
SELECT
title,
body,
tags,
similarity
FROM
(
SELECT
t1.id,
calc_similarity(tf_idf_0, t1.tf_idf) AS similarity
FROM
(
SELECT tf_idf AS tf_idf_0
FROM `queryit-smart.stackoverflow.top3M_posts_tf_idf` AS t0
WHERE id = 92082
)
CROSS JOIN
`queryit-smart.stackoverflow.top3M_posts_tf_idf` AS t1
ORDER BY
similarity DESC
LIMIT
10
)
JOIN
`gcp-samples2.stackoverflow_demo.top10M_posts` AS t2
USING (id)
ORDER BY
similarity DESC
Anyway using gcp-samples2.stackoverflow_demo.top10M_posts_tf_idf
for StackOverflow search tend to archive good performance. I will investigate deeper for a while.
great, thanks!!
Change to search over 10M posts at https://github.com/groovenauts/QueryItSmart/commit/1818da079995e4b38768eb35aba2582605769fab.
I think it's depend on table volume. BigQuery seems spin up workers based on data volume. I also changed the record numbers in the descriptions.
Now the search performance and result quality looks much better. Thanks so much!!
The quality of search result isn't great. Can you please fix these?