verdict-project / verdict

Interactive-Speed Analytics: 200x Faster, 200x Fewer Cluster Resources, Approximate Query Processing
http://verdictdb.org
Apache License 2.0
248 stars 66 forks source link

Incorrect caclutation for join case #387

Open yanshuyuan opened 5 years ago

yanshuyuan commented 5 years ago

First, I create the test data

CREATE TABLE ltp.sales ("user_id int,  product   varchar(100),  price     double);  
CREATE TABLE ltp.profile (user_id int, gender varchar(2), age int);

Then I mock some data for them

List<String> productList = Arrays.asList("milk", "egg", "juice");
        for (int i = 0; i < 300000; i++) {
            int randInt = ThreadLocalRandom.current().nextInt(0, 3);
            String product = productList.get(randInt);
            double price = (randInt + 2) * 10
                    + ThreadLocalRandom.current().nextInt(0, 10);
            stmt.execute(String
                    .format("INSERT INTO ltp.sales (user_id, product, price) VALUES(%d, '%s', %.0f)",
                            i, product, price));
        }

List<String> genderList = Arrays.asList("F", "M");
        for (int i = 0; i < 300000; i++) {
            int randInt = ThreadLocalRandom.current().nextInt(0,
                    genderList.size());
            String gender = genderList.get(randInt);
            int age = ((randInt + 2) * 10
                    + ThreadLocalRandom.current().nextInt(0, 10)) % 5;
            stmt.execute(String
                    .format("INSERT INTO ltp.profile (user_id, gender, age) VALUES(%d, '%s', %d)",
                            i, gender, age));
        }

Now, wo create the scramble:

CREATE SCRAMBLE ltp.profile_uniform_scrambled from ltp.profile METHOD UNIFORM ON user_id RATIO 0.2 BLOCKSIZE 1000
CREATE SCRAMBLE ltp.profile_hash_scrambled from ltp.profile METHOD HASH ON user_id RATIO 0.2 BLOCKSIZE 1000;
CREATE SCRAMBLE ltp.sales_uniform_scrambled from ltp.sales METHOD UNIFORM ON user_id RATIO 0.2 BLOCKSIZE 1000;
CREATE SCRAMBLE ltp.sales_hash_scrambled from ltp.sales METHOD HASH ON user_id RATIO 0.2 BLOCKSIZE 1000;

I start to estimate:

SELECT 
        product, 
        SUM(price) as total_price
    FROM (
        SELECT
            product,
            user_id,
            SUM(price) as price
        FROM 
            ltp.sales
        GROUP BY
            product, 
            user_id
        HAVING
            SUM(price) > 10
    ) t1 INNER JOIN (
        SELECT 
            user_id
        FROM 
            ltp.profile
        WHERE 
            age IN (1, 2, 3)
            AND 
            gender = 'F'
    ) t2
    ON t1.user_id = t2.user_id
    GROUP BY product

the result gap is so large, that's why?

pyongjoo commented 5 years ago

Sorry for the late reply. To support complex queries in a more principled way, we are working on a new system.