verdict-project / verdict

Interactive-Speed Analytics: 200x Faster, 200x Fewer Cluster Resources, Approximate Query Processing
http://verdictdb.org
Apache License 2.0

Spark Scramble Appending #374

Open commercial-hippie opened 5 years ago

commercial-hippie commented 5 years ago

I haven't tested scramble creation using Spark, but appending to a scramble generates an invalid query.

The following error occurs:

mismatched input '`partner`' expecting {'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 'REDUCE'} (line 1, pos 61)

caused by the following query (some columns stripped out to reduce the length):

insert into `verdict_scrambles`.`partitioned_flattened_orc` (`partner`,`src`,`ts`,`id`,`verdictdbtier`,`state_code`,`verdictdbblock`) select `partner`,`src`,`ts`,`id`,`verdictdbtier`,`state_code`,`verdictdbblock` from `verdict_scrambles`.`verdictdbtemp_dWH1w8rA`

This is because Spark doesn't support column lists in insert statements.

So instead of:

INSERT into tablename (col1, col2, col3) VALUES ('1', '2','3')

We need to use:

INSERT into tablename VALUES ('1', '2','3')

For now, I've hacked around it to get it working.
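For reference, a sketch of the kind of workaround I mean: rely on Spark's position-based `insertInto`, which never emits a column list, and reorder the SELECT side to match the target schema. The table names are taken from the query above; the rest is illustrative only, not VerdictDB's actual code.

```java
// Illustrative workaround: avoid the generated "(col1, col2, ...)" list by using
// Spark's position-based insertInto and ordering the columns to match the target.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ScrambleAppendWorkaround {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().enableHiveSupport().getOrCreate();

    String target = "verdict_scrambles.partitioned_flattened_orc";
    String temp = "verdict_scrambles.verdictdbtemp_dWH1w8rA";

    // Read the target schema so the SELECT list matches it column for column.
    String[] targetColumns = spark.table(target).columns();

    Dataset<Row> staged = spark.table(temp).selectExpr(targetColumns);

    // insertInto() matches columns by position, so no column list is needed in the SQL.
    staged.write().insertInto(target);
  }
}
```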

(Attached screenshots: Info 1, Info 2)

pyongjoo commented 5 years ago

Thanks for letting us know. Right now, I'm leaning more toward a non-SQL approach for scramble creation. I think Google Dataflow (backed by Apache Beam) could be a good solution for both generality and scalability. FYI, it can also easily work with files directly (as Dan suggested).
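To make that concrete, here is a minimal, purely hypothetical sketch of what such a Beam pipeline could look like: read rows, attach the two scramble columns (verdictdbtier, verdictdbblock), and write the result back out. The paths, block count, and single-tier assumption are placeholders, not a design decision.

```java
// Hypothetical sketch of non-SQL scramble creation with Apache Beam (Java SDK).
// Paths, the block count, and the single-tier assumption are illustrative only.
import java.util.Random;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class BeamScrambleSketch {

  // Appends verdictdbtier and verdictdbblock to each delimited input row.
  static class AssignScrambleColumns extends DoFn<String, String> {
    private final int blockCount;
    private final Random rng = new Random();

    AssignScrambleColumns(int blockCount) {
      this.blockCount = blockCount;
    }

    @ProcessElement
    public void process(@Element String row, OutputReceiver<String> out) {
      int tier = 0;                         // single uniform tier (assumption)
      int block = rng.nextInt(blockCount);  // spread rows uniformly over blocks
      out.output(row + "," + tier + "," + block);
    }
  }

  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    p.apply(TextIO.read().from("gs://example-bucket/source/*.csv"))   // placeholder input
     .apply(ParDo.of(new AssignScrambleColumns(100)))
     .apply(TextIO.write().to("gs://example-bucket/scramble/part"));  // placeholder output

    p.run().waitUntilFinish();
  }
}
```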

Let me know what you think.

voycey commented 5 years ago

I agree that a non-SQL approach is the way forward for this, as much as possible.

https://beam.apache.org/documentation/io/built-in/

BigQuery is a great option, although you will need to be aware of the query costs involved: unless a query is crafted well, costs can spiral out of control pretty easily. When you have 60 TB of data in there, this all of a sudden becomes an expensive query ($300 per query execution).
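(For context on that figure: BigQuery's on-demand pricing at the time was roughly $5 per TB scanned, so a query that scans the full 60 TB comes to about 60 × $5 = $300.)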

Avro & Parquet I/O should definitely also be considered (ORC will be important for us, but its development in Beam seems to have stalled: https://issues.apache.org/jira/browse/BEAM-1861; we are good with BQ for now). That way you avoid any kind of third-party processing, and the results can be moved to whatever query engine is required as a simple file transfer.
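As a rough illustration of the file-based route (again hypothetical, with made-up paths and a toy schema), Beam's ParquetIO can read and write Parquet directly, so a generated scramble would just be files that any engine can pick up:

```java
// Hypothetical sketch: Parquet in, Parquet out via Beam's ParquetIO.
// The schema and paths are placeholders.
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class ParquetScrambleSketch {

  // Toy schema; a real pipeline would use the source table's schema plus the
  // verdictdbtier / verdictdbblock fields.
  private static final String SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"row\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"long\"},"
      + "{\"name\":\"verdictdbtier\",\"type\":\"int\"},"
      + "{\"name\":\"verdictdbblock\",\"type\":\"int\"}]}";

  public static void main(String[] args) {
    Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(ParquetIO.read(schema).from("gs://example-bucket/source/*.parquet"))
     // ... a ParDo that fills in verdictdbtier / verdictdbblock would go here ...
     .apply(FileIO.<GenericRecord>write()
         .via(ParquetIO.sink(schema))
         .to("gs://example-bucket/scramble/")
         .withSuffix(".parquet"));

    p.run().waitUntilFinish();
  }
}
```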

pyongjoo commented 5 years ago

For BQ, I think the Verdict team will use flat-rate pricing ($10K/month) in the future, so its users can basically run unlimited queries on unlimited-scale data. But this is still down the road.

voycey commented 5 years ago

I don't know how that would work; custodial transfer of data isn't a good idea! It should be cheap enough in BQ vs. Spark if the queries are designed correctly.

pyongjoo commented 5 years ago

Your datasets in BQ can be accessed by a third-party service if you grant access: https://cloud.google.com/bigquery/docs/share-access-views

I'm double-checking a billing-related question, i.e., which party is charged for queries.

voycey commented 5 years ago

I think for data provenance reasons this won't be possible, especially in light of the new privacy laws coming in from California.

We are happy to pay for the processing of our own data; we just need the tools to be able to do it :)

pyongjoo commented 5 years ago

@voycey It would be great if you could elaborate on the data provenance issue. My naive thought was that if Google allows the operation (e.g., viewing data without copying it), it must be legal (under the assumption that the data provider grants access).

Regarding partitioning, I think the easiest way is to let you specify partition columns for scrambles. Currently, scrambles inherit the partitioning of the original table (so the number of partitions explodes in combination with verdictdbblock). I believe a combination of (date, verdictdbblock) should be good enough for ensuring speed (without state).
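As a purely hypothetical illustration (the table name, column list, and date column are made up), a (date, verdictdbblock)-partitioned scramble in Spark might be declared like this:

```java
// Hypothetical DDL for a scramble partitioned by (date, verdictdbblock);
// names are placeholders, not VerdictDB's generated statement.
import org.apache.spark.sql.SparkSession;

public class PartitionedScrambleExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().enableHiveSupport().getOrCreate();

    spark.sql(
        "CREATE TABLE IF NOT EXISTS verdict_scrambles.example_scramble ("
        + " id STRING,"
        + " state_code STRING,"
        + " verdictdbtier INT"
        + ") PARTITIONED BY (dt DATE, verdictdbblock INT)"
        + " STORED AS ORC");
  }
}
```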

In the future, we can reduce the total number of verdictdbblock values down to 10 or so by introducing a cleverer scheme, but it will take some time (since we are transitioning...).

voycey commented 5 years ago

There are many use cases where this would be great. However, when dealing with user information, allowing third-party access to the data not only opens up a secondary attack vector but also technically shares information without notifying the users, which would breach the GDPR and the CCPA (I'm sure there are loads of other reasons why this wouldn't be allowed as well, but those are the first that pop into my head).

I think if we can just handle the partitioning by date, that will be great for this; as the data is only 5% of the total, there is little need for a fully granular partitioning scheme.

In the meantime, generating scrambles (in whatever fashion) in BigQuery would be great: BQ simply handles the capacity requirements, and they would probably complete very quickly!