verdict-project / verdict

Interactive-Speed Analytics: 200x Faster, 200x Fewer Cluster Resources, Approximate Query Processing
http://verdictdb.org
Apache License 2.0

Feature Request: Support directly operating on files #372

Open voycey opened 5 years ago

voycey commented 5 years ago

One thing we have noticed during our Big Data Journey is that while DB / Query engines are a convenient way to get access to data, if you want to do any kind of meaningful heavy lifting it is sometimes better to skip them altogether.

One thing we have been working on lately is directly operating on ORC files via Spark.

For example - doing a simple table copy from an unpartitioned table into a partitioned table (insert into <x> select * from <y> or load data inpath ...) with the size of data and number of partitions we have (50 x 90 = 4,500 partitions) via Hive with several well-powered nodes (16 CPUs x 104 GB memory) takes days to complete (even with stats generation turned off for inserted rows!).

However looping the same query in Spark over a dataset like this:

val day = spark.read.orc("gs://pipeline/unpartitioned/ts_date=2019-02-15/state_code=CA/")

day.write
  .option("orc.bloom.filter.columns", "id,place_cat,place_name,city")
  .option("orc.compress", "ZLIB")
  .option("orc.compress.size", "262144")
  .option("orc.create.index", "true")
  .option("orc.row.index.stride", "50000")
  .orc("gs://pipeline/partitioned/ts_date=2019-02-15/state_code=CA/")

This finishes in hours on a very small and basic 3-node standard cluster.
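The surrounding loop is roughly the following (a sketch: the date and state lists are illustrative placeholders, while the paths and writer options are the ones from the snippet above):

// Sketch of the per-partition loop; run in spark-shell, where `spark` is predefined.
val dates  = Seq("2019-02-15", "2019-02-16")   // ~90 days in practice
val states = Seq("CA", "NY")                   // ~50 states in practice

for (d <- dates; s <- states) {
  val day = spark.read.orc(s"gs://pipeline/unpartitioned/ts_date=$d/state_code=$s/")
  day.write
    .option("orc.bloom.filter.columns", "id,place_cat,place_name,city")
    .option("orc.compress", "ZLIB")
    .option("orc.compress.size", "262144")
    .option("orc.create.index", "true")
    .option("orc.row.index.stride", "50000")
    .orc(s"gs://pipeline/partitioned/ts_date=$d/state_code=$s/")
}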

As scramble generation is basically a transformation of the data in one table into another - avoiding the overhead of these query engines would go a long way towards making this more efficient.

Currently most of our plans for this have involved constraining the scramble generation somehow - whilst it is possible to get the scrambles created using a reduced partitioning scheme, it might not be necessary if you can take full advantage of an MPP system such as Spark directly (and not rely on the HiveContext).

What do you think?

voycey commented 5 years ago

Naturally this wouldn't just be limited to ORC. Spark supports all sorts of writers from Parquet and Avro to even CSV and JSON files!
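For example, with the same day DataFrame as above (paths are illustrative; the Avro writer assumes the spark-avro module is on the classpath):

// The same dataset written through other Spark writers (sketch only).
day.write.parquet("gs://pipeline/partitioned-parquet/ts_date=2019-02-15/state_code=CA/")
day.write.format("avro").save("gs://pipeline/partitioned-avro/ts_date=2019-02-15/state_code=CA/")
day.write.option("header", "true").csv("gs://pipeline/partitioned-csv/ts_date=2019-02-15/state_code=CA/")
day.write.json("gs://pipeline/partitioned-json/ts_date=2019-02-15/state_code=CA/")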

pyongjoo commented 5 years ago

Thank you for the suggestion! I agree with you on the usefulness of directly operating on files, but it won't always be magical (in comparison to SQL-based approaches).

I see several orthogonal things in your comment:

  1. A possibility of operating directly on files (either compressed or raw): this will be very useful. We keep debating common use cases, and we will definitely work on this.
  2. Spark is faster than Hive: I agree, but it's mainly due to Spark vs Hive (or Hive's legacy Hadoop engine), not due to Spark SQL vs Spark. Spark SQL is eventually translated into Spark's map-reduce functions. Also, Spark SQL's external tables simply work on files (e.g., ORC, Parquet, CSV); see the sketch after this list.
  3. Operating on files may allow a larger number of partitions: if so, it will be useful, but there may be a negative performance impact that I don't yet understand well.
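As a minimal sketch of point 2, an external table over files and a direct file read end up doing the same scans (paths here are illustrative, borrowed from the example above):

// Spark SQL external table over ORC files vs. the direct DataFrame read (spark-shell sketch).
spark.sql("""
  CREATE TABLE IF NOT EXISTS pings_ext
  USING ORC
  LOCATION 'gs://pipeline/unpartitioned/ts_date=2019-02-15/state_code=CA/'
""")
val viaSql = spark.sql("SELECT COUNT(*) FROM pings_ext").collect()

// Equivalent direct read of the same files.
val viaFiles = spark.read
  .orc("gs://pipeline/unpartitioned/ts_date=2019-02-15/state_code=CA/")
  .count()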

Eventually, I think Verdict should support direct data loads from your local disk, Google Cloud Storage, etc., without requiring all the messy data transformation from you.

voycey commented 5 years ago

Sorry I just want to clarify - I am only talking about scramble creation here - not necessarily running Verdict directly on files as its own "Query Engine".

I find that Tez (the default engine underneath Hive in more recent versions) is not great at memory management compared to MR and Spark as execution engines; we see jobs fail much more regularly - even jobs that have previously completed on the same hardware. The only reliable way of processing anything on our kind of datasets is by trying to actually "reduce" the amount of processing that has to be done.

For example - to convert to ORC at the moment, the "easiest" way to do this in terms of effort is to create an external table on the CSV files, create an ORC table, and then insert into orc select * from csv. Spark takes a bit more effort to get set up initially but can process the files much faster on a single node than Hive can using a multi-node cluster as described above.
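Once the job is set up, the Spark side of that conversion is roughly this (a sketch; the paths and the schema-inference option are illustrative):

// CSV -> ORC conversion in plain Spark, skipping the Hive external-table step (spark-shell sketch).
spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("gs://pipeline/raw-csv/")
  .write
  .option("orc.compress", "ZLIB")
  .orc("gs://pipeline/orc/")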

I think for the time being, a way to simply create a scramble from an RDD / DataFrame and be able to write those to ORC files would be a huge step (something like the sketch below) - I don't think it would be too much effort to develop (we had a brief try but we have deadlines to meet), as all the code to do this is basically there, it just uses HiveContext to materialise them. If we can do the computationally difficult stuff such as creating scrambles in Spark and then keep the SQL interface so it can do all of the query translation, that would be perfect :)
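For illustration only - this is not VerdictDB's actual scramble builder - a uniform scramble built in plain Spark could look something like the following, where the verdictdbblock column name and the block count are assumptions:

// Sketch: tag each row with a random block number and write partitioned ORC (spark-shell).
import org.apache.spark.sql.functions.{col, floor, rand}

val blockCount = 100

spark.read.orc("gs://pipeline/partitioned/ts_date=2019-02-15/state_code=CA/")
  .withColumn("verdictdbblock", floor(rand() * blockCount).cast("int"))
  .repartition(col("verdictdbblock"))
  .write
  .option("orc.compress", "ZLIB")
  .partitionBy("verdictdbblock")
  .orc("gs://pipeline/scrambles/ts_date=2019-02-15/state_code=CA/")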

pyongjoo commented 5 years ago

@voycey Assuming you will be using Google BigQuery (or at least the Google Cloud ecosystem), what do you think of this workflow? This is to figure out the simplest way for you to use VerdictDB. Once the interface is determined, we can engineer its internals for higher efficiency.

  1. You store raw (or compressed) files in Google Cloud Storage. Let's say on Jan 1 you collected mydata_jan1.csv, on Jan 2 you collected mydata_jan2.csv, and so on.
  2. You let VerdictDB load your data into its own "some data store" by pointing it at the locations of those files. The "some data store" will simply be BigQuery tables (managed by VerdictDB).
  3. You run your queries against VerdictDB.

voycey commented 5 years ago

BigQuery will be able to create the scrambles much faster than any other DBMS, that's for sure. If everything can be run in there, I think that is a good solution.

However, the "some data store" is likely to be problematic at the scale we are talking about, which is why I am suggesting working directly on the files for scramble creation.

Spark (not Spark SQL) has been the only reliable method of cheaply processing bulk files at the scale we are at; the moment any kind of query engine is placed around it, the overheads become problematic (we reduced our cluster sizes yesterday and some counts now take 4 hours).

It may not be possible without this query engine, I don't know. But the:

File.read().doSomething().write()

workflow is something that has reliably worked for us at scale.

I think this is due to the fact that our data is partitioned, so we can run exactly the day & state we want.
