mrocklin opened this issue 1 year ago
@mrocklin I'm looking into getting this working in my AWS account, but I've been having a little trouble due to the vCPU limit; I'm looking to have that increased now.
So far, given the limit above, I've started building out this solution with a Dask LocalCluster, working with only a single partition from your notebook. That alone speaks volumes: I have what I'd consider a pretty powerful desktop and am still unable to work across the entire nyc-tlc/trip data/fhvhv_tripdata_*.parquet data set locally.
Working with this partition, I repartitioned the DataFrame and applied the more efficient data types you demonstrate in your notebook. Based on Snowflake documentation regarding File Sizing Best Practices and Limitations, my first test used a partition size of 512 MB, which produced Parquet files of roughly 130 MB each.
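For context on the data-type step, here is a minimal sketch of the kind of conversion the notebook applies. The column names mirror a few fhvhv columns, but the values and dtype choices below are illustrative, not taken from the notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the fhvhv schema; the real notebook applies
# similar conversions across the full set of columns.
df = pd.DataFrame({
    "hvfhs_license_num": ["HV0003", "HV0005", "HV0003"] * 1000,
    "PULocationID": np.array([132, 61, 244] * 1000, dtype="int64"),
    "tips": np.array([0.0, 2.5, 0.0] * 1000, dtype="float64"),
})

before = df.memory_usage(deep=True).sum()
df = df.astype({
    "hvfhs_license_num": "category",  # low-cardinality strings -> category
    "PULocationID": "int16",          # location IDs fit comfortably in int16
    "tips": "float32",                # halve floating-point storage
})
after = df.memory_usage(deep=True).sum()
```

The categorical conversion does most of the work here, since repeated license strings collapse to small integer codes.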
This is where I tweaked some code within the "Repartition to smaller chunks!" section of the notebook, that looked like this:
# Taking only the first partition due to current limitations.
df = df.partitions[0]
df = df.repartition(partition_size="512MB").persist()

# Name each output file by its partition index.
name_function = lambda x: f"fhvhv_tripdata_{x}.parquet"
df.to_parquet(
    path="s3://mybucket/",
    engine="pyarrow",
    compression="snappy",
    name_function=name_function,
)
To take this a step further, I wanted to continue to leverage cloud storage as the current notebook is doing, so I wrote these files to S3 and configured a Storage Integration and an External Stage in Snowflake using the following code, edited for sensitivity:
USE ROLE ACCOUNTADMIN;

CREATE STORAGE INTEGRATION S3_INT
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'role_arn'
  STORAGE_ALLOWED_LOCATIONS = ('s3://mybucket/');

DESCRIBE STORAGE INTEGRATION S3_INT;

GRANT USAGE ON INTEGRATION S3_INT TO ROLE SYSADMIN;

USE ROLE SYSADMIN;

CREATE STAGE FHVHV
  STORAGE_INTEGRATION = S3_INT
  URL = 's3://mybucket/'
  FILE_FORMAT = (TYPE = PARQUET);
I ran a quick test to ensure permissions were set correctly between the Snowflake Account and the S3 Bucket by listing the files from the stage.
With this in place, I wanted to leverage Snowpark to see about starting to work with this table using Snowflake. Going back to Python, I wrote some code to explore the data.
from snowflake.snowpark import Session
import json

# Load connection parameters from a local secrets file.
with open("secrets.json") as f:
    connection_parameters = json.load(f)["SNOWFLAKE_CONNECTION"]

session = Session.builder.configs(connection_parameters).create()
session.read.parquet("@FHVHV").show()
The show command took around 15 seconds to complete on a cold X-Small warehouse.
Rather than reading the contents across the network using the stage, I persisted this data to a Snowflake table, still using the same X-Small warehouse.
session.read.parquet("@FHVHV").write.save_as_table("FHVHV")
This was completed in about 1 minute and 12 seconds.
Now that the table is persisted, the entire data set compresses to 462.7 MB inside Snowflake while retaining the same contents. I found this interesting, as the separate snappy-compressed files totaled ~664 MB on their own. Rerunning a show command against the table on the same X-Small warehouse completed in 1.4 seconds.
We discussed looking at a workflow management system, assuming a new file lands at a specific cadence and that we're leveraging cloud storage. As part of the puzzle, I think we could use a Snowpipe to automate the ingestion piece once the data has been optimized upstream and landed in cloud storage. I'd like to try to piece this together.
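A rough sketch of what that Snowpipe could look like, assuming the FHVHV stage and table above. The pipe name is illustrative, and with AUTO_INGEST an S3 event notification still has to be wired to the pipe's notification channel:

```sql
CREATE PIPE FHVHV_PIPE
  AUTO_INGEST = TRUE
AS
  COPY INTO FHVHV
  FROM @FHVHV
  FILE_FORMAT = (TYPE = PARQUET)
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```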
I would be interested in benchmarking this more with differently-sized files and various-sized warehouses to see if there is additional performance gain on the Snowflake side. I was only using an X-Small warehouse, but I would anticipate additional performance scaling up.
I hope we can test this with the entire dataset using Coiled soon. Another route I might like to explore is trying different file types aside from Parquet.
@mrocklin I'm looking into getting this working in my AWS account, but I've been having a little trouble due to the vCPU limit; I'm looking to have that increased now.
I've added you to my account. Add account="mrocklin" to your coiled.Cluster command and you should be good.
Potential Next Steps
I'm also curious about the difference in querying between coiled and snowflake for typical queries. I wouldn't be surprised to learn that Snowflake was faster/easier given that it's more specialized for this job. It might be useful to verify.
I agree that it would be useful to start looking into keeping this updated. Presumably there is a new file every month. What do we set up and where to make sure that we keep this warehouse up to date? My tools of choice here would be either Github actions or Prefect. @dchudz any ideas?
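For the GitHub Actions route, a minimal scheduled-workflow sketch. The cron cadence, step names, and ingest.py script are all illustrative placeholders, assuming only the monthly-file cadence discussed above:

```yaml
name: monthly-fhvhv-ingest
on:
  schedule:
    - cron: "0 6 2 * *"   # illustrative: run on the 2nd of each month
  workflow_dispatch: {}    # allow manual runs while testing
jobs:
  ingest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - run: pip install coiled "dask[complete]" s3fs
      - run: python ingest.py   # hypothetical script doing the repartition/write
```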
My tools of choice here would be either Github actions or Prefect. @dchudz any ideas?
I don't have a strong view. There are many good options: Prefect, Dagster, GitHub Actions, GCP Cloud Functions, AWS Lambda, ...
(I might personally go with one of the latter two since they're what I know, and maybe it's nice that Lambda can trigger based on objects appearing in S3.)
I'd be very happy to see this proceed with any choice of orchestration tool.
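A minimal sketch of the handler shape a Lambda triggered by S3 would run. The nested keys follow AWS's documented S3 event notification format; the handler name is arbitrary, and the Coiled kick-off is left as a comment since it needs cloud credentials:

```python
# Hypothetical Lambda handler for S3 ObjectCreated event notifications.
def handler(event, context=None):
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Here we would start a coiled.Cluster and submit the
        # repartition/write job for the newly landed file.
        processed.append(f"s3://{bucket}/{key}")
    return processed
```

Invoking it with a synthetic event shows the extraction: `handler({"Records": [{"s3": {"bucket": {"name": "mybucket"}, "object": {"key": "new.parquet"}}}]})` returns `["s3://mybucket/new.parquet"]`.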
and maybe it's nice that Lambda can trigger based on objects appearing in S3.)
That does sound nice, and like the right tool for the job. I have no experience with Lambda, unfortunately.
@mrocklin, I was able to access the cluster; thank you! I got a good bit of work done quickly using it. More details are below.
Before starting the full different-size file analysis, I wanted to log where we're starting, to establish a baseline and demonstrate just how useful your DataFrame optimizations were.
filename | raw_file_size | default_pandas_file_size | optimized_pandas_file_size |
---|---|---|---|
fhvhv_tripdata_2019-02.parquet | 489.29 MiB | 10.85 GiB | 2.13 GiB |
fhvhv_tripdata_2019-03.parquet | 582.56 MiB | 12.88 GiB | 2.53 GiB |
fhvhv_tripdata_2019-04.parquet | 533.73 MiB | 11.82 GiB | 2.31 GiB |
fhvhv_tripdata_2019-05.parquet | 544.38 MiB | 12.29 GiB | 2.37 GiB |
fhvhv_tripdata_2019-06.parquet | 511.35 MiB | 11.97 GiB | 2.22 GiB |
fhvhv_tripdata_2019-07.parquet | 492.10 MiB | 11.27 GiB | 2.15 GiB |
fhvhv_tripdata_2019-08.parquet | 478.21 MiB | 11.17 GiB | 2.13 GiB |
fhvhv_tripdata_2019-09.parquet | 492.04 MiB | 11.15 GiB | 2.12 GiB |
fhvhv_tripdata_2019-10.parquet | 523.92 MiB | 11.74 GiB | 2.23 GiB |
fhvhv_tripdata_2019-11.parquet | 535.02 MiB | 12.02 GiB | 2.29 GiB |
fhvhv_tripdata_2019-12.parquet | 549.94 MiB | 12.35 GiB | 2.35 GiB |
fhvhv_tripdata_2020-01.parquet | 506.66 MiB | 11.41 GiB | 2.17 GiB |
fhvhv_tripdata_2020-02.parquet | 532.98 MiB | 12.06 GiB | 2.30 GiB |
fhvhv_tripdata_2020-03.parquet | 330.43 MiB | 7.44 GiB | 1.42 GiB |
fhvhv_tripdata_2020-04.parquet | 109.52 MiB | 2.46 GiB | 467.12 MiB |
fhvhv_tripdata_2020-05.parquet | 153.11 MiB | 3.38 GiB | 658.75 MiB |
fhvhv_tripdata_2020-06.parquet | 188.80 MiB | 4.18 GiB | 815.54 MiB |
fhvhv_tripdata_2020-07.parquet | 248.20 MiB | 5.53 GiB | 1.05 GiB |
fhvhv_tripdata_2020-08.parquet | 276.99 MiB | 6.16 GiB | 1.17 GiB |
fhvhv_tripdata_2020-09.parquet | 300.11 MiB | 6.73 GiB | 1.28 GiB |
fhvhv_tripdata_2020-10.parquet | 329.19 MiB | 7.57 GiB | 1.40 GiB |
fhvhv_tripdata_2020-11.parquet | 287.87 MiB | 6.44 GiB | 1.23 GiB |
fhvhv_tripdata_2020-12.parquet | 289.22 MiB | 6.46 GiB | 1.23 GiB |
fhvhv_tripdata_2021-01.parquet | 294.61 MiB | 6.62 GiB | 1.26 GiB |
fhvhv_tripdata_2021-02.parquet | 288.61 MiB | 6.44 GiB | 1.23 GiB |
fhvhv_tripdata_2021-03.parquet | 351.31 MiB | 7.90 GiB | 1.50 GiB |
fhvhv_tripdata_2021-04.parquet | 351.35 MiB | 7.84 GiB | 1.49 GiB |
fhvhv_tripdata_2021-05.parquet | 369.31 MiB | 8.18 GiB | 1.56 GiB |
fhvhv_tripdata_2021-06.parquet | 375.86 MiB | 8.31 GiB | 1.58 GiB |
fhvhv_tripdata_2021-07.parquet | 377.62 MiB | 8.34 GiB | 1.59 GiB |
fhvhv_tripdata_2021-08.parquet | 364.62 MiB | 8.04 GiB | 1.53 GiB |
fhvhv_tripdata_2021-09.parquet | 375.45 MiB | 8.26 GiB | 1.57 GiB |
fhvhv_tripdata_2021-10.parquet | 410.19 MiB | 9.19 GiB | 1.75 GiB |
fhvhv_tripdata_2021-11.parquet | 392.02 MiB | 8.92 GiB | 1.70 GiB |
fhvhv_tripdata_2021-12.parquet | 391.90 MiB | 8.92 GiB | 1.70 GiB |
fhvhv_tripdata_2022-01.parquet | 357.26 MiB | 8.20 GiB | 1.56 GiB |
fhvhv_tripdata_2022-02.parquet | 388.29 MiB | 8.89 GiB | 1.69 GiB |
fhvhv_tripdata_2022-03.parquet | 449.40 MiB | 10.24 GiB | 1.95 GiB |
fhvhv_tripdata_2022-04.parquet | 434.45 MiB | 9.86 GiB | 1.88 GiB |
fhvhv_tripdata_2022-05.parquet | 446.86 MiB | 10.09 GiB | 1.92 GiB |
fhvhv_tripdata_2022-06.parquet | 436.97 MiB | 9.88 GiB | 1.88 GiB |
fhvhv_tripdata_2022-07.parquet | 423.17 MiB | 9.70 GiB | 1.85 GiB |
fhvhv_tripdata_2022-08.parquet | 416.31 MiB | 9.55 GiB | 1.82 GiB |
fhvhv_tripdata_2022-09.parquet | 436.79 MiB | 9.88 GiB | 1.88 GiB |
I started the Coiled cluster, read all of the "s3://nyc-tlc/trip data/fhvhv_tripdata_*.parquet" files into a Dask DataFrame, applied your data type optimization, and persisted the DataFrame. I then tested repartitioning the data into various sizes, writing snappy Parquet files for each size and pumping them to an S3 bucket in the same region.
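As a back-of-the-envelope check on that sweep, the number of output partitions for each target size is roughly the total in-memory size divided by the target, rounded up. The total below is a made-up placeholder, not a measured figure:

```python
import math

# Illustrative stand-in for the optimized DataFrame's total in-memory size.
total_bytes = 75 * 1024**3  # ~75 GiB, hypothetical

def n_partitions(total_bytes, target_bytes):
    """Estimate partition count so each partition is at most target_bytes."""
    return math.ceil(total_bytes / target_bytes)

targets_mb = [128, 256, 512, 1024]
counts = {f"{t}MB": n_partitions(total_bytes, t * 1024**2) for t in targets_mb}
# counts -> {"128MB": 600, "256MB": 300, "512MB": 150, "1024MB": 75}
```

On-disk snappy Parquet files come out several times smaller than the in-memory partitions (as the 512 MB -> ~130 MB observation above showed), so the file count matches the partition count but the file sizes don't.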
I have yet to test performance from reading these files from Snowflake over each "folder" of right-sized files, but I will look to try this soon.
I'd have to answer your question about typical query performance between Coiled and Snowflake with the classic "it depends." In most scenarios where data is persisted to a Snowflake table, I'd expect better performance there due to the nature of Snowflake's Micro-partitions & Data Clustering. Performance is likely to be similar if both platforms read files directly from (in Snowflake's case, external) cloud storage. I agree it would be useful to test and compare different operations.
Regarding an orchestration solution, I think we can try an example with AWS Lambda using Event Notifications, as @dchudz mentioned about objects landing in S3. Would it be possible to do a small test with Lambda to see if a Coiled cluster could be spun up to do the file splitting/writing? This might mean packaging the appropriate Python libraries as runtime dependencies for the Lambda. I think this is feasible, as the Lambda itself would use limited compute; most of the work would happen in the Coiled cluster.
I hope to have some of those benchmarks in Snowflake soon regarding the different-size file analysis.
Would it be possible to do a small test with Lambda to see if a Coiled cluster could be spun up to do the file splitting/writing? This might mean packaging the appropriate Python libraries as runtime dependencies for the Lambda. I think this is feasible, as the Lambda itself would use limited compute; most of the work would happen in the Coiled cluster.
Sounds possible!
There are some subtleties if you want the cluster to keep working after the lambda times out (15 minutes).
I have a draft blog post on that (never finished, oops) that I can send you. So that's one reason Lambda isn't ideal. It's definitely feasible for the cluster to finish its work after the Lambda exits, but maybe annoying. Happy to help, though, or we can go another route.
Here's the draft post just in case you do end up with a cluster that you want to keep working when its client lambda goes away:
There are some subtleties if you want the cluster to keep working after the lambda times out (15 minutes).
I imagine that we'll likely spend less than a minute after the cluster comes up. We can always do fire-and-forget though if we want to be safe (and let the lambda release quickly).
@dchudz one can always submit a general function to run on the cluster as well:
from dask.distributed import get_client, fire_and_forget
import dask.dataframe as dd

def f():
    # Runs on a worker; get_client() reattaches to the scheduler from there.
    client = get_client()
    df = dd.read_parquet(...)
    df.to_parquet(...)

# `client` is the Client the caller (e.g., the lambda) is connected with;
# the cluster keeps the submitted task alive after the caller goes away.
fire_and_forget(client.submit(f))
This is maybe a little easier than being careful with compute calls.
I want to complete the orchestration pipeline; these do seem like feasible solutions, and I'll see if another colleague can help me set up the Lambda side. We'll probably want to start with fewer workers and smaller files for testing, but the fire_and_forget reference is helpful here. For inspiration, I want to watch this video.
I've put together a repo demonstrating the value of right-sizing files before storage persistence; I would love your feedback. Now that the data is persisted, we can do some performance comparisons if you like when retrieving data. I put a few examples at the bottom of the notebook.
https://github.com/IndexSeek/ingestion-snowflake-coiled-benchmarking/blob/main/notebook.ipynb
@IndexSeek -- I've been working on creating an orchestration pipeline. I spent some time with Lambda but ended up switching to Prefect for orchestration. It's still a work in progress, but I wanted to share a link to the repo in case you were interested. I'll be refining this over the next day or so before publishing.
@hayesgb This is great stuff! I started down a similar road, though mine differed: I was looking to trigger the Prefect flow with an S3 Event Notification and Lambda via the Prefect REST API. Admittedly, your solution involves less configuration, with the daily flow execution checking for files.
If we can, I would still like to demonstrate the "end-to-end" solution with Snowflake; this would occur downstream via AWS/Snowflake, with an event notification configured on the bucket to which the load_and_clean_data task is writing. This complete solution might demonstrate the following:
@dchudz and I were chatting with @IndexSeek who was curious about extending this example to show how to pre-process data for efficient loading into Snowflake.
@IndexSeek there is a notebook in this repository with the example that you saw in this video. Are you able to run it? (Also, if running this on your personal account becomes onerous, let me know and I'll add your Coiled account to one of ours.)
If so, I'll suggest some next steps:
Thoughts?