astrojuanlu opened 1 week ago
@astrojuanlu, could you please clarify the following:
In the first step, should we focus on simply regularising the current process of loading data from clickhouse_connect? If this is correct, could you share the full version of the code you used to create the current table in Snowflake? It looks like part of it was missing (I noticed some sections replaced with "...").
The current process was me uploading the Parquet file to Snowflake using the UI 😄
So, the goal is to have that automated in Python as a Kedro project. It can be done using Snowpark, Polars, Ibis, raw SQL, and possibly other ways.
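As a rough illustration of the raw-SQL route, a helper like the one below could replace the manual UI upload. This is a sketch only: the stage name, file path, and function name are assumptions, and the cursor is expected to behave like a snowflake-connector-python cursor.

```python
def load_parquet_to_snowflake(cursor, local_path, table, stage="@~/pypi_uploads"):
    """Upload a local Parquet file to a stage and COPY it into a table.

    `cursor` should behave like a snowflake-connector-python cursor;
    the stage and the column casts below are illustrative, mirroring
    the SQL that the Snowflake UI generated.
    """
    statements = [
        # Upload the file to a (user) stage.
        f"PUT 'file://{local_path}' {stage} AUTO_COMPRESS=FALSE",
        # Load it, casting the Parquet columns explicitly.
        f"""
        COPY INTO {table}
        FROM (SELECT $1:project::VARCHAR, $1:version::VARCHAR, $1:date::DATE,
                     $1:installer::VARCHAR, $1:type::VARCHAR, $1:count::NUMBER(38, 0)
              FROM {stage})
        FILE_FORMAT = (TYPE = PARQUET)
        ON_ERROR = ABORT_STATEMENT
        """,
    ]
    for stmt in statements:
        cursor.execute(stmt)
    return statements
```

In a Kedro project this would sit inside a node or a custom dataset, with the connection details supplied through credentials configuration rather than hard-coded.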
In the second step, after further discussions, should we aim to make the process more granular by incorporating data from the raw PyPI dataset?
That second step should be explored separately, only if our reporting needs demand it.
could you share the full version of the code you used to create the current table in Snowflake? It looks like part of it was missing (I noticed some sections replaced with "...").
In case it helps, this is the generated SQL code:
CREATE TABLE "KEDRO_BI_DB"."PYPI"."TEST_ALL_KEDRO_AGGREGATES2" (
    project VARCHAR,
    version VARCHAR,
    date DATE,
    installer VARCHAR,
    type VARCHAR,
    count NUMBER(38, 0)
);
CREATE TEMP FILE FORMAT "KEDRO_BI_DB"."PYPI"."temp_file_format_infer_2024-09-24T14:05:18.630Z"
TYPE=PARQUET
REPLACE_INVALID_CHARACTERS=TRUE
BINARY_AS_TEXT=FALSE;
COPY INTO "KEDRO_BI_DB"."PYPI"."TEST_ALL_KEDRO_AGGREGATES2"
FROM (SELECT $1:project::VARCHAR, $1:version::VARCHAR, $1:date::DATE, $1:installer::VARCHAR, $1:type::VARCHAR, $1:count::NUMBER(38, 0)
FROM '@"KEDRO_BI_DB"."PYPI"."__snowflake_temp_import_files__"')
FILES = ('2024-09-24T14:05:05.076Z/all_kedro_aggregates.pq')
FILE_FORMAT = '"KEDRO_BI_DB"."PYPI"."temp_file_format_infer_2024-09-24T14:05:18.630Z"'
ON_ERROR=ABORT_STATEMENT;
-- For more details, see: https://docs.snowflake.com/en/sql-reference/sql/copy-into-table
(This issue tracks an internal workstream of the Kedro team at QB)
@lid-rs is working on some dashboards that help us analyse our metrics and report on their progress. There's now a sample of PyPI downloads in our internal database that was ingested with a one-off process:
Naturally, this data is a snapshot that will get out of date over time.
Goal
Periodically ingest PyPI download data for Kedro and related packages to serve our downloads monitoring.
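To sketch what the periodic ingestion could produce, the function below aggregates raw download events into rows matching the table schema above (project, version, date, installer, type, count). The package list, field names, and grouping are assumptions for illustration; in practice this would be wrapped as a Kedro node.

```python
from collections import Counter

# Assumed package list; the real one would come from configuration.
KEDRO_PACKAGES = {"kedro", "kedro-datasets", "kedro-viz"}

def aggregate_downloads(raw_rows):
    """Aggregate raw PyPI download events into one count per
    (project, version, date, installer, type) combination,
    matching the columns of the Snowflake table above.
    """
    counts = Counter(
        (r["project"], r["version"], r["date"], r["installer"], r["type"])
        for r in raw_rows
        if r["project"] in KEDRO_PACKAGES
    )
    return [
        {"project": p, "version": v, "date": d, "installer": i, "type": t, "count": c}
        for (p, v, d, i, t), c in sorted(counts.items())
    ]
```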
Notes