kedro-org / kedro-devrel

The Kedro developer relations team uses this for content creation ideation and execution
Apache License 2.0

Ingest PyPI data for internal reporting #145

Open astrojuanlu opened 1 week ago

astrojuanlu commented 1 week ago

(This issue tracks an internal workstream of the Kedro team at QB)

@lid-rs is working on some dashboards that help us analyse our metrics and report on their progress. Now there's a sample of PyPI downloads in our internal database that was ingested with a one-off process:

import clickhouse_connect
import polars as pl

# Query the public ClickPy ClickHouse instance for aggregated PyPI download counts
client = clickhouse_connect.get_client(
    host="clickpy-clickhouse.clickhouse.com", username="play", database="pypi", secure=True
)

kedro_all_total_downloads = pl.from_arrow(
    client.query_arrow("SELECT project, count FROM pypi_downloads WHERE project LIKE 'kedro%'")
)

...
df.write_parquet("all_kedro_aggregates.pq")

But naturally, this data is a snapshot that will get out of date over time.

Goal

Periodically ingest PyPI downloads data for Kedro and related packages to serve our downloads monitoring.

Notes

DimedS commented 8 hours ago

@astrojuanlu, could you please clarify the following:

  1. In the first step, should we focus on simply regularising the current process of loading data from clickhouse_connect?
  2. In the second step, after further discussions, should we aim to make the process more granular by incorporating data from the raw PyPI dataset?

If this is correct, could you share the full version of the code you used to create the current table in Snowflake? It looks like part of it was missing (I noticed some sections replaced with "...").

astrojuanlu commented 4 hours ago

In the first step, should we focus on simply regularising the current process of loading data from clickhouse_connect?

The current process was me uploading the Parquet file to Snowflake using the UI 😄

[screenshot: uploading the Parquet file through the Snowflake web UI]

So, the goal is to have that automated in Python as a Kedro project. It can be done using Snowpark, Polars, Ibis, raw SQL, and possibly other ways.
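For concreteness, here is a minimal sketch of what that could look like as Kedro nodes, assuming Polars for the extract and snowflake-connector-python's write_pandas for the load. The Snowflake credentials, the target table name ALL_KEDRO_AGGREGATES and the dataset name kedro_pypi_downloads are placeholders, not the actual setup:

# nodes.py -- minimal sketch, not the actual implementation.
# Assumes Polars + snowflake-connector-python; credentials, the table name
# and the dataset name are placeholders.
import clickhouse_connect
import polars as pl
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas


def extract_kedro_downloads() -> pl.DataFrame:
    # Same query as the one-off process above, against the public ClickPy instance
    client = clickhouse_connect.get_client(
        host="clickpy-clickhouse.clickhouse.com",
        username="play",
        database="pypi",
        secure=True,
    )
    return pl.from_arrow(
        client.query_arrow(
            "SELECT project, count FROM pypi_downloads WHERE project LIKE 'kedro%'"
        )
    )


def load_to_snowflake(df: pl.DataFrame) -> None:
    # Placeholder connection parameters; in a Kedro project these would come
    # from credentials.yml rather than being hardcoded
    conn = snowflake.connector.connect(
        account="<account>",
        user="<user>",
        password="<password>",
        database="KEDRO_BI_DB",
        schema="PYPI",
    )
    try:
        write_pandas(
            conn,
            df.to_pandas(),
            table_name="ALL_KEDRO_AGGREGATES",  # placeholder table name
            auto_create_table=True,
        )
    finally:
        conn.close()


# pipeline.py -- wiring the nodes together
from kedro.pipeline import Pipeline, node


def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            node(extract_kedro_downloads, inputs=None, outputs="kedro_pypi_downloads"),
            node(load_to_snowflake, inputs="kedro_pypi_downloads", outputs=None),
        ]
    )

In a real project the load step could equally be expressed as a catalog dataset (e.g. one of the Snowflake/Snowpark datasets in kedro-datasets, or Ibis/raw SQL as mentioned above) instead of a hand-written node.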

In the second step, after further discussions, should we aim to make the process more granular by incorporating data from the raw PyPI dataset?

That second step should be explored separately, only if our reporting needs demand it.

could you share the full version of the code you used to create the current table in Snowflake? It looks like part of it was missing (I noticed some sections replaced with "...").

In case it helps, this is the generated SQL code:

CREATE TABLE "KEDRO_BI_DB"."PYPI"."TEST_ALL_KEDRO_AGGREGATES2" (
    project VARCHAR,
    version VARCHAR,
    date DATE,
    installer VARCHAR,
    type VARCHAR,
    count NUMBER(38, 0)
);

CREATE TEMP FILE FORMAT "KEDRO_BI_DB"."PYPI"."temp_file_format_infer_2024-09-24T14:05:18.630Z"
    TYPE=PARQUET
    REPLACE_INVALID_CHARACTERS=TRUE
    BINARY_AS_TEXT=FALSE; 

COPY INTO "KEDRO_BI_DB"."PYPI"."TEST_ALL_KEDRO_AGGREGATES2" 
FROM (SELECT $1:project::VARCHAR, $1:version::VARCHAR, $1:date::DATE, $1:installer::VARCHAR, $1:type::VARCHAR, $1:count::NUMBER(38, 0)
    FROM '@"KEDRO_BI_DB"."PYPI"."__snowflake_temp_import_files__"') 
FILES = ('2024-09-24T14:05:05.076Z/all_kedro_aggregates.pq') 
FILE_FORMAT = '"KEDRO_BI_DB"."PYPI"."temp_file_format_infer_2024-09-24T14:05:18.630Z"' 
ON_ERROR=ABORT_STATEMENT 
-- For more details, see: https://docs.snowflake.com/en/sql-reference/sql/copy-into-table
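And in case it helps with the automation, a hedged sketch of reproducing that same load outside the UI with snowflake-connector-python (PUT into the table's internal stage, then COPY INTO); the connection parameters and the local file path are placeholders:

# Sketch only: replicate the UI upload programmatically with snowflake-connector-python.
# Connection parameters and the local file path are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account>",
    user="<user>",
    password="<password>",
    database="KEDRO_BI_DB",
    schema="PYPI",
)

with conn.cursor() as cur:
    # Upload the local Parquet file to the table's internal stage
    cur.execute(
        "PUT file:///path/to/all_kedro_aggregates.pq @%TEST_ALL_KEDRO_AGGREGATES2"
    )
    # Load it into the table, matching Parquet columns to table columns by name
    cur.execute(
        """
        COPY INTO TEST_ALL_KEDRO_AGGREGATES2
        FROM @%TEST_ALL_KEDRO_AGGREGATES2
        FILE_FORMAT = (TYPE = PARQUET)
        MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
        ON_ERROR = ABORT_STATEMENT
        """
    )

conn.close()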