kedro-org / kedro-devrel

The Kedro developer relations team uses this for content creation ideation and execution
Apache License 2.0

Ingest PyPI data for internal reporting #145

Open astrojuanlu opened 1 week ago

astrojuanlu commented 1 week ago

(This issue tracks an internal workstream of the Kedro team at QB)

@lid-rs is working on some dashboards that help us analyse our metrics and report on their progress. Now there's a sample of PyPI downloads in our internal database that was ingested with a one-off process:

import clickhouse_connect
import polars as pl

# Query the public ClickPy ClickHouse instance for aggregated PyPI download counts
client = clickhouse_connect.get_client(
    host="clickpy-clickhouse.clickhouse.com", username="play", database="pypi", secure=True
)

kedro_all_total_downloads = pl.from_arrow(
    client.query_arrow("SELECT project, count FROM pypi_downloads WHERE project LIKE 'kedro%'")
)

...
df.write_parquet("all_kedro_aggregates.pq")

But naturally, this data is a snapshot that will get out of date over time.

Goal

Periodically ingest PyPI downloads data for Kedro and related packages to serve our downloads monitoring.

Notes

DimedS commented 8 hours ago

@astrojuanlu, could you please clarify the following:

  1. In the first step, should we focus on simply regularising the current process of loading data from clickhouse_connect?
  2. In the second step, after further discussions, should we aim to make the process more granular by incorporating data from the raw PyPI dataset?

If this is correct, could you share the full version of the code you used to create the current table in Snowflake? It looks like part of it was missing (I noticed some sections replaced with "...").

astrojuanlu commented 4 hours ago

In the first step, should we focus on simply regularising the current process of loading data from clickhouse_connect?

The current process was me uploading the Parquet file to Snowflake using the UI 😄

[screenshot: uploading the Parquet file through the Snowflake web UI]

So, the goal is to have that automated in Python as a Kedro project. It can be done using Snowpark, Polars, Ibis, raw SQL, and possibly other ways.
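For concreteness, here is a minimal sketch of what that could look like as Kedro nodes, assuming Polars for the extract and snowflake-connector-python's write_pandas for the load. The Snowflake credentials, the target table name ALL_KEDRO_AGGREGATES and the dataset name kedro_pypi_downloads are placeholders, not the actual setup:

# nodes.py -- minimal sketch, not the actual implementation.
# Assumes Polars + snowflake-connector-python; credentials, the table name
# and the dataset name are placeholders.
import clickhouse_connect
import polars as pl
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas


def extract_kedro_downloads() -> pl.DataFrame:
    # Same query as the one-off process above, against the public ClickPy instance
    client = clickhouse_connect.get_client(
        host="clickpy-clickhouse.clickhouse.com",
        username="play",
        database="pypi",
        secure=True,
    )
    return pl.from_arrow(
        client.query_arrow(
            "SELECT project, count FROM pypi_downloads WHERE project LIKE 'kedro%'"
        )
    )


def load_to_snowflake(df: pl.DataFrame) -> None:
    # Placeholder connection parameters; in a Kedro project these would come
    # from credentials.yml rather than being hardcoded
    conn = snowflake.connector.connect(
        account="<account>",
        user="<user>",
        password="<password>",
        database="KEDRO_BI_DB",
        schema="PYPI",
    )
    try:
        write_pandas(
            conn,
            df.to_pandas(),
            table_name="ALL_KEDRO_AGGREGATES",  # placeholder table name
            auto_create_table=True,
        )
    finally:
        conn.close()


# pipeline.py -- wiring the nodes together
from kedro.pipeline import Pipeline, node


def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            node(extract_kedro_downloads, inputs=None, outputs="kedro_pypi_downloads"),
            node(load_to_snowflake, inputs="kedro_pypi_downloads", outputs=None),
        ]
    )

In a real project the load step could equally be expressed as a catalog dataset (e.g. one of the Snowflake/Snowpark datasets in kedro-datasets, or Ibis/raw SQL as mentioned above) instead of a hand-written node.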

In the second step, after further discussions, should we aim to make the process more granular by incorporating data from the raw PyPI dataset?

That second step should be explored separately, only if our reporting needs demand it.

could you share the full version of the code you used to create the current table in Snowflake? It looks like part of it was missing (I noticed some sections replaced with "...").

In case it helps, this is the generated SQL code:

CREATE TABLE "KEDRO_BI_DB"."PYPI"."TEST_ALL_KEDRO_AGGREGATES2" (
    project VARCHAR,
    version VARCHAR,
    date DATE,
    installer VARCHAR,
    type VARCHAR,
    count NUMBER(38, 0)
);

CREATE TEMP FILE FORMAT "KEDRO_BI_DB"."PYPI"."temp_file_format_infer_2024-09-24T14:05:18.630Z"
    TYPE=PARQUET
    REPLACE_INVALID_CHARACTERS=TRUE
    BINARY_AS_TEXT=FALSE; 

COPY INTO "KEDRO_BI_DB"."PYPI"."TEST_ALL_KEDRO_AGGREGATES2" 
FROM (SELECT $1:project::VARCHAR, $1:version::VARCHAR, $1:date::DATE, $1:installer::VARCHAR, $1:type::VARCHAR, $1:count::NUMBER(38, 0)
    FROM '@"KEDRO_BI_DB"."PYPI"."__snowflake_temp_import_files__"') 
FILES = ('2024-09-24T14:05:05.076Z/all_kedro_aggregates.pq') 
FILE_FORMAT = '"KEDRO_BI_DB"."PYPI"."temp_file_format_infer_2024-09-24T14:05:18.630Z"' 
ON_ERROR=ABORT_STATEMENT 
-- For more details, see: https://docs.snowflake.com/en/sql-reference/sql/copy-into-table
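And in case it helps with the automation, a hedged sketch of reproducing that same load outside the UI with snowflake-connector-python (PUT into the table's internal stage, then COPY INTO); the connection parameters and the local file path are placeholders:

# Sketch only: replicate the UI upload programmatically with snowflake-connector-python.
# Connection parameters and the local file path are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account>",
    user="<user>",
    password="<password>",
    database="KEDRO_BI_DB",
    schema="PYPI",
)

with conn.cursor() as cur:
    # Upload the local Parquet file to the table's internal stage
    cur.execute(
        "PUT file:///path/to/all_kedro_aggregates.pq @%TEST_ALL_KEDRO_AGGREGATES2"
    )
    # Load it into the table, matching Parquet columns to table columns by name
    cur.execute(
        """
        COPY INTO TEST_ALL_KEDRO_AGGREGATES2
        FROM @%TEST_ALL_KEDRO_AGGREGATES2
        FILE_FORMAT = (TYPE = PARQUET)
        MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
        ON_ERROR = ABORT_STATEMENT
        """
    )

conn.close()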