openstatusHQ / openstatus

🏓 The open-source synthetic monitoring platform 🏓
https://openstatus.dev
GNU Affero General Public License v3.0

Improve tinybird data project #278

Closed gnzjgo closed 1 year ago

gnzjgo commented 1 year ago

Issue to track performance improvements in the Tinybird resources, starting with the Data Source Sorting Keys, then exploring pipe SQL syntax and Materialized Views.

gnzjgo commented 1 year ago

Migration from ping_response__v2 to ping_response__v3

Brief explanation of the process

To change the Sorting Key (SK) I followed Scenario 3 of the Iterating Data Sources guide, using versions instead of different names.

In summary:

  1. Create the new ping_response DS v3
  2. Start Materializing data from a future timestamp
  3. After that given timestamp, start the backfill with a populate
  4. Once the populate ends, remove the backfill materializing pipe
  5. Push the endpoints again so they read from the latest ping_response version, ping_response__v3
  6. Change ingest from ping_response__v2 to ping_response__v3 (pending)
  7. Delete tb_materialize_until_change_ingest.pipe and ping_response__v2 (pending)
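The reason steps 2 to 4 lose no rows and duplicate none is that the two temporary pipes split ping_response__v2 at a single cutoff. Here is a minimal Python sketch of that split. The sample rows and function names are hypothetical, and treating the cutoff as UTC is an assumption; the real work is done by the Tinybird pipes shown under Details.

```python
from datetime import datetime, timezone

# Cutoff used by the temporary pipes; interpreting it as UTC is an assumption.
CUTOFF_MS = int(datetime(2023, 9, 5, 22, 16, tzinfo=timezone.utc).timestamp() * 1000)


def materialized_live(rows, cutoff_ms=CUTOFF_MS):
    """Rows copied by the temporary materializing pipe (strictly after the cutoff)."""
    return [r for r in rows if r["cronTimestamp"] > cutoff_ms]


def backfilled(rows, cutoff_ms=CUTOFF_MS):
    """Rows copied by the one-off populate (up to and including the cutoff)."""
    return [r for r in rows if r["cronTimestamp"] <= cutoff_ms]


# Hypothetical sample rows standing in for ping_response__v2.
v2_rows = [
    {"id": "a", "cronTimestamp": CUTOFF_MS - 60_000},
    {"id": "b", "cronTimestamp": CUTOFF_MS},
    {"id": "c", "cronTimestamp": CUTOFF_MS + 60_000},
]

# Every v2 row lands in v3 exactly once, because the two filters are complementary.
v3_rows = backfilled(v2_rows) + materialized_live(v2_rows)
```

Since one filter uses `>` and the other `<=` on the same value, a row sitting exactly on the cutoff goes through the backfill only.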
image

Details

tb push datasources/ping_response.datasource
tb push pipes/tb_materialize_until_change_ingest.pipe
# after the given ts, it is time to run the backfill populate
tb push pipes/tb_backfill_populate.pipe --populate --wait
# after populate ends, it is time to remove the pipe
tb pipe rm tb_backfill_populate --yes

To do:

New DS

#ping_response.datasource
VERSION 3

SCHEMA >
    `id` String `json:$.id`,
    `latency` Int16 `json:$.latency`,
    `monitorId` String `json:$.monitorId`,
    `pageId` String `json:$.pageId`,
    `region` LowCardinality(String) `json:$.region`,
    `statusCode` Int16 `json:$.statusCode`,
    `timestamp` Int64 `json:$.timestamp`,
    `url` String `json:$.url`,
    `workspaceId` String `json:$.workspaceId`,
    `cronTimestamp` Int64 `json:$.cronTimestamp`,
    `metadata` String `json:$.metadata`

ENGINE "MergeTree"
ENGINE_SORTING_KEY "monitorId, cronTimestamp"
ENGINE_PARTITION_KEY "toYYYYMM(fromUnixTimestamp64Milli(cronTimestamp))"

The notable changes here are the new Sorting Key (SK), LowCardinality(String) for region, and no nullable fields. I also added a Partition Key that I think makes sense for this data.
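To give an intuition for why the SK matters, here is a toy Python model. It is only a sketch: ClickHouse uses a sparse index over granules rather than a plain binary search, and the helper names and sample keys below are hypothetical. The idea is the same though: when rows are stored ordered by (monitorId, cronTimestamp), all rows of one monitor sit in one contiguous range, so a query filtering by monitorId can skip everything else. The second helper mimics the partition expression, assuming UTC.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timezone


def partition_of(cron_timestamp_ms):
    """Mimic toYYYYMM(fromUnixTimestamp64Milli(ts)), assuming UTC."""
    dt = datetime.fromtimestamp(cron_timestamp_ms / 1000, tz=timezone.utc)
    return dt.year * 100 + dt.month


def range_for_monitor(sorted_keys, monitor_id):
    """Return (start, end) bounds of one monitor in keys sorted by (monitorId, cronTimestamp)."""
    start = bisect_left(sorted_keys, (monitor_id,))
    end = bisect_right(sorted_keys, (monitor_id, float("inf")))
    return start, end


# Hypothetical (monitorId, cronTimestamp) pairs, stored in sorting-key order.
keys = sorted([("m2", 3), ("m1", 1), ("m1", 2), ("m3", 1)])

start, end = range_for_monitor(keys, "m1")
rows_read = keys[start:end]  # only the rows of monitor m1 are touched
```

With an unsuitable sorting key the same filter would have to look at every row, which is the kind of difference behind the monitor_list numbers reported further down.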

Temporary MV pipes:

#tb_materialize_until_change_ingest.pipe
NODE mat_node
SQL >

    SELECT
        id,
        latency,
        monitorId,
        pageId,
        toLowCardinality(region) region,
        statusCode,
        timestamp,
        url,
        workspaceId,
        coalesce(cronTimestamp, 0) cronTimestamp,
        coalesce(metadata, '') metadata
    FROM ping_response__v2
    WHERE fromUnixTimestamp64Milli(cronTimestamp) > '2023-09-05 22:16:00.000'

TYPE materialized
DATASOURCE ping_response__v3

#tb_backfill_populate.pipe
NODE mat_node
SQL >

    SELECT
        id,
        latency,
        monitorId,
        pageId,
        toLowCardinality(region) region,
        statusCode,
        timestamp,
        url,
        workspaceId,
        coalesce(cronTimestamp, 0) cronTimestamp,
        coalesce(metadata, '') metadata
    FROM ping_response__v2
    WHERE fromUnixTimestamp64Milli(cronTimestamp) <= '2023-09-05 22:16:00.000'

TYPE materialized
DATASOURCE ping_response__v3

Note that in these two pipes we have to reference the version explicitly because we're moving data from v2 to v3. If we had left it as ping_response, it would have resolved to the latest version and attempted a circular MV from ping_response__v3 to ping_response__v3.
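The versioning pitfall can be modelled in a few lines of Python. This is a sketch of the behavior described in the note, not Tinybird's actual resolution logic; `resolve_latest` and `is_circular` are hypothetical helpers.

```python
def resolve_latest(name, existing):
    """Pick the highest version among data sources named `<name>__v<N>`."""
    prefix = name + "__v"
    versions = [
        int(ds[len(prefix):])
        for ds in existing
        if ds.startswith(prefix) and ds[len(prefix):].isdigit()
    ]
    return f"{prefix}{max(versions)}"


def is_circular(source, target, existing):
    """True when a materialized pipe would read from the data source it writes to."""
    resolved = source if "__v" in source else resolve_latest(source, existing)
    return resolved == target


existing = ["ping_response__v2", "ping_response__v3"]

# Unversioned name resolves to v3, the same data source the pipe writes to.
unversioned_is_circular = is_circular("ping_response", "ping_response__v3", existing)
# Pinning the source to v2 avoids the loop.
pinned_is_circular = is_circular("ping_response__v2", "ping_response__v3", existing)
```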

Note that we can be sure both Data Sources hold the same data:

image

And the performance change: monitor_list goes from 60MB to 3MB.

image

FYI @thibaultleouay, @mxkaske: I tried to push a new branch with the changes in ping_response and couldn't:

git push --set-upstream origin 278-tinybird-performance-improvements
ERROR: Permission to openstatusHQ/openstatus.git denied to gnzjgo.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

Not a problem: the only change we want to keep in the repo is the new version of ping_response, and its code is in this comment. The temporary pipes were only needed for the migration.

mxkaske commented 1 year ago

Hey @gnzjgo! Thanks again for your support. We have successfully migrated to ping_response__v3.