Closed gnzjgo closed 1 year ago
To address the SK change I had to follow the Iterating Data Sources guide Scenario 3 (though with versions instead of different names).
In summary:
tb push datasources/ping_response.datasource
tb push pipes/tb_materialize_until_change_ingest.pipe
# after the given ts, it is time to run the backfill populate
tb push pipes/tb_backfill_populate.pipe --populate --wait
# after populate ends, it is time to remove the pipe
tb pipe rm tb_backfill_populate --yes
To do:
#ping_response.datasource
VERSION 3
SCHEMA >
`id` String `json:$.id`,
`latency` Int16 `json:$.latency`,
`monitorId` String `json:$.monitorId`,
`pageId` String `json:$.pageId`,
`region` LowCardinality(String) `json:$.region`,
`statusCode` Int16 `json:$.statusCode`,
`timestamp` Int64 `json:$.timestamp`,
`url` String `json:$.url`,
`workspaceId` String `json:$.workspaceId`,
`cronTimestamp` Int64 `json:$.cronTimestamp`,
`metadata` String `json:$.metadata`
ENGINE "MergeTree"
ENGINE_SORTING_KEY "monitorId, cronTimestamp"
ENGINE_PARTITION_KEY "toYYYYMM(fromUnixTimestamp64Milli(cronTimestamp))"
Here the notable changes are SK and using LowCardinality(String) for the region, plus avoiding nullable fields. Also added a Partition Key that I think will make sense.
#tb_materialize_until_change_ingest.pipe
NODE mat_node
SQL >
SELECT
id,
latency,
monitorId,
pageId,
toLowCardinality(region) region,
statusCode,
timestamp,
url,
workspaceId,
coalesce(cronTimestamp, 0) cronTimestamp,
coalesce(metadata, '') metadata
FROM ping_response__v2
WHERE fromUnixTimestamp64Milli(cronTimestamp) > '2023-09-05 22:16:00.000'
TYPE materialized
DATASOURCE ping_response__v3
NODE mat_node
SQL >
SELECT
id,
latency,
monitorId,
pageId,
toLowCardinality(region) region,
statusCode,
timestamp,
url,
workspaceId,
coalesce(cronTimestamp, 0) cronTimestamp,
coalesce(metadata, '') metadata
FROM ping_response__v2
WHERE fromUnixTimestamp64Milli(cronTimestamp) <= '2023-09-05 22:16:00.000'
TYPE materialized
DATASOURCE ping_response__v3
Note in these 2 we have to explicitly determine version cause we're moving data from v2 to v3 and, if we left it as ping_response, it would have taken the latest version and tried a circular MV from ping_response__v3 to ping_response__v3.
Note we're sure we have the same data in both DS
And performance changes, from 60MB to 3MB in monitor_list
__ FYI @thibaultleouay, @mxkaske I tried to push a new branch with the changes in ping_response and couldn't:
git push --set-upstream origin 278-tinybird-performance-improvements
ERROR: Permission to openstatusHQ/openstatus.git denied to gnzjgo.
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
No issue, the only change we want to keep in the repo is the new version of ping_response and the code is in the comment. Temporary pipes were just for the migration.
Hey @gnzjgo! Thanks again for your support. We have successfully migrated to ping_response__v3
.
issue to address some performance improvements in the tinybird resources, starting by the Data Sources Sorting Keys, and exploring pipe SQL syntax and Materialized Views.