Closed adesso-alex closed 4 months ago
looking...
...definitely a problem. I thought it might go too deep into recursive functions, but that's not happening. It seems to consume large amounts of memory as it finalizes each token. I am looking into this more.
Thank you for this submission
@adesso-alex Thank you very much for pointing out this problem. It has been around for a while.
problem was accumulation of way-too-many parsing exceptions. Now we keep only the best:
not perfect, but that can be fixed later.
@klahnakoski Thank you very much for the very quick fix!
I will possibly open another ticket on actual parsing errors.
Used version: 10.646.24152 with Python 3.12
Unfortunately, certain more complicated queries freeze the parsing process and consume an immense amount of RAM (30-50GB) until going out of memory. The queries are written in the BigQuery SQL dialect.
Example code:
Example query 1:
SELECT COUNT(DISTINCT CASE WHEN ((orders.entitlement = 'A' AND orders.perm = 'A' AND orders.site = 'X' OR orders.entitlement = 'A' AND orders.perm = 'A, B' AND orders.site = 'X' OR orders.product_id LIKE 'P%' AND orders.site = 'X' OR orders.entitlement IS NULL AND orders.perm = 'A' AND orders.site = 'X' OR orders.site = 'Y' AND (orders.perm = 'Y_A, Y_B' OR orders.perm = 'A') AND orders.product_id <> '12345' OR orders.site = 'Z' AND (orders.perm = 'Z_A, Z_B' OR orders.perm = 'A') AND orders.product_id <> '67890' OR orders.perm LIKE '%A%' AND orders.site = 'W')) AND ((orders.article_url LIKE '%abc%' OR orders.article_url LIKE '%def%' OR orders.article_url LIKE '%ghi%')) THEN orders.key ELSE NULL END) AS orders_conversion, COUNT(DISTINCT CASE WHEN (engagements.engagement_segment = 'engaged') THEN engagements.user_id ELSE NULL END) AS engagements_user, COUNT(DISTINCT CASE WHEN (analytics.event_name LIKE 'page_view' OR analytics.event_name LIKE 'screen_view') THEN analytics.unique_key ELSE NULL END) AS article_views FROM "database"."schema"."analytics" AS analytics LEFT JOIN "database"."schema"."orders" AS orders ON analytics.content_id = orders.content_id AND analytics.site = orders.site AND (DATE(analytics.event_timestamp, 'Region/TimeZone')) = (DATE(orders.event_time, 'Region/TimeZone')) AND ((DATE(orders.event_time, 'Region/TimeZone')) BETWEEN (DATE(COALESCE(TIMESTAMP(DATETIME_ADD(DATETIME(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY, 'Region/TimeZone'), 'Region/TimeZone'), -28, DAY), 'Region/TimeZone'), CURRENT_TIMESTAMP()), 'Region/TimeZone')) AND (DATE(COALESCE(TIMESTAMP(DATETIME_ADD(DATETIME(TIMESTAMP(DATETIME_ADD(DATETIME(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY, 'Region/TimeZone'), 'Region/TimeZone'), -28, DAY), 'Region/TimeZone'), 'Region/TimeZone'), 28, DAY), 'Region/TimeZone'), CURRENT_TIMESTAMP()), 'Region/TimeZone'))) LEFT JOIN "database"."schema"."engagements" AS engagements ON (DATE(analytics.event_timestamp, 'Region/TimeZone')) = engagements.event_date AND analytics.user_id = engagements.user_id AND (engagements.event_date BETWEEN (DATE(COALESCE(TIMESTAMP(DATETIME_ADD(DATETIME(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY, 'Region/TimeZone'), 'Region/TimeZone'), -28, DAY), 'Region/TimeZone'), CURRENT_TIMESTAMP()), 'Region/TimeZone')) AND (DATE(COALESCE(TIMESTAMP(DATETIME_ADD(DATETIME(TIMESTAMP(DATETIME_ADD(DATETIME(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY, 'Region/TimeZone'), 'Region/TimeZone'), -28, DAY), 'Region/TimeZone'), 'Region/TimeZone'), 28, DAY), 'Region/TimeZone'), CURRENT_TIMESTAMP()), 'Region/TimeZone'))) WHERE (LENGTH(analytics.device.web_info.hostname) = 0 OR analytics.device.web_info.hostname IS NULL OR analytics.device.web_info.hostname = 'example.com') AND ((analytics.event_timestamp >= (TIMESTAMP(DATETIME_ADD(DATETIME(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY, 'Region/TimeZone'), 'Region/TimeZone'), -28, DAY), 'Region/TimeZone'))) AND (analytics.event_timestamp < (TIMESTAMP(DATETIME_ADD(DATETIME(TIMESTAMP(DATETIME_ADD(DATETIME(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY, 'Region/TimeZone'), 'Region/TimeZone'), -28, DAY), 'Region/TimeZone'), 'Region/TimeZone'), 28, DAY), 'Region/TimeZone'))))) AND (analytics.site = 'X') AND (analytics.site IN ('X')) LIMIT 1;
Example query 2:
SELECT (FORMAT_TIMESTAMP('%Y-%m', analytics.event_timestamp, 'Region/TimeZone')) AS event_timestamp_month, COUNT(DISTINCT CASE WHEN (engagements.engagement_segment = 'engaged') THEN engagements.user_id ELSE NULL END) AS engagements_engaged_user FROM "database"."schema"."analytics" AS analytics LEFT JOIN "database"."schema"."content" AS content ON analytics.content_id = content.id AND analytics.site = content.site LEFT JOIN UNNEST(content.matching_teams) AS _q_0(content_matching_teams) LEFT JOIN "database"."schema"."engagements" AS engagements ON (DATE(analytics.event_timestamp, 'Region/TimeZone')) = engagements.event_date AND analytics.user_id = engagements.user_id AND (engagements.event_date BETWEEN (DATE(COALESCE(TIMESTAMP(DATETIME_ADD(DATETIME(TIMESTAMP_TRUNC(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY, 'Region/TimeZone'), MONTH, 'Region/TimeZone'), 'Region/TimeZone'), -6, MONTH), 'Region/TimeZone'), CURRENT_TIMESTAMP()), 'Region/TimeZone')) AND (DATE(COALESCE(TIMESTAMP(DATETIME_ADD(DATETIME(TIMESTAMP(DATETIME_ADD(DATETIME(TIMESTAMP_TRUNC(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY, 'Region/TimeZone'), MONTH, 'Region/TimeZone'), 'Region/TimeZone'), -6, MONTH), 'Region/TimeZone'), 'Region/TimeZone'), 6, MONTH), 'Region/TimeZone'), CURRENT_TIMESTAMP()), 'Region/TimeZone'))) WHERE ((REGEXP_EXTRACT(analytics.event_params__page_location, '^https?:\\/\\/(?:www\\.)?([^\\/]+)', 1)) IS NULL OR (REGEXP_EXTRACT(analytics.event_params__page_location, '^https?:\\/\\/(?:www\\.)?([^\\/]+)', 1)) = 'example1.com' OR (REGEXP_EXTRACT(analytics.event_params__page_location, '^https?:\\/\\/(?:www\\.)?([^\\/]+)', 1)) = 'example2.com' OR (REGEXP_EXTRACT(analytics.event_params__page_location, '^https?:\\/\\/(?:www\\.)?([^\\/]+)', 1)) = 'example3.com' OR (REGEXP_EXTRACT(analytics.event_params__page_location, '^https?:\\/\\/(?:www\\.)?([^\\/]+)', 1)) = 'example4.com') AND (_q_0.content_matching_teams) = 'Team/Team (Gesamt)' AND (((analytics.event_timestamp) >= (TIMESTAMP(DATETIME_ADD(DATETIME(TIMESTAMP_TRUNC(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY, 'Region/TimeZone'), MONTH, 'Region/TimeZone'), 'Region/TimeZone'), -6, MONTH), 'Region/TimeZone'))) AND (analytics.event_timestamp) < (TIMESTAMP(DATETIME_ADD(DATETIME(TIMESTAMP(DATETIME_ADD(DATETIME(TIMESTAMP_TRUNC(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY, 'Region/TimeZone'), MONTH, 'Region/TimeZone'), 'Region/TimeZone'), -6, MONTH), 'Region/TimeZone'), 'Region/TimeZone'), 6, MONTH), 'Region/TimeZone')))) AND (analytics.site = 'X') AND (analytics.site IN ('X')) AND (1 = 1 AND _q_0.content_matching_teams LIKE 'Team/Team') GROUP BY (FORMAT_TIMESTAMP('%Y-%m', analytics.event_timestamp, 'Region/TimeZone')) ORDER BY event_timestamp_month DESC LIMIT 500;
I would really appreciate help or a fix so that such queries can be parsed as well.