klahnakoski / mo-sql-parsing

Let's make a SQL parser so we can provide a familiar interface to non-sql datastores!
Mozilla Public License 2.0
261 stars 58 forks source link

Certain queries cause infinite loop and out of memory issues #242

Closed adesso-alex closed 4 months ago

adesso-alex commented 4 months ago

Used version: 10.646.24152 with Python 3.12

Unfortunately, certain more complicated queries freeze the parsing process and consume an immense amount of RAM (30-50GB) until going out of memory. The queries are written in the BigQuery SQL dialect.

Example code:

from mo_sql_parsing import parse

query = """query here"""
parse(query)

Example query 1: SELECT COUNT(DISTINCT CASE WHEN ((orders.entitlement = 'A' AND orders.perm = 'A' AND orders.site = 'X' OR orders.entitlement = 'A' AND orders.perm = 'A, B' AND orders.site = 'X' OR orders.product_id LIKE 'P%' AND orders.site = 'X' OR orders.entitlement IS NULL AND orders.perm = 'A' AND orders.site = 'X' OR orders.site = 'Y' AND (orders.perm = 'Y_A, Y_B' OR orders.perm = 'A') AND orders.product_id <> '12345' OR orders.site = 'Z' AND (orders.perm = 'Z_A, Z_B' OR orders.perm = 'A') AND orders.product_id <> '67890' OR orders.perm LIKE '%A%' AND orders.site = 'W')) AND ((orders.article_url LIKE '%abc%' OR orders.article_url LIKE '%def%' OR orders.article_url LIKE '%ghi%')) THEN orders.key ELSE NULL END) AS orders_conversion, COUNT(DISTINCT CASE WHEN (engagements.engagement_segment = 'engaged') THEN engagements.user_id ELSE NULL END) AS engagements_user, COUNT(DISTINCT CASE WHEN (analytics.event_name LIKE 'page_view' OR analytics.event_name LIKE 'screen_view') THEN analytics.unique_key ELSE NULL END) AS article_views FROM "database"."schema"."analytics" AS analytics LEFT JOIN "database"."schema"."orders" AS orders ON analytics.content_id = orders.content_id AND analytics.site = orders.site AND (DATE(analytics.event_timestamp, 'Region/TimeZone')) = (DATE(orders.event_time, 'Region/TimeZone')) AND ((DATE(orders.event_time, 'Region/TimeZone')) BETWEEN (DATE(COALESCE(TIMESTAMP(DATETIME_ADD(DATETIME(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY, 'Region/TimeZone'), 'Region/TimeZone'), -28, DAY), 'Region/TimeZone'), CURRENT_TIMESTAMP()), 'Region/TimeZone')) AND (DATE(COALESCE(TIMESTAMP(DATETIME_ADD(DATETIME(TIMESTAMP(DATETIME_ADD(DATETIME(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY, 'Region/TimeZone'), 'Region/TimeZone'), -28, DAY), 'Region/TimeZone'), 'Region/TimeZone'), 28, DAY), 'Region/TimeZone'), CURRENT_TIMESTAMP()), 'Region/TimeZone'))) LEFT JOIN "database"."schema"."engagements" AS engagements ON (DATE(analytics.event_timestamp, 'Region/TimeZone')) = engagements.event_date AND analytics.user_id = engagements.user_id AND (engagements.event_date BETWEEN (DATE(COALESCE(TIMESTAMP(DATETIME_ADD(DATETIME(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY, 'Region/TimeZone'), 'Region/TimeZone'), -28, DAY), 'Region/TimeZone'), CURRENT_TIMESTAMP()), 'Region/TimeZone')) AND (DATE(COALESCE(TIMESTAMP(DATETIME_ADD(DATETIME(TIMESTAMP(DATETIME_ADD(DATETIME(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY, 'Region/TimeZone'), 'Region/TimeZone'), -28, DAY), 'Region/TimeZone'), 'Region/TimeZone'), 28, DAY), 'Region/TimeZone'), CURRENT_TIMESTAMP()), 'Region/TimeZone'))) WHERE (LENGTH(analytics.device.web_info.hostname) = 0 OR analytics.device.web_info.hostname IS NULL OR analytics.device.web_info.hostname = 'example.com') AND ((analytics.event_timestamp >= (TIMESTAMP(DATETIME_ADD(DATETIME(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY, 'Region/TimeZone'), 'Region/TimeZone'), -28, DAY), 'Region/TimeZone'))) AND (analytics.event_timestamp < (TIMESTAMP(DATETIME_ADD(DATETIME(TIMESTAMP(DATETIME_ADD(DATETIME(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY, 'Region/TimeZone'), 'Region/TimeZone'), -28, DAY), 'Region/TimeZone'), 'Region/TimeZone'), 28, DAY), 'Region/TimeZone'))))) AND (analytics.site = 'X') AND (analytics.site IN ('X')) LIMIT 1;

Example query 2: SELECT (FORMAT_TIMESTAMP('%Y-%m', analytics.event_timestamp, 'Region/TimeZone')) AS event_timestamp_month, COUNT(DISTINCT CASE WHEN (engagements.engagement_segment = 'engaged') THEN engagements.user_id ELSE NULL END) AS engagements_engaged_user FROM "database"."schema"."analytics" AS analytics LEFT JOIN "database"."schema"."content" AS content ON analytics.content_id = content.id AND analytics.site = content.site LEFT JOIN UNNEST(content.matching_teams) AS _q_0(content_matching_teams) LEFT JOIN "database"."schema"."engagements" AS engagements ON (DATE(analytics.event_timestamp, 'Region/TimeZone')) = engagements.event_date AND analytics.user_id = engagements.user_id AND (engagements.event_date BETWEEN (DATE(COALESCE(TIMESTAMP(DATETIME_ADD(DATETIME(TIMESTAMP_TRUNC(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY, 'Region/TimeZone'), MONTH, 'Region/TimeZone'), 'Region/TimeZone'), -6, MONTH), 'Region/TimeZone'), CURRENT_TIMESTAMP()), 'Region/TimeZone')) AND (DATE(COALESCE(TIMESTAMP(DATETIME_ADD(DATETIME(TIMESTAMP(DATETIME_ADD(DATETIME(TIMESTAMP_TRUNC(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY, 'Region/TimeZone'), MONTH, 'Region/TimeZone'), 'Region/TimeZone'), -6, MONTH), 'Region/TimeZone'), 'Region/TimeZone'), 6, MONTH), 'Region/TimeZone'), CURRENT_TIMESTAMP()), 'Region/TimeZone'))) WHERE ((REGEXP_EXTRACT(analytics.event_params__page_location, '^https?:\\/\\/(?:www\\.)?([^\\/]+)', 1)) IS NULL OR (REGEXP_EXTRACT(analytics.event_params__page_location, '^https?:\\/\\/(?:www\\.)?([^\\/]+)', 1)) = 'example1.com' OR (REGEXP_EXTRACT(analytics.event_params__page_location, '^https?:\\/\\/(?:www\\.)?([^\\/]+)', 1)) = 'example2.com' OR (REGEXP_EXTRACT(analytics.event_params__page_location, '^https?:\\/\\/(?:www\\.)?([^\\/]+)', 1)) = 'example3.com' OR (REGEXP_EXTRACT(analytics.event_params__page_location, '^https?:\\/\\/(?:www\\.)?([^\\/]+)', 1)) = 'example4.com') AND (_q_0.content_matching_teams) = 'Team/Team (Gesamt)' AND (((analytics.event_timestamp) >= (TIMESTAMP(DATETIME_ADD(DATETIME(TIMESTAMP_TRUNC(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY, 'Region/TimeZone'), MONTH, 'Region/TimeZone'), 'Region/TimeZone'), -6, MONTH), 'Region/TimeZone'))) AND (analytics.event_timestamp) < (TIMESTAMP(DATETIME_ADD(DATETIME(TIMESTAMP(DATETIME_ADD(DATETIME(TIMESTAMP_TRUNC(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY, 'Region/TimeZone'), MONTH, 'Region/TimeZone'), 'Region/TimeZone'), -6, MONTH), 'Region/TimeZone'), 'Region/TimeZone'), 6, MONTH), 'Region/TimeZone')))) AND (analytics.site = 'X') AND (analytics.site IN ('X')) AND (1 = 1 AND _q_0.content_matching_teams LIKE 'Team/Team') GROUP BY (FORMAT_TIMESTAMP('%Y-%m', analytics.event_timestamp, 'Region/TimeZone')) ORDER BY event_timestamp_month DESC LIMIT 500;

I would really appreciate help or a fix so that such queries can be parsed as well.

klahnakoski commented 4 months ago

looking...

klahnakoski commented 4 months ago

...definitely a problem. I thought it might go too deep into recursive functions, but that's not happening. It seems to consume large amounts of memory as it finalizes each token. I am looking into this more.

Thank you for this submission

klahnakoski commented 4 months ago

@adesso-alex Thank you very much for pointing out this problem. It has been around for a while.

https://pypi.org/project/mo-sql-parsing/10.651.24172/

klahnakoski commented 4 months ago

problem was accumulation of way-too-many parsing exceptions. Now we keep only the best:

https://github.com/klahnakoski/mo-parsing/compare/8.645.24152...8.650.24172#diff-d43c47c706d2fab7eccae59f837e06c477f9ce84f8e3f48ac345f79b412ab4eaL41

not perfect, but that can be fixed later.

adesso-alex commented 4 months ago

@klahnakoski Thank you very much for the very quick fix!

I will possibly open another ticket on actual parsing errors.