m-lab / etl-gardener

Gardener provides services for maintaining and reprocessing mlab data.
Apache License 2.0
13 stars 5 forks source link

Failure to dedup: "Cannot query rows larger than 100MB limit." #427

Open stephen-soltesz opened 1 year ago

stephen-soltesz commented 1 year ago

I used a manual binary search looking for the rows that generate the error Cannot query rows larger than 100MB limit. I did not find any single row but any smaller range than the window on the microseconds below does not produce the error. Any greater does. There are ~14k rows in this range but none are greater than 100MB when exported.

This query failed on 2023-05-02:

DELETE
FROM `mlab-staging.tmp_ndt.scamper1` AS target
WHERE date = "2023-04-30" AND EXTRACT(HOUR from parser.Time) = 10 AND EXTRACT(MINUTE from parser.Time) = 50 AND EXTRACT(SECOND from parser.Time) = 53 AND EXTRACT(MILLISECOND from parser.Time) BETWEEN 7 AND 654
# This identifies all rows that don't match rows to preserve.
AND NOT EXISTS (
  # This creates list of rows to preserve, based on key and priority.
  WITH keep AS (
  SELECT * EXCEPT(row_number) FROM (
    SELECT
      date, id, parser.Time,
      ROW_NUMBER() OVER (
        PARTITION BY id, date
        ORDER BY parser.Time DESC
      ) row_number
      FROM (
        SELECT * FROM `mlab-staging.tmp_ndt.scamper1`
        WHERE date = "2023-04-30" AND EXTRACT(HOUR from parser.Time) = 10 AND EXTRACT(MINUTE from parser.Time) = 50 AND EXTRACT(SECOND from parser.Time) = 53 AND EXTRACT(MILLISECOND from parser.Time) BETWEEN 7 AND 654
      )
    )
    WHERE row_number = 1
  )
  SELECT * FROM keep
  # This matches against the keep table based on keys.  Sufficient select keys must be
  # used to distinguish the preferred row from the others.
  WHERE
    target.id = keep.id AND
    target.parser.Time = keep.Time 
)

As of 2023-05-30, mlab-staging.tmp_ndt no longer includes the tmp tables for 04-30. However, the mlab-oti.tmp_ndt dataset still includes the 04-30 table data. The tmp dates can be found with a query like:

SELECT
  CAST(partition_id AS DATE FORMAT "YYYYMMDD") AS date,
  total_rows AS total
FROM `mlab-oti.tmp_ndt.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'scamper1' AND partition_id != "__NULL__" AND total_rows > 0
ORDER BY date

Root cause is still unknown.

So, I'm wondering a few things: