I used a manual binary search looking for the rows that generate the error Cannot query rows larger than 100MB limit. I did not find any single row but any smaller range than the window on the microseconds below does not produce the error. Any greater does. There are ~14k rows in this range but none are greater than 100MB when exported.
This query failed on 2023-05-02:
DELETE
FROM `mlab-staging.tmp_ndt.scamper1` AS target
WHERE date = "2023-04-30" AND EXTRACT(HOUR from parser.Time) = 10 AND EXTRACT(MINUTE from parser.Time) = 50 AND EXTRACT(SECOND from parser.Time) = 53 AND EXTRACT(MILLISECOND from parser.Time) BETWEEN 7 AND 654
# This identifies all rows that don't match rows to preserve.
AND NOT EXISTS (
# This creates list of rows to preserve, based on key and priority.
WITH keep AS (
SELECT * EXCEPT(row_number) FROM (
SELECT
date, id, parser.Time,
ROW_NUMBER() OVER (
PARTITION BY id, date
ORDER BY parser.Time DESC
) row_number
FROM (
SELECT * FROM `mlab-staging.tmp_ndt.scamper1`
WHERE date = "2023-04-30" AND EXTRACT(HOUR from parser.Time) = 10 AND EXTRACT(MINUTE from parser.Time) = 50 AND EXTRACT(SECOND from parser.Time) = 53 AND EXTRACT(MILLISECOND from parser.Time) BETWEEN 7 AND 654
)
)
WHERE row_number = 1
)
SELECT * FROM keep
# This matches against the keep table based on keys. Sufficient select keys must be
# used to distinguish the preferred row from the others.
WHERE
target.id = keep.id AND
target.parser.Time = keep.Time
)
As of 2023-05-30, mlab-staging.tmp_ndt no longer includes the tmp tables for 04-30. However, the mlab-oti.tmp_ndt dataset still includes the 04-30 table data. The tmp dates can be found with a query like:
SELECT
CAST(partition_id AS DATE FORMAT "YYYYMMDD") AS date,
total_rows AS total
FROM `mlab-oti.tmp_ndt.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'scamper1' AND partition_id != "__NULL__" AND total_rows > 0
ORDER BY date
Root cause is still unknown.
So, I'm wondering a few things:
can we have a simpler dedup query than this one?
is the reported error a misnomer, implying some other issue or corrupted underlying data? (e.g. would reprocessing resolve it or recreate the same conditions?)
what else could be special about the content of the measurements in the range 2023-04-30 10:50:53 between 7 and 654 ms?
I used a manual binary search looking for the rows that generate the error
Cannot query rows larger than 100MB limit.
I did not find any single row but any smaller range than the window on the microseconds below does not produce the error. Any greater does. There are ~14k rows in this range but none are greater than 100MB when exported.This query failed on 2023-05-02:
As of 2023-05-30,
mlab-staging.tmp_ndt
no longer includes the tmp tables for 04-30. However, themlab-oti.tmp_ndt
dataset still includes the 04-30 table data. The tmp dates can be found with a query like:Root cause is still unknown.
So, I'm wondering a few things: