stephen-soltesz opened this issue 5 years ago (status: Open)
Attempting to delete from the table generates an error:

```
UPDATE or DELETE statement over table mlab-staging.base_tables.ndt would affect rows in the streaming buffer, which is not supported
```
It may be necessary to suspend gardener, drain the streaming buffer, and then run the delete operation.

Attempting to exclude the unpartitioned data using hints from https://cloud.google.com/bigquery/docs/querying-partitioned-tables#ingestion-time_partitioned_tables_unpartitioned_partition (i.e. `WHERE _PARTITIONTIME IS NOT NULL`) does not seem to work either.
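For reference, per the partitioned-tables doc linked above, rows still in the streaming buffer of an ingestion-time partitioned table sit in the special `__UNPARTITIONED__` partition and have a NULL `_PARTITIONTIME`, so a query of this shape (a sketch, not verified against these tables) should show how much data is still buffered:

```sql
-- Sketch: count rows still sitting in the streaming buffer
-- (the __UNPARTITIONED__ partition), which have NULL _PARTITIONTIME.
SELECT COUNT(*) AS buffered_rows
FROM `mlab-staging.base_tables.ndt`
WHERE _PARTITIONTIME IS NULL
```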
The original report is partly correct and partly mistaken. Mistaken because the daily pipelines still add `archive-<project>` task filenames every day. Correct because none of those archives should be earlier than, say, 2017, when we turned on the external pipeline.

Recent attempts to delete some rows in staging and oti tables may have inadvertently deleted some `archive-<project>` records. Eventually these will be overwritten by gardener.
mlab-staging will add task filenames that are not present in archive-measurement-lab, due to scraper collection from canary machines. These feed the staging pipeline and are distinct from the archives collected by mlab-oti (different timestamps and possibly different contents).
In practice this likely means that staging will have more unique task_filenames in base_tables (includes canary archive) than the batch tables (includes only archive-measurement-lab).
Looking at the task_filenames per day in BQ and GCS, there appear to be a significant number of GCS files that are not being processed.
```sql
SELECT REGEXP_EXTRACT(task_filename, r"/(201812[0-9]{2})T") AS day,
       COUNT(DISTINCT task_filename) AS tasks
FROM `mlab-oti.batch.ndt`
WHERE _PARTITIONTIME >= TIMESTAMP("2018-12-01")
  AND _PARTITIONTIME <= TIMESTAMP("2018-12-31")
GROUP BY day
ORDER BY day
```
Note: the gsutil listing of a month is quite slow (8-10 min).

```shell
gsutil ls gs://archive-measurement-lab/ndt/2018/12/*/* | tr '/T' ' ' | awk '{print $7}' | sort | uniq -c
```
For example:
Row | Date | BQ batch | GCS files |
---|---|---|---|
1 | 20181201 | 42042 | 42413 |
2 | 20181202 | 40355 | 40758 |
3 | 20181203 | 41274 | 41682 |
4 | 20181204 | 43073 | 43476 |
5 | 20181205 | 42954 | 43351 |
6 | 20181206 | 43286 | 43709 |
If gardener+parsers are deterministic, then I would not expect the content of the batch tables to be out of sync between mlab-staging and mlab-oti -- both of which source only the gs://archive-measurement-lab bucket (except possibly days currently being processed).
However, we find ~83 rows that have different task_filename counts.
```sql
SELECT day, staging.tasks AS staging_tasks, oti.tasks AS oti_tasks
FROM (
  SELECT REGEXP_EXTRACT(task_filename, r"/(20[0-9]{6})T") AS day,
         COUNT(DISTINCT task_filename) AS tasks
  FROM `mlab-staging.batch.ndt`
  GROUP BY day
) AS staging
FULL JOIN (
  SELECT REGEXP_EXTRACT(task_filename, r"/(20[0-9]{6})T") AS day,
         COUNT(DISTINCT task_filename) AS tasks
  FROM `mlab-oti.batch.ndt`
  GROUP BY day
) AS oti USING(day)
WHERE staging.tasks != oti.tasks
ORDER BY day
```
Row | day | staging_tasks | oti_tasks |
---|---|---|---|
9 | 20120110 | 504 | 497 |
10 | 20120113 | 539 | 546 |
11 | 20120115 | 502 | 504 |
12 | 20120122 | 697 | 685 |
13 | 20120126 | 473 | 478 |
... | | | |
49 | 20170601 | 10833 | 10746 |
50 | 20170602 | 11034 | 11075 |
51 | 20170603 | 10045 | 9977 |
52 | 20170604 | 10014 | 9980 |
53 | 20170605 | 10584 | 10532 |
... | | | |
Possible causes to investigate:
Other TODO:
PR #128 preserves the datastore tasks that end in error.
Staging now shows 3 failed sanity checks.
ACTUALLY - looks like this is garbage. Looking at the actual queue entries it looks like the tasks are all recent. Summary page now shows only a few sidestream queues with old entries, and drilling down shows the summaries are wrong.
https://pantheon.corp.google.com/cloudtasks?project=mlab-staging&organizationId=433637338589&tab=PUSH

Looks like many staging task queues are simply stalled with very old tasks. ALL the sidestream queues have 31-day-old tasks. 3 of the 4 disco queues also have 31-day-old tasks. ndt queue 10 is 33 days old; ndt queues 15 and 4 are 31 days old.
FYI, I'm wiping the sandbox partitions, since they will never match after file skipping.
```shell
for m in $(seq -w 12); do
  for d in $(seq -w 31); do
    bq --project=mlab-sandbox rm -tf mlab-sandbox:base_tables.ndt\$2009$m$d
  done
done &
```
Canary machines and mlab4s generate archives that end up in the staging base_tables dataset, but will never appear in the gs://archive-measurement-lab bucket. When comparing staging batch to base_tables data, these machines should be treated skeptically if the task_filenames are not found in the public archive -- https://github.com/m-lab/operator/blob/master/plsync/canary_machines.txt
PROGRESS:
```sql
SELECT base.task_filename, base.test_count AS base_count,
       batch.test_count AS batch_count, base.max_pt
FROM (
  SELECT base, batch FROM (
    SELECT "base" AS tbl, task_filename, MAX(parse_time) AS max_pt,
           COUNT(DISTINCT test_id) AS test_count
    FROM `mlab-staging.base_tables.ndt`
    WHERE _PARTITIONTIME = TIMESTAMP("2018-01-28")
    GROUP BY task_filename
  ) AS base
  FULL JOIN (
    SELECT "batch" AS tbl, task_filename, MAX(parse_time) AS max_pt,
           COUNT(DISTINCT test_id) AS test_count
    FROM `mlab-staging.batch.ndt`
    WHERE _PARTITIONTIME = TIMESTAMP("2018-01-28")
    GROUP BY task_filename
  ) AS batch
  ON base.task_filename = batch.task_filename
)
WHERE batch.task_filename IS NULL OR base.task_filename IS NULL
   OR batch.test_count != base.test_count
```
This shows that the task_filenames that are MISSING from batch are the old tasks from archive-mlab-staging, before we switched over to reprocessing from mlab-oti. I've saved the output in mlab-staging:batch.gardener_mismatches.
Of roughly 20K tasks, there are perhaps a hundred or so that have different numbers of test_ids. It seems that the recent run of the batch parser ALWAYS produced more rows than the previous run in October.
It seems strange that the base_tables.ndt partition has a mix of rows from two different parse times. Stumped about why that would be. Gardener copies (and replaces) the whole partition.
Note: on macOS, the bq rm command seems to consume a lot of CPU.
Much easier:

```sql
DELETE FROM `mlab-sandbox.base_tables.ndt`
WHERE _PARTITIONTIME > "2009-01-01"
```
Confirmed that staging 2017-12-14 follows same pattern as 2018-01-28. Approximately 1800 (of 22k) task files have more tests than the corresponding task in base. 268 task files in base are from archive-mlab-staging, and have no corresponding tasks in batch.
Planning to delete these two partitions, so that they can be overwritten the next time through. Or perhaps better to copy the corresponding partition from batch?
This is very fast (note the `$` are escaped):

```shell
bq --project=mlab-staging cp batch.ndt\$20180128 base_tables.ndt\$20180128
bq --project=mlab-staging cp batch.ndt\$20171214 base_tables.ndt\$20171214
```
```sql
SELECT * FROM (
  SELECT _PARTITIONTIME AS pt, COUNT(DISTINCT task_filename) AS tasks,
         MAX(parse_time) AS max_pt, COUNT(DISTINCT test_id) AS test_count
  FROM `mlab-staging.base_tables.ndt`
  GROUP BY pt ORDER BY pt
)
WHERE max_pt < "2019-01-01"
```
After the previous table copies, there are now 20 partitions that are older than Jan 1.
pt | tasks | max_pt | test_count |
---|---|---|---|
2009-09-01 0:00:00 | 56 | 2018-12-06 18:00:17 | 21201 |
2012-01-26 0:00:00 | 490 | 2018-11-02 15:08:01 | 234369 |
2012-01-30 0:00:00 | 576 | 2018-11-02 15:12:57 | 265926 |
2012-02-03 0:00:00 | 579 | 2018-11-02 15:17:16 | 284447 |
2012-02-08 0:00:00 | 669 | 2018-11-02 15:23:28 | 273683 |
2012-02-09 0:00:00 | 1016 | 2018-11-02 15:26:30 | 302805 |
2012-02-10 0:00:00 | 976 | 2018-11-02 15:26:38 | 333726 |
2012-02-11 0:00:00 | 976 | 2018-11-02 15:29:08 | 361018 |
2012-02-12 0:00:00 | 957 | 2018-11-02 15:31:23 | 344205 |
2012-02-13 0:00:00 | 755 | 2018-11-02 15:30:42 | 292035 |
2012-02-14 0:00:00 | 822 | 2018-11-02 15:33:37 | 269299 |
2012-02-15 0:00:00 | 895 | 2018-11-02 15:35:49 | 269324 |
2012-02-16 0:00:00 | 996 | 2018-11-02 15:36:54 | 292708 |
2012-02-17 0:00:00 | 909 | 2018-11-02 15:38:28 | 298351 |
2012-02-18 0:00:00 | 1025 | 2018-11-02 15:41:36 | 355710 |
2012-02-19 0:00:00 | 910 | 2018-11-02 15:41:42 | 333828 |
2012-02-20 0:00:00 | 940 | 2018-11-02 15:43:38 | 341403 |
2012-02-21 0:00:00 | 982 | 2018-11-02 15:47:24 | 309417 |
2012-02-22 0:00:00 | 1223 | 2018-11-02 15:49:44 | 320443 |
2012-02-23 0:00:00 | 1050 | 2018-11-02 15:53:39 | 300352 |
2009-09-01:

task_filename | base_count | batch_count |
---|---|---|
ndt/2009/09/01/20090901T000000Z-mlab3-nuq01-ndt-0002.tgz | 358 | 353 |
ndt/2009/09/01/20090901T000000Z-mlab3-nuq01-ndt-0003.tgz | 110 | null |
ndt/2009/09/01/20090901T000000Z-mlab3-nuq01-ndt-0001.tgz | 100 | 215 |
The last time mlab3-nuq01-ndt-0003 was processed (in staging) was Feb 4, by "aef-etl--batch--parser-20190111t043232-ggbx":

```
2019-02-04 11:07:46.000 GMT 2019/02/04 11:07:46 task.go:138: Processed 1287 files, 0 nil data, 211 rows committed, 0 failed, from gs://archive-measurement-lab/ndt/2009/09/01/20090901T000000Z-mlab3-nuq01-ndt-0003.tgz into ndt_20090901
```
Is it possible that the dedup is eliminating some rows? Are we including the task_filename in the dedup query? No, we don't include the task file in the dedup, so if a test file appears in two task files, it should be deduped. But mlab3-nuq01-ndt-0003 is mostly unique files that do not appear in 0000, 0001, or 0002. The last file in 0002 also appears in 0003, and should be deduped.
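For context, a dedup of this shape (a sketch, not the actual gardener query; table name chosen from the examples above) keeps only the most recently parsed row per test_id, regardless of which task file it came from:

```sql
-- Sketch: keep one row per test_id, preferring the latest parse_time.
-- Any test_id that appears in two task files collapses to a single row,
-- which is why task_filename plays no role in the dedup.
SELECT * EXCEPT(rn)
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY test_id ORDER BY parse_time DESC) AS rn
  FROM `mlab-staging.batch.ndt_20090901`
)
WHERE rn = 1
```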
Other interesting notes about ndt/2009/09/01/20090901T000000Z-mlab3-nuq01-ndt-*: The 0002 and 0003 files seem to contain two copies of most of the snaplog files. Ah - that is why one file is seen in both 0002 and 0003 - one copy in each file. 0000 and 0001 contain only a single copy of each file. Argh.
Looking at the number of unique files in each task file:

```
0.snaps  708  708  50603
1.snaps  568  568  40754
2.snaps  358  358  25649
3.snaps  211  211  15176
```

However, in the base_tables, we only see a sensible number of rows for 0002:

```
ndt/2009/09/01/20090901T000000Z-mlab3-nuq01-ndt-0001.tgz  100  215
ndt/2009/09/01/20090901T000000Z-mlab3-nuq01-ndt-0002.tgz  358  353
ndt/2009/09/01/20090901T000000Z-mlab3-nuq01-ndt-0003.tgz  110  null
```
The recent Feb 4 batch logs show a better story:

```
Processed 2250 files, 0 nil data, 703 rows committed, 0 failed, from gs://archive-measurement-lab/ndt/2009/09/01/20090901T000000Z-mlab3-nuq01-ndt-0000.tgz into ndt_20090901
Processed 1790 files, 0 nil data, 568 rows committed, 0 failed, from gs://archive-measurement-lab/ndt/2009/09/01/20090901T000000Z-mlab3-nuq01-ndt-0001.tgz into ndt_20090901
Processed 2062 files, 0 nil data, 358 rows committed, 0 failed, from gs://archive-measurement-lab/ndt/2009/09/01/20090901T000000Z-mlab3-nuq01-ndt-0002.tgz into ndt_20090901
Processed 1287 files, 0 nil data, 211 rows committed, 0 failed, from gs://archive-measurement-lab/ndt/2009/09/01/20090901T000000Z-mlab3-nuq01-ndt-0003.tgz into ndt_20090901
```
So we seem to be losing a lot of rows between the etl batch output, and the deduplication. Could be related either to a bug in the dedup, or running the dedup prematurely, before the streaming buffer has been committed?
Clear evidence that the deduplication step is dropping rows that should not be dropped. Could this be due to the streaming buffer not being empty?
```sql
SELECT dedup.test_id AS dedup, dedup.parse_time AS dpt,
       template.test_id AS template, template.parse_time AS tpt
FROM (
  SELECT dedup, template FROM (
    SELECT "dedup" AS tbl, task_filename, test_id, parse_time
    FROM `mlab-sandbox.batch.ndt`
    WHERE _PARTITIONTIME = TIMESTAMP("2010-08-24")
      AND task_filename = "gs://archive-measurement-lab/ndt/2010/08/24/20100824T000000Z-mlab2-ams02-ndt-0007.tgz"
  ) AS dedup
  FULL JOIN (
    SELECT "template" AS tbl, task_filename, test_id, parse_time
    FROM `mlab-sandbox.batch.ndt_20100824`
    WHERE task_filename = "gs://archive-measurement-lab/ndt/2010/08/24/20100824T000000Z-mlab2-ams02-ndt-0007.tgz"
  ) AS template
  ON dedup.test_id = template.test_id
)
ORDER BY template
```
Verified that the dedup is apparently running before the streaming buffer is empty. There seems to be a bug in the meta-data. Added a workaround to recheck repeatedly for 10 minutes.
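The recheck workaround can be sketched as a generic polling loop (an assumption about the shape of the fix, not the actual Go code in dedup.go): only declare the buffer drained after it has looked empty for a full quiet period, and give up after a deadline.

```shell
# Sketch: poll a "buffer empty?" check command until it has reported empty
# for a full quiet period, or return failure once the deadline passes.
wait_for_empty() {
  check_cmd="$1"; deadline="$2"; interval="$3"; quiet="$4"
  start=$(date +%s); empty_since=""
  while true; do
    now=$(date +%s)
    # Give up once the overall deadline has passed.
    if [ $((now - start)) -ge "$deadline" ]; then return 1; fi
    if $check_cmd; then
      # Record when the buffer first looked empty.
      if [ -z "$empty_since" ]; then empty_since=$now; fi
      # Declare success only after a full quiet period of emptiness.
      if [ $((now - empty_since)) -ge "$quiet" ]; then return 0; fi
    else
      empty_since=""   # Buffer flapped back to non-empty; reset the clock.
    fi
    sleep "$interval"
  done
}
```

In practice the check command might be something like `bq show --format=prettyjson mlab-sandbox:batch.ndt_20100204 | grep -vq streamingBuffer` (hypothetical; the `streamingBuffer` field appears in bq's JSON table metadata while the buffer is active).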
Copied mlab-sandbox.batch.ndt to mlab-sandbox.base_tables.ndt to give it a more or less clean start.
Actually, copying from batch isn't what we want. There are partitions left over from the bug that restarted gardener on the wrong date when DateSkip > 0. That bug was just fixed today, so we need to purge all the old rows, from prior to about 20:00 UTC today.
So:

```sql
DELETE FROM `mlab-sandbox.base_tables.ndt`
WHERE parse_time < "2019-02-12 20:00:00"
```

We need to do this for all tables in both batch and base_tables.
An extreme case of the StreamingBuffer bug: the buffer state flaps twice within 30 seconds.

```
2019/02/13 15:00:04 dedup.go:50: Wait for table ready mlab-sandbox:batch.ndt_20100204
2019/02/13 15:00:26 dedup.go:82: Streaming buffer was empty for 5.05115389s but now it is not! mlab-sandbox:batch.ndt_20100204
2019/02/13 15:00:32 dedup.go:82: Streaming buffer was empty for 11.086396524s but now it is not! mlab-sandbox:batch.ndt_20100204
```
Well, the extended StreamingBuffer check has not fully resolved the problem. A recent failure in staging is ndt 2017-06-02. It has thousands of task files that exist in base_tables but are missing or have fewer tests in batch (after dedup). However, in the templates table there are NO missing or short task files, so this is almost certainly a premature dedup.
GOOD news is that this seems to be the only recent failure in staging. The previous sanity-check failure was 24 hours earlier. But in prod there have been quite a few:

```
{gs://archive-measurement-lab/ndt/2009/09/01/: Finishing, Thu16:04:49.6, Q:, ...
{gs://archive-measurement-lab/ndt/2009/09/02/: Finishing, Thu15:57:22.6, Q:,
{gs://archive-measurement-lab/ndt/2009/09/03/: Finishing, Thu15:57:06.5, Q:,
{gs://archive-measurement-lab/ndt/2009/09/04/: Finishing, Thu16:16:39.2, Q:,
{gs://archive-measurement-lab/ndt/2009/09/05/: Finishing, Thu15:55:09.3, Q:,
{gs://archive-measurement-lab/ndt/2009/09/07/: Finishing, Thu16:12:37.6, Q:,
{gs://archive-measurement-lab/ndt/2009/09/08/: Finishing, Thu16:00:39.6, Q:,
{gs://archive-measurement-lab/ndt/2012/02/22/: Finishing, Thu21:52:04.9, Q:,
{gs://archive-measurement-lab/ndt/2012/03/02/: Finishing, Thu22:06:46.0, Q:,
{gs://archive-measurement-lab/ndt/2012/03/03/: Finishing, Thu22:06:08.6, Q:,
```
The sanity check failed at 18:21 UTC (today, Thursday). Searching the ETL logs for one of the missing task files (gs://archive-measurement-lab/ndt/2017/06/02/20170602T000000Z-mlab1-akl01-ndt-0008.tgz) shows that the file finished processing at 15:49 with 426 rows committed. Processing appears to have completed (based on the gardener log) at 16:50. So the stabilizing time frame is reasonable - about 90 minutes - but obviously there were rows missing when the deduplication occurred.
The median stabilizing time around 1800 UTC was about 1.5 hours, and the 1% settling time was only 1 hour. So the stabilizing time for this particular example was close to the median.
There were relatively few detected streaming buffer anomalies (for NDT) around this time - just 6 between midnight and 1800 UTC.
Checking the streaming buffer anomalies - the longest observed in the past 24 hours is still less than 5 minutes. But the 10 minute recheck certainly seems to be inadequate.
Going to try cranking that up to 60 minutes, and see what happens.
Looks like there were no additional failures overnight in sandbox or staging.
Status: Production sidestream is the furthest behind. On Mar 1, the oldest parse was 31 days old, and there were 34 partitions older than Feb 12. On Mar 4, the oldest parse is 33 days old, there are 7 partitions older than Feb 12, and there are 385 that are more than 2 weeks old.
However, the oldest partitions are a single partition from 2009 (8/24), and a handful from 2016, 2017, and 2018. Gardener is currently processing sidestream for 2012, so it will be a few days before it gets to 2016, and more than a week before it gets through 2018 again.
The oldest partitions are:
Partitiontime | min parse_time |
---|---|
2009-08-24 00:00:00 UTC | 2019-01-29 17:42:01.226909 UTC |
2016-01-30 00:00:00 UTC | 2019-02-04 10:58:58.914211 UTC |
2016-05-24 00:00:00 UTC | 2019-02-04 17:30:46.178752 UTC |
2016-07-16 00:00:00 UTC | 2019-02-04 20:08:56.944876 UTC |
2017-03-18 00:00:00 UTC | 2019-02-05 18:23:47.356516 UTC |
2018-04-11 00:00:00 UTC | 2019-02-09 02:33:01.483836 UTC |
2018-04-12 00:00:00 UTC | 2019-02-09 03:31:53.859690 UTC |
2019-02-03 00:00:00 UTC | 2019-02-13 17:28:38.700786 UTC |
2010-01-22 00:00:00 UTC | 2019-02-14 10:10:52.047556 UTC |
2010-10-08 00:00:00 UTC | 2019-02-15 02:27:44.422802 UTC |
2012-10-26 00:00:00 UTC | 2019-02-16 23:00:20.816268 UTC |
2012-11-02 00:00:00 UTC | 2019-02-16 23:27:16.925187 UTC |
2012-11-03 00:00:00 UTC | 2019-02-16 23:42:05.513576 UTC |
Another relevant detail: in most partitions with the oldest data, there are also rows with much more recent parse times. The partition whose newest rows are oldest is 2016-01-30 00:00:00 UTC | 13 | 312 | 2019-02-04 (min) | 2019-02-04 (max). All other partitions with old rows also have rows processed roughly 2 weeks later. So apparently this is partly due to having old rows in the template table (because it wasn't deleted last iteration), and partly due to some old rows making it through the dedup for some reason.
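Partitions with mixed parse runs can be surfaced with a query along these lines (a sketch; the sidestream table name is an assumption based on the status note above):

```sql
-- Sketch: partitions whose oldest and newest rows were parsed more than
-- a day apart, i.e. a mix of rows from two different gardener iterations.
SELECT _PARTITIONTIME AS pt,
       MIN(parse_time) AS oldest_parse,
       MAX(parse_time) AS newest_parse
FROM `mlab-oti.base_tables.sidestream`
GROUP BY pt
HAVING TIMESTAMP_DIFF(MAX(parse_time), MIN(parse_time), DAY) >= 1
ORDER BY oldest_parse
```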
Also see #124. In addition to https://github.com/m-lab/etl-gardener/issues/121 -- gardener may fail to make progress because some task_filenames use old source buckets, e.g. archive-mlab-staging, scraper-mlab-staging, etc.
The new gardener source only uses archive-measurement-lab. The old source rows should be deleted. e.g.
Find bad rows:
Delete those rows.
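A sketch of the shape these two steps could take (project/table chosen for illustration; and per the top of this issue, the DELETE will fail while affected rows are in the streaming buffer):

```sql
-- Find bad rows: which deprecated source buckets still appear, and how many rows.
SELECT REGEXP_EXTRACT(task_filename, r"^gs://([^/]+)/") AS bucket,
       COUNT(*) AS row_count
FROM `mlab-staging.base_tables.ndt`
WHERE NOT STARTS_WITH(task_filename, "gs://archive-measurement-lab/")
GROUP BY bucket;

-- Delete those rows.
DELETE FROM `mlab-staging.base_tables.ndt`
WHERE NOT STARTS_WITH(task_filename, "gs://archive-measurement-lab/")
```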