m-lab / etl-gardener

Gardener provides services for maintaining and reprocessing mlab data.
Apache License 2.0

Gardener fails to make progress #123

Open stephen-soltesz opened 5 years ago

stephen-soltesz commented 5 years ago

Also see #124. In addition to https://github.com/m-lab/etl-gardener/issues/121 -- Gardener may fail to make progress because some task_filenames reference old source buckets, e.g. archive-mlab-staging, scraper-mlab-staging, etc.

The new gardener source uses only archive-measurement-lab, so rows referencing the old sources should be deleted. For example:

Find bad rows:

SELECT
    "ndt" AS dataset,
    COUNT(*) as count,
    COUNTIF(REGEXP_CONTAINS(task_filename, r'gs://(archive|scraper)-mlab-staging')) as delete_count
FROM   `mlab-staging.base_tables.ndt`   

Delete those rows.

DELETE FROM `mlab-staging.base_tables.ndt`
WHERE REGEXP_CONTAINS(task_filename, r'gs://(archive|scraper)-mlab-staging')
stephen-soltesz commented 5 years ago

Attempting to delete from the table generates an error:

UPDATE or DELETE statement over table mlab-staging.base_tables.ndt would affect rows in the streaming buffer, which is not supported

It may be necessary to suspend gardener, let the streaming buffer drain, and then run the delete operation.
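
One way to check for that condition before retrying -- a minimal sketch, not something from this issue -- is to inspect the table metadata with the bq CLI, which reports a streamingBuffer section only while a buffer exists:

# Sketch: only run the DELETE once the table no longer reports a streaming buffer.
if bq show --format=prettyjson mlab-staging:base_tables.ndt | grep -q '"streamingBuffer"'; then
  echo "streaming buffer still present; suspend gardener and retry later"
else
  bq query --use_legacy_sql=false \
    'DELETE FROM `mlab-staging.base_tables.ndt`
     WHERE REGEXP_CONTAINS(task_filename, r"gs://(archive|scraper)-mlab-staging")'
fi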

stephen-soltesz commented 5 years ago

Attempting to exclude the unpartitioned data using hints from https://cloud.google.com/bigquery/docs/querying-partitioned-tables#ingestion-time_partitioned_tables_unpartitioned_partition, i.e. adding WHERE _PARTITIONTIME IS NOT NULL, does not seem to work either. The attempted form is sketched below.
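
Roughly the form that was attempted (a sketch reconstructed from the description above, not a verbatim query):

bq query --use_legacy_sql=false \
  'DELETE FROM `mlab-staging.base_tables.ndt`
   WHERE _PARTITIONTIME IS NOT NULL
     AND REGEXP_CONTAINS(task_filename, r"gs://(archive|scraper)-mlab-staging")'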

stephen-soltesz commented 5 years ago
  1. The original report is partly correct, and partly mistaken. Partly mistaken because the daily pipelines still add archive-<project> task filenames every day. Partly correct because none of those archives should be earlier than, say, 2017 when we turned on the external pipeline.

    Recent attempts to delete some rows in staging and oti tables may have inadvertently deleted some archive-<project> records. Eventually these will be overwritten by gardener.

  2. mlab-staging will add task_filenames that are not present in archive-measurement-lab, due to scraper collection from canary machines. These feed the staging pipeline and are distinct archives from those collected by mlab-oti (different timestamps, and possibly different contents).

    In practice this likely means that staging will have more unique task_filenames in base_tables (which include the canary archives) than in the batch tables (which include only archive-measurement-lab); a quick check is sketched below.
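
A quick way to compare the two datasets (a sketch, not a query from this issue):

bq query --use_legacy_sql=false \
  'SELECT "base_tables" AS dataset, COUNT(DISTINCT task_filename) AS tasks
   FROM `mlab-staging.base_tables.ndt`
   UNION ALL
   SELECT "batch", COUNT(DISTINCT task_filename)
   FROM `mlab-staging.batch.ndt`'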

stephen-soltesz commented 5 years ago

Looking at the task_filenames per day in BQ and GCS, there appear to be a significant number of GCS files that are not being processed.

  SELECT REGEXP_EXTRACT(task_filename, r"/(201812[0-9]{2})T") as day, COUNT(distinct task_filename) as tasks
  FROM  `mlab-oti.batch.ndt`                                          
  WHERE _PARTITIONTIME >= TIMESTAMP("2018-12-01") AND _PARTITIONTIME <= TIMESTAMP("2018-12-31")
  GROUP BY day
  ORDER BY day

Note: the gsutil list of a month is quite slow (8-10min).

gsutil ls gs://archive-measurement-lab/ndt/2018/12/*/* | tr '/T' ' ' | awk '{print $7}' | sort | uniq -c

For example:

Row | Date | BQ batch | GCS files
1 | 20181201 | 42042 | 42413
2 | 20181202 | 40355 | 40758
3 | 20181203 | 41274 | 41682
4 | 20181204 | 43073 | 43476
5 | 20181205 | 42954 | 43351
6 | 20181206 | 43286 | 43709
stephen-soltesz commented 5 years ago

If gardener+parsers are deterministic, then I would not expect the content of the batch tables to be out of sync between mlab-staging & mlab-oti -- both of which source only the gs://archive-measurement-lab bucket (except possibly for days currently being processed).

However, we find ~83 days that have different task_filename counts.

SELECT day, staging.tasks as staging_tasks, oti.tasks as oti_tasks
FROM (
  SELECT REGEXP_EXTRACT(task_filename, r"/(20[0-9]{6})T") as day, COUNT(distinct task_filename) as tasks
  FROM  `mlab-staging.batch.ndt`                                          
  GROUP BY day
) as staging FULL JOIN (
  SELECT REGEXP_EXTRACT(task_filename, r"/(20[0-9]{6})T") as day, COUNT(distinct task_filename) as tasks
  FROM  `mlab-oti.batch.ndt`                                          
  GROUP BY day
) as oti USING(day)
WHERE staging.tasks != oti.tasks
ORDER BY day
Row | day | staging_tasks | oti_tasks
9 | 20120110 | 504 | 497
10 | 20120113 | 539 | 546
11 | 20120115 | 502 | 504
12 | 20120122 | 697 | 685
13 | 20120126 | 473 | 478

...
49 | 20170601 | 10833 | 10746
50 | 20170602 | 11034 | 11075
51 | 20170603 | 10045 | 9977
52 | 20170604 | 10014 | 9980
53 | 20170605 | 10584 | 10532
...

gfr10598 commented 5 years ago

Possible causes to investigate:

  1. task queue is using memory quota and silently dropping tasks
  2. tasks are expiring (exacerbated by empty queue bug)
  3. empty queue bug - is it really fixed now?
  4. parser encountering errors, but reporting success to the task queue.
    • We know there are ephemeral GCS errors. Are they handled well enough?
  5. Are bigquery queries reliable? (we are fairly confident about these, but maybe we missed something)

Other TODO:

  1. try disabling task expiration and see if that helps.
  2. ensure that the datastore status is maintained for failed copies.
  3. re-enable the template table deletion??
gfr10598 commented 5 years ago

PR #128 preserves the datastore tasks that end in error.

gfr10598 commented 5 years ago

PR #128 was deployed to staging last night. Also deployed to sandbox, which shows that ALL partitions are failing to update, because of a previous change that submits only 1/5th of the files for parsing.

Staging now shows 3 failed sanity checks.

gfr10598 commented 5 years ago

ACTUALLY -- it looks like this is garbage. Looking at the actual queue entries, the tasks are all recent. The summary page now shows only a few sidestream queues with old entries, and drilling down shows the summaries are wrong.

https://pantheon.corp.google.com/cloudtasks?project=mlab-staging&organizationId=433637338589&tab=PUSH Looks like many staging task queues are simply stalled with very old tasks. ALL the sidestream queues have 31-day-old tasks. Three of the four disco queues also have 31-day-old tasks; ndt queue 10 is 33 days old, and ndt queues 15 and 4 are 31 days old.

gfr10598 commented 5 years ago

FYI, I'm wiping the sandbox partitions, since they will never match after file skipping:

for m in $(seq -w 12); do for d in $(seq -w 31); do bq --project=mlab-sandbox rm -tf mlab-sandbox:base_tables.ndt\$2009$m$d; done; done &

stephen-soltesz commented 5 years ago

Canary machines and mlab4s generate archives that end up in the staging base_tables dataset, but will never appear in the gs://archive-measurement-lab bucket. When comparing staging batch to base_tables data, task_filenames from these machines should be treated skeptically if they are not found in the public archive (a filter is sketched below) -- https://github.com/m-lab/operator/blob/master/plsync/canary_machines.txt
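
A hypothetical filter along those lines: mlab4 archives can be dropped by machine name in the task_filename, while canary machines would need their own list taken from canary_machines.txt.

bq query --use_legacy_sql=false \
  'SELECT COUNT(DISTINCT task_filename) AS non_mlab4_tasks
   FROM `mlab-staging.base_tables.ndt`
   WHERE NOT REGEXP_CONTAINS(task_filename, r"-mlab4-")'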

gfr10598 commented 5 years ago

PROGRESS:

select base.task_filename, base.test_count as base_count, batch.test_count as batch_count, base.max_pt from (select base, batch from
(select "base" as tbl, task_filename, max(parse_time) as max_pt, count(DISTINCT test_id) as test_count
from `mlab-staging.base_tables.ndt`
where _partitiontime = timestamp("2018-01-28")
group by task_filename
) as base
full join
(select "batch" as tbl, task_filename, max(parse_time) as max_pt, count(DISTINCT test_id) as test_count
from `mlab-staging.batch.ndt`
where _partitiontime = timestamp("2018-01-28")
group by task_filename
) as batch
on base.task_filename = batch.task_filename)
where batch.task_filename is null or base.task_filename is null
or batch.test_count != base.test_count

This shows that the task_filenames that are MISSING from batch are the old tasks from archive-mlab-staging, before we switched over to reprocessing from mlab-oti. I've saved the output in mlab-staging:batch.gardener_mismatches.

Of roughly 20K tasks, there are perhaps a hundred or so that have different numbers of test_ids. It seems that the recent run of the batch parser ALWAYS produced more rows than the previous run in October.

It seems strange that the base_tables.ndt partition has a mix of rows from two different parse times. Stumped about why that would be. Gardener copies (and replaces) the whole partition.
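
A sketch of a query to look at that mix of parse times within the partition (assumed shape; the hourly truncation is just for readability):

bq query --use_legacy_sql=false \
  'SELECT TIMESTAMP_TRUNC(parse_time, HOUR) AS parse_hour, COUNT(*) AS row_count
   FROM `mlab-staging.base_tables.ndt`
   WHERE _PARTITIONTIME = TIMESTAMP("2018-01-28")
   GROUP BY parse_hour
   ORDER BY parse_hour'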

gfr10598 commented 5 years ago

Note: on macOS, the bq rm command seems to consume a lot of CPU. Much easier: DELETE FROM `mlab-sandbox.base_tables.ndt` WHERE _PARTITIONTIME > "2009-01-01"

gfr10598 commented 5 years ago

Confirmed that staging 2017-12-14 follows the same pattern as 2018-01-28. Approximately 1800 (of 22k) task files have more tests than the corresponding task in base. 268 task files in base are from archive-mlab-staging and have no corresponding tasks in batch.

Planning to delete these two partitions, so that they can be overwritten the next time through. Or perhaps better to copy the corresponding partition from batch?

gfr10598 commented 5 years ago

This is very fast (note the $ are escaped):

bq --project=mlab-staging cp batch.ndt\$20180128 base_tables.ndt\$20180128
bq --project=mlab-staging cp batch.ndt\$20171214 base_tables.ndt\$20171214

gfr10598 commented 5 years ago

SELECT * FROM (
  SELECT _PARTITIONTIME AS pt, COUNT(DISTINCT task_filename) AS tasks,
         MAX(parse_time) AS max_pt, COUNT(DISTINCT test_id) AS test_count
  FROM `mlab-staging.base_tables.ndt`
  GROUP BY pt
  ORDER BY pt
)
WHERE max_pt < "2019-01-01"

After the previous table copies, there are now 20 partitions that are older than Jan 1.

pt | tasks | max_pt | test_count
2009-09-01 0:00:00 | 56 | 2018-12-06 18:00:17 | 21201
2012-01-26 0:00:00 | 490 | 2018-11-02 15:08:01 | 234369
2012-01-30 0:00:00 | 576 | 2018-11-02 15:12:57 | 265926
2012-02-03 0:00:00 | 579 | 2018-11-02 15:17:16 | 284447
2012-02-08 0:00:00 | 669 | 2018-11-02 15:23:28 | 273683
2012-02-09 0:00:00 | 1016 | 2018-11-02 15:26:30 | 302805
2012-02-10 0:00:00 | 976 | 2018-11-02 15:26:38 | 333726
2012-02-11 0:00:00 | 976 | 2018-11-02 15:29:08 | 361018
2012-02-12 0:00:00 | 957 | 2018-11-02 15:31:23 | 344205
2012-02-13 0:00:00 | 755 | 2018-11-02 15:30:42 | 292035
2012-02-14 0:00:00 | 822 | 2018-11-02 15:33:37 | 269299
2012-02-15 0:00:00 | 895 | 2018-11-02 15:35:49 | 269324
2012-02-16 0:00:00 | 996 | 2018-11-02 15:36:54 | 292708
2012-02-17 0:00:00 | 909 | 2018-11-02 15:38:28 | 298351
2012-02-18 0:00:00 | 1025 | 2018-11-02 15:41:36 | 355710
2012-02-19 0:00:00 | 910 | 2018-11-02 15:41:42 | 333828
2012-02-20 0:00:00 | 940 | 2018-11-02 15:43:38 | 341403
2012-02-21 0:00:00 | 982 | 2018-11-02 15:47:24 | 309417
2012-02-22 0:00:00 | 1223 | 2018-11-02 15:49:44 | 320443
2012-02-23 0:00:00 | 1050 | 2018-11-02 15:53:39 | 300352

gfr10598 commented 5 years ago

2009-09-01:

task_filename | base | batch
ndt/2009/09/01/20090901T000000Z-mlab3-nuq01-ndt-0002.tgz | 358 | 353
ndt/2009/09/01/20090901T000000Z-mlab3-nuq01-ndt-0003.tgz | 110 | null
ndt/2009/09/01/20090901T000000Z-mlab3-nuq01-ndt-0001.tgz | 100 | 215

gfr10598 commented 5 years ago

Last time mlab3-nuq01-ndt-0003 was processed (in staging) was Feb 4 by "aef-etl--batch--parser-20190111t043232-ggbx"

2019-02-04 11:07:46.000 GMT 2019/02/04 11:07:46 task.go:138: Processed 1287 files, 0 nil data, 211 rows committed, 0 failed, from gs://archive-measurement-lab/ndt/2009/09/01/20090901T000000Z-mlab3-nuq01-ndt-0003.tgz into ndt_20090901

Is it possible that the dedup is eliminating some rows? Are we including the task_filename in the dedup query? Well, no, we don't include the task file in the dedup, so if a file appears in two task files it should be deduped. But mlab3-nuq01-ndt-0003 is mostly unique files that do not appear in 0000, 0001, or 0002. The last file in 0002 also appears in 0003, and should be deduped.
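
For reference, a minimal sketch of a test_id-keyed dedup of the template table (assumed shape, not necessarily the exact query gardener runs; the template table name is taken from the log line above): it keeps one row per test_id and ignores task_filename, so a test that appears in two task files survives only once.

bq query --use_legacy_sql=false \
  'SELECT * EXCEPT (rn)
   FROM (
     SELECT *, ROW_NUMBER() OVER (PARTITION BY test_id ORDER BY parse_time DESC) AS rn
     FROM `mlab-staging.batch.ndt_20090901`
   )
   WHERE rn = 1'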

https://pantheon.corp.google.com/logs/viewer?interval=JUMP_TO_TIME&project=mlab-staging&organizationId=433637338589&minLogLevel=0&expandAll=false&timestamp=2019-02-04T11:00:13.000000000Z&customFacets=&limitCustomFacetWidth=true&scrollTimestamp=2019-02-04T11:07:46.000000000Z&resource=gae_app%2Fmodule_id%2Fetl-batch-parser&logName=projects%2Fmlab-staging%2Flogs%2Fappengine.googleapis.com%252Fstdout&logName=projects%2Fmlab-staging%2Flogs%2Fappengine.googleapis.com%252Fstderr&logName=projects%2Fmlab-staging%2Flogs%2Fappengine.googleapis.com%252Fnginx.request&filters=text:20090901T000000Z-mlab3-nuq01-ndt-000&dateRangeStart=2019-02-04T10:00:13.000Z&dateRangeEnd=2019-02-04T12:00:13.000Z

gfr10598 commented 5 years ago

Other interesting notes about ndt/2009/09/01/20090901T000000Z-mlab3-nuq01-ndt-*: The 0002 and 0003 files seem to contain two copies of most of the snaplog files. Ah - that is why one file is seen in both 0002 and 0003 - one copy in each file. 0000 and 0001 contain only a single copy of each file. Argh.

Looking at the number of unique files in each task file:

0.snaps, 708 708 50603
1.snaps, 568 568 40754
2.snaps, 358 358 25649
3.snaps, 211 211 15176

However, in the base_tables, we only see a sensible number of rows in 0002:

ndt/2009/09/01/20090901T000000Z-mlab3-nuq01-ndt-0001.tgz | 100 | 215
ndt/2009/09/01/20090901T000000Z-mlab3-nuq01-ndt-0002.tgz | 358 | 353
ndt/2009/09/01/20090901T000000Z-mlab3-nuq01-ndt-0003.tgz | 110 | null
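
One way to count the unique member files inside each of those archives without downloading them to disk (a sketch; the object paths are the ones discussed above):

for n in 0000 0001 0002 0003; do
  # stream the tarball, list member names, and count distinct names
  gsutil cat "gs://archive-measurement-lab/ndt/2009/09/01/20090901T000000Z-mlab3-nuq01-ndt-$n.tgz" \
    | tar -tzf - | sort -u | wc -l
done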

The recent Feb 4 batch logs show a better story:

Processed 2250 files, 0 nil data, 703 rows committed, 0 failed, from gs://archive-measurement-lab/ndt/2009/09/01/20090901T000000Z-mlab3-nuq01-ndt-0000.tgz into ndt_20090901
Processed 1790 files, 0 nil data, 568 rows committed, 0 failed, from gs://archive-measurement-lab/ndt/2009/09/01/20090901T000000Z-mlab3-nuq01-ndt-0001.tgz into ndt_20090901
Processed 2062 files, 0 nil data, 358 rows committed, 0 failed, from gs://archive-measurement-lab/ndt/2009/09/01/20090901T000000Z-mlab3-nuq01-ndt-0002.tgz into ndt_20090901
Processed 1287 files, 0 nil data, 211 rows committed, 0 failed, from gs://archive-measurement-lab/ndt/2009/09/01/20090901T000000Z-mlab3-nuq01-ndt-0003.tgz into ndt_20090901

So we seem to be losing a lot of rows between the etl batch output and the deduplication. This could be due either to a bug in the dedup, or to running the dedup prematurely, before the streaming buffer has been committed?

gfr10598 commented 5 years ago

Clear evidence that the deduplication step is dropping rows that should not be dropped. Could this be due to the streaming buffer not being empty?

SELECT dedup.test_id AS dedup, dedup.parse_time AS dpt, template.test_id AS template, template.parse_time AS tpt
FROM (
  SELECT dedup, template FROM (
    SELECT "dedup" AS tbl, task_filename, test_id, parse_time
    FROM `mlab-sandbox.batch.ndt`
    WHERE _PARTITIONTIME = TIMESTAMP("2010-08-24")
      AND task_filename = "gs://archive-measurement-lab/ndt/2010/08/24/20100824T000000Z-mlab2-ams02-ndt-0007.tgz"
  ) AS dedup
  FULL JOIN (
    SELECT "template" AS tbl, task_filename, test_id, parse_time
    FROM `mlab-sandbox.batch.ndt_20100824`
    WHERE task_filename = "gs://archive-measurement-lab/ndt/2010/08/24/20100824T000000Z-mlab2-ams02-ndt-0007.tgz"
  ) AS template
  ON dedup.test_id = template.test_id
)
ORDER BY template

gfr10598 commented 5 years ago

Verified that the dedup is apparently running before the streaming buffer is empty. There seems to be a bug in the metadata. Added a workaround that rechecks repeatedly for 10 minutes.

Copied mlab-sandbox.batch.ndt to mlab-sandbox.base_tables.ndt to give it a more or less clean start.
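
A rough shell analogue of that workaround (the real change is in gardener's Go code; one of the sandbox template tables from the query above is used here as an example): poll the table metadata for up to 10 minutes and only proceed once no streamingBuffer section is reported.

for _ in $(seq 1 60); do
  # stop as soon as bq show no longer reports a streamingBuffer section
  bq show --format=prettyjson mlab-sandbox:batch.ndt_20100824 | grep -q '"streamingBuffer"' || break
  sleep 10   # 60 polls x 10s = 10 minutes
done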

gfr10598 commented 5 years ago

Actually, copying from batch isn't what we want. There are partitions left over from the bug that restarted gardener on the wrong date when DateSkip > 0. That bug was just fixed today, so we need to purge all the old rows, from prior to about 20:00 UTC today.

So: DELETE FROM `mlab-sandbox.base_tables.ndt` WHERE parse_time < "2019-02-12 20:00:00"

This needs to be done for all tables in both batch and base_tables (a sketch follows).
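
One way that might be scripted (a sketch; the table listing and the cutoff are taken from the comment above, and it assumes the streaming buffers have already drained):

for ds in batch base_tables; do
  # NR>2 skips the bq ls header lines
  for tbl in $(bq ls -n 10000 --project_id=mlab-sandbox "$ds" | awk 'NR>2 {print $1}'); do
    bq query --use_legacy_sql=false \
      "DELETE FROM \`mlab-sandbox.$ds.$tbl\` WHERE parse_time < '2019-02-12 20:00:00'"
  done
done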

gfr10598 commented 5 years ago

An extreme case of the StreamingBuffer bug. Basically, the buffer state flaps twice within 30 seconds.

2019/02/13 15:00:04 dedup.go:50: Wait for table ready mlab-sandbox:batch.ndt_20100204
2019/02/13 15:00:26 dedup.go:82: Streaming buffer was empty for 5.05115389s but now it is not! mlab-
2019/02/13 15:00:32 dedup.go:82: Streaming buffer was empty for 11.086396524s but now it is not! mlab-sandbox:batch.ndt_20100204

gfr10598 commented 5 years ago

Well, the extended StreamingBuffer check has not fully resolved the problem. A recent failure in staging is ndt 2017-06-02. It has thousands of task files that exist in base_tables but are missing or have fewer tests in batch (after dedup). However, in the template table there are NO missing or short task files, so this is almost certainly a premature dedup.

GOOD news is that this seems to be the only recent failure in staging. The previous sanity-check failure was 24 hours earlier. But in prod there have been quite a few:

{gs://archive-measurement-lab/ndt/2009/09/01/: Finishing, Thu16:04:49.6, Q:, ...
{gs://archive-measurement-lab/ndt/2009/09/02/: Finishing, Thu15:57:22.6, Q:,
{gs://archive-measurement-lab/ndt/2009/09/03/: Finishing, Thu15:57:06.5, Q:,
{gs://archive-measurement-lab/ndt/2009/09/04/: Finishing, Thu16:16:39.2, Q:,
{gs://archive-measurement-lab/ndt/2009/09/05/: Finishing, Thu15:55:09.3, Q:,
{gs://archive-measurement-lab/ndt/2009/09/07/: Finishing, Thu16:12:37.6, Q:,
{gs://archive-measurement-lab/ndt/2009/09/08/: Finishing, Thu16:00:39.6, Q:,
{gs://archive-measurement-lab/ndt/2012/02/22/: Finishing, Thu21:52:04.9, Q:,
{gs://archive-measurement-lab/ndt/2012/03/02/: Finishing, Thu22:06:46.0, Q:,
{gs://archive-measurement-lab/ndt/2012/03/03/: Finishing, Thu22:06:08.6, Q:,

The sanity check failed at 18:21 UTC (today, Thursday). Searching the ETL logs for one of the missing task files (gs://archive-measurement-lab/ndt/2017/06/02/20170602T000000Z-mlab1-akl01-ndt-0008.tgz) shows that the file finished processing at 15:49 with 426 rows committed. Processing appears to have completed (based on the gardener log) at 16:50. So the stabilizing time frame is reasonable - about 90 minutes - but obviously there were rows missing when the deduplication occurred.
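
For reference, one way to do that kind of log search from the command line (a sketch; the resource and module names are guessed from the log viewer link earlier in this thread and may differ):

gcloud logging read \
  'resource.type="gae_app" AND resource.labels.module_id="etl-batch-parser" AND textPayload:"20170602T000000Z-mlab1-akl01-ndt-0008.tgz"' \
  --project=mlab-staging --freshness=1d --limit=20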

The median stabilizing time around 1800 UTC was about 1.5 hours, and the 1% settling time was only 1 hour. So the stabilizing time for this particular example was close to the median.

There were relatively few detected streaming buffer anomalies (for NDT) around this time - just 6 between midnight and 1800 UTC.

gfr10598 commented 5 years ago

Checking the streaming buffer anomalies - the longest observed in the past 24 hours is still less than 5 minutes. But the 10 minute recheck certainly seems to be inadequate.

Going to try cranking that up to 60 minutes, and see what happens.

gfr10598 commented 5 years ago

Looks like there were no additional failures overnight in sandbox or staging.

gfr10598 commented 5 years ago

Status: Production sidestream is the furthest behind. On Mar 1, the oldest parse was 31 days old, and there were 34 partitions older than Feb 12. On Mar 4, the oldest parse is 33 days old, there are 7 partitions older than Feb 12, and there are 385 that are more than 2 weeks old.

However, the oldest partitions are a single partition from 2009 (8/24), and a handful from 2016, 2017, and 2018. Gardener is currently processing sidestream for 2012, so it will be a few days before it gets to 2016, and more than a week before it gets through 2018 again.

The oldest partitions are:

Partitiontime | min parse_time
2009-08-24 00:00:00 UTC | 2019-01-29 17:42:01.226909 UTC
2016-01-30 00:00:00 UTC | 2019-02-04 10:58:58.914211 UTC
2016-05-24 00:00:00 UTC | 2019-02-04 17:30:46.178752 UTC
2016-07-16 00:00:00 UTC | 2019-02-04 20:08:56.944876 UTC
2017-03-18 00:00:00 UTC | 2019-02-05 18:23:47.356516 UTC
2018-04-11 00:00:00 UTC | 2019-02-09 02:33:01.483836 UTC
2018-04-12 00:00:00 UTC | 2019-02-09 03:31:53.859690 UTC
2019-02-03 00:00:00 UTC | 2019-02-13 17:28:38.700786 UTC
2010-01-22 00:00:00 UTC | 2019-02-14 10:10:52.047556 UTC
2010-10-08 00:00:00 UTC | 2019-02-15 02:27:44.422802 UTC
2012-10-26 00:00:00 UTC | 2019-02-16 23:00:20.816268 UTC
2012-11-02 00:00:00 UTC | 2019-02-16 23:27:16.925187 UTC
2012-11-03 00:00:00 UTC | 2019-02-16 23:42:05.513576 UTC
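
The kind of query behind a list like this (a sketch; the production sidestream table name is an assumption):

bq query --use_legacy_sql=false \
  'SELECT _PARTITIONTIME AS partitiontime, MIN(parse_time) AS min_parse_time
   FROM `mlab-oti.base_tables.sidestream`
   GROUP BY partitiontime
   ORDER BY min_parse_time
   LIMIT 20'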

gfr10598 commented 5 years ago

Another relevant detail: in most partitions with the oldest data, there are also rows with much more recent parse times. The partition whose newest rows are oldest is 2016-01-30 00:00:00 UTC | 13 | 312 | 2019-02-04 (min) | 2019-02-04 (max). All other partitions with old rows also have rows processed roughly 2 weeks later. So apparently this is partly due to old rows in the template table (because it wasn't deleted last iteration), and partly due to some old rows making it through the dedup for some reason.