mozilla / ichnaea

Mozilla Ichnaea
http://location.services.mozilla.com
Apache License 2.0

run map tile generation in server environment #840

Closed willkg closed 3 years ago

willkg commented 5 years ago

Previously, map tile generation (ichnaea/scripts/datamap.py) would get run like this via cron on the scheduler node:

17 0 * * * location ICHNAEA_CFG=/etc/location.ini /data/ichnaea/bin/location_map --create --concurrency=<%= @processorcount.to_i %> --upload --output=<%= @tiles_directory %> > /var/log/location/cronjobs/map.log 2>&1

We don't currently have cron running in the Docker container. We should probably move this to a Celery task or some other scheduler.
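One possible shape for that (a sketch only; the app setup and generate_datamap task name here are hypothetical, not the actual ichnaea code) would be a Celery beat entry mirroring the old cron time:

# Sketch only: a Celery beat entry that would run a hypothetical datamap
# task nightly at 00:17 UTC, matching the old cron schedule.
from celery import Celery
from celery.schedules import crontab

app = Celery("ichnaea")  # stand-in for the project's real Celery app

app.conf.beat_schedule = {
    "generate-datamap-tiles": {
        "task": "ichnaea.data.tasks.generate_datamap",  # hypothetical task name
        "schedule": crontab(minute=17, hour=0),
    },
}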

hannosch commented 5 years ago

This ran on the Celery scheduler, as that machine was otherwise underutilized. The datamaps stuff benefits greatly from running parts of it in parallel processes and probably runs a couple of hours. It dumps a large result from MySQL locally on disk, post-processes it, creates half a million or so tiny PNG files, and then uploads the new and changed files to an Amazon S3 bucket (not every file is uploaded again, since S3 charges a lot more per PUT request than per GET request, so it is cheaper to diff first). That bucket is exposed via CloudFront as a website and gets used as a tile layer in the Mapbox code.

The script takes quite a long time, needs the large SQL dump and all of those files on the same disk, and does its own multiprocessing. That made it a bad fit for the normal Celery setup or any other task queue.
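A minimal sketch of that diff-and-upload idea (not the actual ichnaea code; the bucket name, key layout, and the MD5-vs-ETag comparison are assumptions):

# Sketch only: upload a tile to S3 only when its content differs from what
# is already stored, to avoid paying for unnecessary PUT requests.
# The bucket name and key layout are placeholders, not the real config.
import hashlib
from pathlib import Path

import boto3

s3 = boto3.client("s3")
BUCKET = "example-datamap-tiles"  # hypothetical bucket

def existing_etags(prefix="tiles/"):
    """Map key -> ETag for objects already in the bucket."""
    etags = {}
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            etags[obj["Key"]] = obj["ETag"].strip('"')
    return etags

def sync_tiles(tile_dir, prefix="tiles/"):
    remote = existing_etags(prefix)
    for path in Path(tile_dir).rglob("*.png"):
        key = prefix + path.relative_to(tile_dir).as_posix()
        local_md5 = hashlib.md5(path.read_bytes()).hexdigest()
        # For small, non-multipart uploads the S3 ETag equals the body's MD5.
        if remote.get(key) != local_md5:
            s3.upload_file(str(path), BUCKET, key, ExtraArgs={"ContentType": "image/png"})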

willkg commented 4 years ago

This should get figured out, but we think it'll be fine if we push it off until after the new infrastructure is up and running.

willkg commented 4 years ago

@ckolos said he can run this trivially.

However, I changed how ichnaea works in the Docker image and it no longer gets installed as a Python library, so the location_map script isn't "built".

We need to fix how that gets run in the ichnaea Docker image, and then we can hand it off to @ckolos and he can operationalize it.

jwhitlock commented 3 years ago

I've checked that the script runs in the dev environment from a Docker instance, from a copy of stage's datamap tables. It took 70-80 minutes to generate tiles, and triggered my machine to reboot at least once. I think production is at least 10x larger, but I doubt it will take 10x longer.

I also tried uploading to a development S3 bucket. It took 80 minutes to generate tiles, then failed due to lack of credentials. I'm adding some logic to check that the caller can write to the bucket before generating the tiles. The script is now running with upload (no concurrency); I'll update with how long it takes.

Update: The upload completed in 295 minutes, about 5 hours.
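A sketch of that pre-flight write check (the bucket name and marker key are placeholders, and the real script may check this differently):

# Sketch: verify S3 write access before spending hours rendering tiles,
# by writing and deleting a tiny marker object. Key name is a placeholder.
import boto3
from botocore.exceptions import ClientError

def can_write_to_bucket(bucket, key="tiles/.write-test"):
    s3 = boto3.client("s3")
    try:
        s3.put_object(Bucket=bucket, Key=key, Body=b"")
        s3.delete_object(Bucket=bucket, Key=key)
    except ClientError:
        return False
    return True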

My next steps are to add more logging, tune metrics, and document the data flow (issue #1404). Then I'll run it manually in stage, from the admin or cron server. Once we have the process figured out, we can operationalize it to run periodically.

jwhitlock commented 3 years ago

I've merged changes and deployed to staging.

The staging box has 1 processor versus the 6 I'm using locally. I also need to set LOCAL_DEV_ENV=1 and LOGGING_LEVEL=DEBUG to see the progress messages during debugging. It may be useful to add a --verbose command-line argument to set up verbose logging, and to make sure the LOCAL_DEV_ENV flag doesn't propagate where I don't want it.
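A sketch of what that flag could look like (the real datamap.py may wire its logging up differently):

# Sketch: a --verbose flag that turns on DEBUG logging for the script,
# instead of relying on the LOCAL_DEV_ENV / LOGGING_LEVEL environment knobs.
import argparse
import logging

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Generate and upload datamap tiles")
    parser.add_argument("--verbose", action="store_true", help="log progress at DEBUG level")
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    logging.basicConfig(level=logging.DEBUG if args.verbose else logging.INFO)
    logging.getLogger(__name__).debug("verbose logging enabled")

if __name__ == "__main__":
    main()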

The first time I ran it, it was silent until an out-of-memory stack trace:

Exception in thread Thread-10:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/threading.py", line 954, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.9/threading.py", line 892, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.9/multiprocessing/pool.py", line 513, in _handle_workers
    cls._maintain_pool(ctx, Process, processes, pool, inqueue,
  File "/usr/local/lib/python3.9/multiprocessing/pool.py", line 337, in _maintain_pool
    Pool._repopulate_pool_static(ctx, Process, processes, pool,
  File "/usr/local/lib/python3.9/multiprocessing/pool.py", line 326, in _repopulate_pool_static
    w.start()
  File "/usr/local/lib/python3.9/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/usr/local/lib/python3.9/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/usr/local/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/local/lib/python3.9/multiprocessing/popen_fork.py", line 66, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

Then it continued to be silent until the shell disconnected.

On the second run, I was able to get more information with verbose logging:

The KeyboardInterrupt from Ctrl-c was propagated as an exception.

There are warnings online about memory cleanup in threads. This seems specific to boto3; I'll see if there is something explicit we should be doing.

jwhitlock commented 3 years ago

The low-level client also shows growing memory use, and it ran out of memory after about 6k uploads. The fix that worked was recreating the S3 client after the large requests that read the existing objects, and then periodically recreating it every 1,000 uploads.
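Roughly, that pattern looks like this (a sketch; the upload loop and threshold mirror the description above rather than the exact code):

# Sketch: periodically recreate the boto3 S3 client during a long upload
# loop to keep its memory use bounded. The upload details are placeholders.
import boto3

def upload_tiles(bucket, tiles, recreate_every=1000):
    s3 = boto3.client("s3")
    for count, (path, key) in enumerate(tiles, start=1):
        s3.upload_file(str(path), bucket, key)
        if count % recreate_every == 0:
            # Drop the old client and start fresh to release accumulated memory.
            s3 = boto3.client("s3")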

The stage run succeeded on an instance with 1 CPU and about 2 GB of memory.

Rendering and uploading can be done concurrently, so it should be faster on a more capable instance.

jwhitlock commented 3 years ago

This code is now failing in production, with out-of-memory errors when creating the quadtrees. There are 202 million exported rows in production, versus 73 million in stage, and the largest segment is 151 million rows.

One possibility is to export multiple CSV files, for example with 50 million rows per file, and then create more than 4 quadtrees, in the hope that the merges are more memory-efficient. I'll use a similar strategy to the previous round, testing changes on the production server while building the PR.
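A rough sketch of the chunked export (the row source, file naming, and chunk size are placeholders for illustration):

# Sketch: split a large row export into multiple CSV files of at most
# rows_per_file rows each, so each intermediate quadtree build stays small.
import csv
from pathlib import Path

def export_in_chunks(rows, out_dir, rows_per_file=50_000_000):
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    handle, writer, count, file_index = None, None, 0, 0
    for row in rows:
        if writer is None or count >= rows_per_file:
            if handle:
                handle.close()
            file_index += 1
            handle = open(out_dir / f"map_{file_index:04d}.csv", "w", newline="")
            writer = csv.writer(handle)
            count = 0
        writer.writerow(row)  # e.g. (lat, lon) pairs
        count += 1
    if handle:
        handle.close()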

jwhitlock commented 3 years ago

The datamap generation and upload completed in production, on a similarly sized instance to stage (1 CPU, 2 GB of memory).

There was one failed deletion; the script printed this message and continued:

2021-01-21T00:52:23.668681Z [error    ] Exception while syncing: Error deleting: [{'Key': 'tiles/12/927/1537.png', 'Code': 'InternalError', 'Message': 'We encountered an internal error. Please try again.'}] [ichnaea.scripts.datamap] 
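For context, that error list has the shape boto3's delete_objects response uses for per-key failures; a sketch of how it gets surfaced (bucket and batching details are placeholders):

# Sketch: batch-delete S3 keys and log any per-key errors S3 reports, which
# is where a list like [{'Key': ..., 'Code': 'InternalError', ...}] comes from.
# S3 accepts up to 1,000 keys per delete_objects call.
import logging

import boto3

logger = logging.getLogger("ichnaea.scripts.datamap")

def delete_keys(bucket, keys):
    s3 = boto3.client("s3")
    response = s3.delete_objects(
        Bucket=bucket,
        Delete={"Objects": [{"Key": key} for key in keys]},
    )
    errors = response.get("Errors", [])
    if errors:
        logger.error("Error deleting: %s", errors)
    return len(response.get("Deleted", [])), len(errors)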

These are transparent images, so they look very similar on a white background. Here's the old top-level tile:

[image: datamap-before]

And the new one:

[image: datamap-after]

jwhitlock commented 3 years ago

We've configured the Celery scheduler node on stage to run the script nightly. It failed to complete, and the log messages are not being captured. The script appears to run fine manually, so it is probably a configuration issue between the script and the logging infrastructure. Closer!

jwhitlock commented 3 years ago

@Micheletto fixed logging by invoking the script via systemd-cat, which gets the logs into syslog. This showed the problem was that cron wasn't allocating a TTY, so docker run -it was failing. The script runs correctly without -it.
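The resulting crontab entry then looks roughly like this (image name, paths, and script options are illustrative, not the exact production entry):

# Sketch of the nightly crontab entry: no -i/-t flags, since cron has no TTY,
# and systemd-cat forwards the container's output to syslog/journald.
17 0 * * * root systemd-cat -t ichnaea-map docker run --rm --env-file /etc/ichnaea.env mozilla/ichnaea python ichnaea/scripts/datamap.py --create --upload --output=/data/tiles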

In stage, it takes almost 3 hours, with these details:

{
  "updated": "2021-02-23T02:51:50+00:00",
  "export_duration_s": 318.312,
  "row_count": 63406942,
  "csv_count": 49,
  "quadtree_duration_s": 120.355,
  "csv_converted_count": 49,
  "intermediate_quadtree_count": 48,
  "quadtree_count": 4,
  "merge_duration_s": 6.161,
  "tile_count": 186415,
  "render_duration_s": 5256.392,
  "sync_duration_s": 3498.054,
  "tiles_unchanged": 53179,
  "tile_new": 28,
  "tile_changed": 133208,
  "tile_deleted": 150,
  "tile_failed": 0
}

The last step is to deploy to production, which is very similar to stage, so I don't expect problems.

jwhitlock commented 3 years ago

We've pushed the changes to production. The previous run was over a month ago, with these details:

{
  "updated": "2021-01-21T00:54:12+00:00",
  "export_duration_s": 888.223,
  "row_count": 202133184,
  "csv_count": 237,
  "quadtree_duration_s": 424.951,
  "csv_converted_count": 237,
  "intermediate_quadtree_count": 237,
  "quadtree_count": 4,
  "merge_duration_s": 21.767,
  "tile_count": 254122,
  "render_duration_s": 7640.339,
  "sync_duration_s": 7397.963,
  "tile_new": 34804,
  "tile_changed": 219318,
  "tile_deleted": 478338,
  "tile_failed": 1
}

If it runs overnight, we're done with this issue.

willkg commented 3 years ago

You can do it! I believe in you!

jwhitlock commented 3 years ago

We had a successful run last night. The details are:

{
  "updated": "2021-03-05T02:37:06+00:00",
  "export_duration_s": 555.373,
  "row_count": 178715347,
  "csv_count": 213,
  "quadtree_duration_s": 233.658,
  "csv_converted_count": 213,
  "intermediate_quadtree_count": 213,
  "quadtree_count": 4,
  "merge_duration_s": 13.35,
  "tile_count": 247431,
  "render_duration_s": 4539.381,
  "sync_duration_s": 3002.373,
  "tiles_unchanged": 12782,
  "tile_new": 2155,
  "tile_changed": 232494,
  "tile_deleted": 8847,
  "tile_failed": 0
}

I'm a little sad that the total duration is not in the online summary, but I may be one of the few people who care. On the log server, we have a summary log entry (jsonPayload.Type="canonical-log-line" logName="(full path)/docker.ichnaea-map") that also has the total duration: 8,411 seconds, or just under 2 1/2 hours. This is about a minute longer than the sum of the individual durations (export_duration_s + quadtree_duration_s + merge_duration_s + render_duration_s + sync_duration_s, or 8,344 seconds). Another hack is knowing that the job starts 17 minutes after midnight UTC, so you can subtract 00:17 from the updated timestamp to see how long it took.
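For reference, the arithmetic on the summary above (field values copied from the JSON; the timedelta is just for a readable total):

# Sum the per-phase durations from the summary JSON above; this gives the
# 8,344 seconds mentioned, about a minute short of the logged 8,411 s total.
from datetime import timedelta

summary = {
    "export_duration_s": 555.373,
    "quadtree_duration_s": 233.658,
    "merge_duration_s": 13.35,
    "render_duration_s": 4539.381,
    "sync_duration_s": 3002.373,
}

total = sum(summary.values())
print(total, timedelta(seconds=round(total)))  # 8344.135, i.e. 2:19:04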