Closed: mnussbaumer closed this issue 4 years ago
@mnussbaumer very open to collaborating on this.
My thinking was that the compressed data and the `:dets` file would be created on a dev machine and then, using whatever release strategy you use, that file or files would be transferred to any production machine. But of course you are right - that won't help if there's a data update, and dynamically updating data is a design goal. Perhaps maintaining downloadable versions of the `:dets` table on a CDN might be one way to go.
You're also right that I did no testing in an umbrella app, although it's not immediately clear to me why it would fail.
Anyway, more than happy to collaborate on this when you have the time. Feel free to throw hints and suggestions in this thread and I can also do some work on it.
Here are some notes on the overall process and some initial thoughts on moving forward:
- The raw data is stored as a compressed `.etf.zip` file, and the `:dets` file is built from it.
- For the `:dets` and `:ets` backends the raw data is saved and the native match specs are used for the lookup. It's quite slow.
- For the index-cache variants of `:dets` and `:ets`, the bounding boxes are kept in memory and the key is then used for the lookup, which is quite fast.
- Adjust the data management process to store versioned `:dets` files on a CDN, which would make it easier for consuming. Happy to look at that this weekend. Or if you've got a better strategy I'm definitely open to it.
- I'll make sure to `ensure_all_started` for `:inets` and `:ssl` in the mix task.
- Add certificate verification code in there, which has been on the todo list for a while (a rough sketch of both follows this list).
- I've not spent any time on projects that use `:dets` before, so any suggestions you have for hardening that part of the code are welcome. I only spent time on building the file and match specs.
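Not the library's actual code, but roughly what the explicit application starts plus a verified download could look like in the mix task (the URL and cacertfile path below are placeholders):

```elixir
# Sketch only, not tz_world's implementation. Start :inets and :ssl so :httpc
# works from a mix task, and verify the server certificate on download.
{:ok, _} = Application.ensure_all_started(:inets)
{:ok, _} = Application.ensure_all_started(:ssl)

# Placeholder URL and CA bundle path
url = ~c"https://example.com/timezones-geodata.etf.zip"
cacertfile = ~c"/etc/ssl/certs/ca-certificates.crt"

http_opts = [ssl: [verify: :verify_peer, cacertfile: cacertfile, depth: 3]]

{:ok, {{_, 200, _}, _headers, body}} =
  :httpc.request(:get, {url, []}, http_opts, body_format: :binary)

File.write!("timezones-geodata.etf.zip", body)
```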
Hey @kipcole9, I was almost going to message you on the forum! Great.
> My thinking was that the compressed data and the `:dets` file would be created on a dev machine and then using whatever release strategy you use, that file or files would be transferred to any production machine. But of course you are right - that won't help if there's a data update, and dynamically updating data is a design goal. Perhaps maintaining downloadable versions of the `:dets` table on a CDN might be one way to go.
In a way, if you have control of the release building process, or you can configure it to build during the release, you might be able to ship that within the release itself. So having the raw zipped data and then unpacking and building it on release sounds like a good option. But just to confirm, the dets table gets pretty large, right? I accidentally committed it to git and the push was rejected for exceeding the size limit - it was a bit more than 600 MB. So although doable, you probably don't want to be baking such a large file into the release?
Downloading the file from a CDN would have the same issue if that's its size (I imagine the 600 MB is correct, since having it in memory goes to 1 GB and so on?).
So the other option is to bake in/download the raw data, which is a much more manageable size, but it then needs to be decompressed and built during startup, which is why I'm inclined to work on this.
> You're also right that I did no testing in an umbrella app although not immediately clear to me why it would fail.
In a small project I had started for building a deployer for elixir I had run into the same issue and it's related to the paths.
One way to solve it is:
```elixir
if Mix.Project.umbrella?() do
  [umbrella_root, _] = Application.app_dir(:tz_world) |> String.split("/_build/")
  String.to_charlist(umbrella_root <> "/deps/tz_world/priv/timezones-geodata.dets")
else
  :code.priv_dir(:tz_world) ++ '/timezones-geodata.dets'
end
```
This goes in the dets.ex backend file, but it will fail in a release because Mix won't be available. It seems the best way would be to provide an overridable function to get the path filenames; this would apply to the dets file, the compressed data, etc., and would allow you to place the resulting compressed data and dets file wherever you'd like, e.g. in the `priv` dir of one of your own apps.
This is what I mean by "revamp": I think the startup etc. should be rewritten in a way that lets you pass those options (probably as a struct with defined fields) to the init of the backends and to the tasks.
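Something like this, purely as an illustration (this struct and its field names are hypothetical, not part of tz_world today):

```elixir
# Hypothetical options struct for the backends and mix tasks; names invented
# for illustration only.
defmodule TzWorld.Options do
  defstruct data_dir: nil,
            compressed_data_file: "timezones-geodata.etf.zip",
            dets_file: "timezones-geodata.dets"

  def dets_path(%__MODULE__{data_dir: dir, dets_file: file}), do: Path.join(dir, file)

  def compressed_data_path(%__MODULE__{data_dir: dir, compressed_data_file: file}),
    do: Path.join(dir, file)
end
```

The backends' init and the mix tasks would then receive a `%TzWorld.Options{}` instead of reading compiled-in paths.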
The `:dets` shakiness I believe would be more visible in dev, because you might close/quit the shell abruptly and leave the dets file unusable, although it can still happen on a running released instance if it goes down abruptly while dets is open. This is basically just handling the result of the dets open: if the file is corrupt you'll get an `:error` tuple saying it's not a valid dets file. At that point the best thing is just to erase the file and restart the filling of the table (we could log that happening), because that will fix it, and if the file is damaged beyond repair it's better than having the app crash because of it.
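A sketch of that handling (module and function names are made up; `rebuild_fun` stands in for whatever refills the table):

```elixir
defmodule MyApp.TzData do
  require Logger

  # If :dets.open_file/2 says the file is not a valid dets file (or any other
  # error), remove it, log the event and rebuild instead of crashing the app.
  def open_or_rebuild(path, rebuild_fun) do
    case :dets.open_file(:timezones, file: String.to_charlist(path)) do
      {:ok, table} ->
        {:ok, table}

      {:error, reason} ->
        Logger.warning("dets file #{path} unusable: #{inspect(reason)}, rebuilding")
        File.rm(path)
        rebuild_fun.(path)
    end
  end
end
```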
I see here:
> Indexing is the process of identifying the bounding box of each timezone. There may be more than one bounding box if the timezone geometry is disjoint. That gives us data which has a key that is a bounding box and a value that is the raw data. All of it in memory, so yes - 1 GB.
So we have to hold the entire dataset in memory in order to build the indexes, correct? Would this step be too slow if, instead of in memory, we filled a dets table and then looked that up (see the sketch below)? And if that worked to lower the memory requirements, could we have a configurable option for how to build the initial dataset?
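As a rough illustration of that idea (the `stream_timezone_records/0` function is hypothetical; it would lazily yield `{bounding_box, geometry}` tuples):

```elixir
# Hypothetical sketch: fill the dets table in chunks instead of materialising
# the whole dataset in memory first, so peak memory depends on the chunk
# size, not the dataset. stream_timezone_records/0 is not part of tz_world.
{:ok, table} = :dets.open_file(:tz_world_build, file: ~c"/tmp/tz_world_build.dets")

stream_timezone_records()
|> Stream.chunk_every(500)
|> Enum.each(fn chunk -> :dets.insert(table, chunk) end)

:ok = :dets.close(table)
```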
> So although doable you probably don't want to be baking such a large file into the release?
Oh for sure - that's why I mentioned release process, not necessarily the release artifact. Nearly all the plumbing to download and instantiate the data at application startup exists so I think pre-packaging and putting it on a CDN probably makes the most sense.
> The :dets shakiness I believe would be more visible while on dev
Yes, that makes sense and I can do something about that for sure.
> So we have to hold the entire data in memory in order to build the indexes correct?
Correct. But I can make it such that that's only an issue for me (as the maintainer), not a library user or the runtime. The performance of the `:dets_with_index_cache` backend is pretty good and it only takes 25 MB of memory for the bounding boxes. I will need to do some work on packaging releases and uploading to a CDN (probably Cloudflare since it has a free tier) but it's not too much work.
Therefore, following this thread, I propose so far:
- Package up the data as a `:dets` file, store it on a CDN and have it dynamically loaded at app start or on demand. This is very close to what happens today, except instead of the raw data, do it for the `:dets` file.
- Be more resilient in dev when the `:dets` file isn't closed properly.
- Be clear that the recommended backend is `:dets_with_index_cache`.
- Remove the compiled-in default data path. I probably should be using `:code.priv_dir(:tz_world)`, not the relative path. Bit sloppy of me really. And of course don't do that as a module attribute - make it runtime (sketched below).
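For that last point, the difference is roughly this (module name invented for the example):

```elixir
defmodule TzWorld.DataDir do
  # Compile time (the problematic pattern): a module attribute is evaluated
  # when the library is compiled, so the path gets baked into the .beam file.
  # @data_dir to_string(:code.priv_dir(:tz_world))

  # Runtime: resolve the priv dir whenever it's needed, so it is correct
  # inside releases and umbrella apps alike.
  def default_data_dir do
    :tz_world |> :code.priv_dir() |> to_string()
  end
end
```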
I'll work on this over the weekend - please adjust the plan as required!
@kipcole9 so you mean that after all is done, and the dets table with the index cache is built, all you need to do the lookups is a 25 MB dets table (on disk?)? If that's the case then for my use case I will definitely just build the dets table and put it in my priv. There's still the issue of it getting corrupted, but one solution with that file size would be to make that "original" dets file a "backup" that won't be touched (like appending `_backup` or `_source` to the filename), and on instantiation copy that file, which should be pretty fast, and use the copy as the dets table. If it gets corrupted somehow, then all that's needed is to overwrite the corrupted copy with the backup. This would make it much more feasible to host it outside or even pack it in the release/container/repo.
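Roughly like this (paths and table name are placeholders):

```elixir
# Keep a pristine copy and restore from it whenever the working dets file
# turns out to be corrupt.
backup = "priv/timezones-geodata_backup.dets"
working = "priv/timezones-geodata.dets"

unless File.exists?(working), do: File.cp!(backup, working)

case :dets.open_file(:timezones, file: String.to_charlist(working)) do
  {:ok, table} ->
    {:ok, table}

  {:error, _reason} ->
    # e.g. a "not a dets file" error: overwrite with the backup and retry
    File.cp!(backup, working)
    :dets.open_file(:timezones, file: String.to_charlist(working))
end
```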
We would still need configurable paths nonetheless.
But then I don't understand why I got such a big dets file, because if I'm not mistaken I only used the `dets_with_index_cache` backend.
I don't have much time to dig into it right now, but I will in about 2 weeks, and I'm keen on working on it as it's been really useful (I'm getting addresses with geopoints and then I need to get their timezones).
Because this is what I see after it has finished starting tz_world
> so you mean that after all is done, and the dets table with the index cache is built, that all you need to do the lookups is a 25MB dets table (on disk?)
Yes, memory usage is 25 MB - this is for the bounding box index. The on-disk `:dets` file is still ~700 MB as you observe, but that has very little memory footprint.
> We would still need configurable paths nonetheless.
The path is actually configurable already, but it's not well documented and I need to fix that. And fix the umbrella behaviour because of the baked-in default.
```elixir
def compressed_data_path do
  Application.get_env(:tz_world, :data_dir, default_data_dir())
  |> Path.join(@compressed_data_file)
  |> to_charlist()
end
```
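So pointing the library at a custom directory is just a config entry (file location in your own app; the `:data_dir` key is the one read above):

```elixir
# config/runtime.exs (or config.exs)
import Config

config :tz_world, data_dir: "/var/lib/my_app/tz_world"
```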
> would be to make that "original" dets file a "backup"
Yes, that would work I think. I'll test it out.
Yeah, so thinking about my own use case a bit more, probably the best approach would be to build the table in a step prior to the release assembly, since releases support that, copy it into a priv dir as a backup, and then build the release with the dets table inside. This works fine on platforms that do the build step, as does this particular app I'm working on right now. You could in fact download the original raw data during the build process as well and not need to include the raw data in the repo. It would still need the paths for the umbrella version, I imagine, and dets could still get corrupted.
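A possible shape for that pre-assembly step in `mix.exs` (the `tz_world.update` task name is an assumption here; use whatever task actually builds the data):

```elixir
# mix.exs - sketch of building the data as a custom release step before :assemble
def project do
  [
    app: :my_app,
    version: "0.1.0",
    releases: [
      my_app: [
        steps: [&build_tz_data/1, :assemble]
      ]
    ]
  ]
end

defp build_tz_data(release) do
  # Assumed task name; check the library's docs for the actual one
  Mix.Task.run("tz_world.update")
  release
end
```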
On the other hand, if you are building the release on a more limited machine, or if you need to distribute it, you will probably run into the same problems, since it will have a gigantic size. In those cases it's probably better to keep only the raw data and assemble the dets table somehow on startup, so I think some improvements can be made either way.
Anyway, I'll give it a go once I have a bit more time, using some patterns I've used before, and we can still discuss other points or approaches in the meantime.
Thanks for taking the time to answer, @kipcole9!
I have published tz_world version 0.5.0, which addresses most of the issues but not yet the data generation and distribution issue. The changelog entry is:
- Move compile-time configuration of the data directory to runtime and remove the hard-coded default path
- Start the `:inets` and `:ssl` applications in the downloader mix task
- Add certificate verification when downloading updates to the geo data
- Document the `:data_dir` and `:cacertfile` configuration options in the README.md file
- The backends `:dets` and `:dets_with_index_cache` now open the `:dets` file with `access: :read`, which prevents errors if the file is abnormally closed.
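For reference, the read-only open corresponds to roughly this call (a sketch, not the library's exact code):

```elixir
# Per the changelog entry above, access: :read avoids the errors previously
# seen when the file had been closed abnormally.
{:ok, table} =
  :dets.open_file(:timezones, file: ~c"priv/timezones-geodata.dets", access: :read)
```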
@kipcole9 that was quick!
Great solution for the `:dets` with `access: :read`, and adding certificate verification is a welcome improvement for sure.
I will close this, and get back to you once I have some time to play around with some ideas and check out the data itself, and we can take it from there if wanted - at least I'll be able to add some info to the README based on what you suggested and on what I end up doing in terms of building it during a release. 👍
Hi,
I've found this package through a post by @kipcole9 on elixirforum.com and it has helped me greatly in solving the problem of getting timezones from geo points, but I've run into a few issues that I think I can help solve and would like to know if you're open to a bit of a revamp on it.
The issues I identified were:
- The download in the mix task was failing (returning `nil`). This was solvable by adding explicit starts of the `:inets` and `:ssl` applications in the mix task.
- The `dets` table - it's easy to get a corrupt file, and it will bail when reloading the file.
- The unzipping of the raw data (doing `TzWorld.GeoData.load_compressed_data()` results in a process with around 420 MB memory consumption), and the way the data is iterated and inserted adds about 530 MB, because when doing a reload of the table it spikes to 950-something MB. After finishing it goes down to 15 MB or so, but the spike is enough to get it killed on some platforms that monitor memory consumption, so you need to provision the memory. I think this can probably be improved even if at a slight cost in the speed of filling (a rough way to measure this is sketched at the end of this comment).

I would like to know if you would be open to some work on these issues? If so, especially for the loading part, I would love to have some help understanding the form of the data we're reading and what we're taking from that raw data, the passes and the actual insertion. Perhaps we could even create an optimised version of the raw data to be used subsequently (e.g. to be included in releases and whatnot)?
I would be able to work on it in about 2 weeks.
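As an aside, a rough, library-agnostic way to see the memory spike described above is to sample total BEAM memory around the load step (the reload call in the comment below is just the one already mentioned in this thread):

```elixir
# Measure how much total BEAM memory grows while a given step runs.
measure_growth = fn fun ->
  before = :erlang.memory(:total)
  result = fun.()
  grew_mb = div(:erlang.memory(:total) - before, 1_000_000)
  IO.puts("BEAM memory grew by ~#{grew_mb} MB")
  result
end

# Example:
# measure_growth.(fn -> TzWorld.GeoData.load_compressed_data() end)
```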