Closed: mnussbaumer closed this issue 4 years ago
@mnussbaumer very open to collaborating on this.
My thinking was that the compressed data and the `:dets` file would be created on a dev machine and then, using whatever release strategy you use, that file or files would be transferred to any production machine. But of course you are right - that won't help if there's a data update, and dynamically updating data is a design goal. Perhaps maintaining downloadable versions of the `:dets` table on a CDN might be one way to go.
You're also right that I did no testing in an umbrella app, although it's not immediately clear to me why it would fail.
Anyway, more than happy to collaborate on this when you have the time. Feel free to throw hints and suggestions in this thread and I can also do some work on it.
Here are some notes on the overall process and some initial thoughts on moving forward:
- The raw data is stored as a compressed `.etf.zip` file, and the `:dets` file is built from it.
- For the `:dets` and `:ets` backends the raw data is saved and the native match specs are used for the lookup. It's quite slow.
- For the index-cache variants of `:dets` and `:ets`, the bounding boxes are kept in memory and the key is then used for the lookup, which is quite fast.
- Adjust the data management process to store versioned `:dets` files on a CDN, which would make it easier for consuming. Happy to look at that this weekend. Or if you've got a better strategy I'm definitely open to it.
- I'll make sure to `ensure_all_started` for `:inets` and `:ssl` in the mix task.
- Add certificate verification code in there, which has been on the todo list for a while (a rough sketch of both follows this list).
- I've not spent any time on projects that use `:dets` before, so any suggestions you have for hardening that part of the code are welcome. I only spent time on building the file and match specs.
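Not the library's actual code, but roughly what the explicit application starts plus a verified download could look like in the mix task (the URL and cacertfile path below are placeholders):

```elixir
# Sketch only, not tz_world's implementation. Start :inets and :ssl so :httpc
# works from a mix task, and verify the server certificate on download.
{:ok, _} = Application.ensure_all_started(:inets)
{:ok, _} = Application.ensure_all_started(:ssl)

# Placeholder URL and CA bundle path
url = ~c"https://example.com/timezones-geodata.etf.zip"
cacertfile = ~c"/etc/ssl/certs/ca-certificates.crt"

http_opts = [ssl: [verify: :verify_peer, cacertfile: cacertfile, depth: 3]]

{:ok, {{_, 200, _}, _headers, body}} =
  :httpc.request(:get, {url, []}, http_opts, body_format: :binary)

File.write!("timezones-geodata.etf.zip", body)
```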
Hey @kipcole9, I was almost going to message you on the forum! Great.
> My thinking was that the compressed data and the `:dets` file would be created on a dev machine and then using whatever release strategy you use, that file or files would be transferred to any production machine. But of course you are right - that won't help if there's a data update, and dynamically updating data is a design goal. Perhaps maintaining downloadable versions of the `:dets` table on a CDN might be one way to go.
In a way, if you have control of the release building process, or you can configure it to build during the release, you might be able to ship that within the release itself. So having the raw zipped data and then unpacking and building it on release sounds like a good option. But just to confirm, the dets table gets pretty large, right? I accidentally committed it to git and the push was rejected for exceeding the size limit - it was a bit more than 600 MB. So although doable, you probably don't want to be baking such a large file into the release?
Downloading the file from a CDN would have the same issue if that's its size (I imagine the 600 MB is correct, since having it in memory goes to 1 GB and so on?).
So the other option is to bake in/download the raw data, which is a much more manageable size, but it then needs to be decompressed and built during startup, which is why I'm inclined to work on this.
> You're also right that I did no testing in an umbrella app although not immediately clear to me why it would fail.
In a small project I had started for building a deployer for elixir I had run into the same issue and it's related to the paths.
One way to solve it is:
```elixir
if Mix.Project.umbrella?() do
  [umbrella_root, _] = Application.app_dir(:tz_world) |> String.split("/_build/")
  String.to_charlist(umbrella_root <> "/deps/tz_world/priv/timezones-geodata.dets")
else
  :code.priv_dir(:tz_world) ++ '/timezones-geodata.dets'
end
```
This goes in the dets.ex backend file, but it will fail in a release because Mix won't be available. It seems the best way would be to provide an overridable function to get the path filenames; this would apply to the dets file, the compressed data, etc., and would allow you to place the resulting compressed data and dets file wherever you'd like, e.g. in the `priv` dir of one of your own apps.
This is what I mean by "revamp": I think the startup etc. should be rewritten in a way that lets you pass those options (probably as a struct with defined fields) to the init of the backends and to the tasks.
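Something like this, purely as an illustration (this struct and its field names are hypothetical, not part of tz_world today):

```elixir
# Hypothetical options struct for the backends and mix tasks; names invented
# for illustration only.
defmodule TzWorld.Options do
  defstruct data_dir: nil,
            compressed_data_file: "timezones-geodata.etf.zip",
            dets_file: "timezones-geodata.dets"

  def dets_path(%__MODULE__{data_dir: dir, dets_file: file}), do: Path.join(dir, file)

  def compressed_data_path(%__MODULE__{data_dir: dir, compressed_data_file: file}),
    do: Path.join(dir, file)
end
```

The backends' init and the mix tasks would then receive a `%TzWorld.Options{}` instead of reading compiled-in paths.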
The `:dets` shakiness I believe would be more visible in dev, because you might close/quit the shell abruptly and leave the dets file unusable, although it can still happen on a running released instance if it goes down abruptly while dets is open. This is basically just handling the result of the dets open: if the file is corrupt you'll get an `:error` tuple saying it's not a valid dets file. At that point the best thing is just to erase the file and restart the filling of the table (we could log that happening), because that will fix it, and if the file is damaged beyond repair it's better than having the app crash because of it.
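A sketch of that handling (module and function names are made up; `rebuild_fun` stands in for whatever refills the table):

```elixir
defmodule MyApp.TzData do
  require Logger

  # If :dets.open_file/2 says the file is not a valid dets file (or any other
  # error), remove it, log the event and rebuild instead of crashing the app.
  def open_or_rebuild(path, rebuild_fun) do
    case :dets.open_file(:timezones, file: String.to_charlist(path)) do
      {:ok, table} ->
        {:ok, table}

      {:error, reason} ->
        Logger.warning("dets file #{path} unusable: #{inspect(reason)}, rebuilding")
        File.rm(path)
        rebuild_fun.(path)
    end
  end
end
```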
I see here:
> Indexing is the process of identifying the bounding box of each timezone. There may be more than one bounding box if the timezone geometry is disjoint. That gives us data which has a key that is a bounding box and a value that is the raw data. All of it in memory, so yes - 1 GB.
So we have to hold the entire dataset in memory in order to build the indexes, correct? Would this step be too slow if, instead of in memory, we filled a dets table and then looked that up (see the sketch below)? And if that worked to lower the memory requirements, could we have a configurable option for how to build the initial dataset?
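As a rough illustration of that idea (the `stream_timezone_records/0` function is hypothetical; it would lazily yield `{bounding_box, geometry}` tuples):

```elixir
# Hypothetical sketch: fill the dets table in chunks instead of materialising
# the whole dataset in memory first, so peak memory depends on the chunk
# size, not the dataset. stream_timezone_records/0 is not part of tz_world.
{:ok, table} = :dets.open_file(:tz_world_build, file: ~c"/tmp/tz_world_build.dets")

stream_timezone_records()
|> Stream.chunk_every(500)
|> Enum.each(fn chunk -> :dets.insert(table, chunk) end)

:ok = :dets.close(table)
```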
> So although doable you probably don't want to be baking such a large file into the release?
Oh for sure - that's why I mentioned release process, not necessarily the release artifact. Nearly all the plumbing to download and instantiate the data at application startup exists so I think pre-packaging and putting it on a CDN probably makes the most sense.
> The :dets shakiness I believe would be more visible while on dev
Yes, that makes sense and I can do something about that for sure.
> So we have to hold the entire data in memory in order to build the indexes correct?
Correct. But I can make it such that that's only an issue for me (as the maintainer), not a library user or the runtime. The performance of the `:dets_with_index_cache` backend is pretty good and it only takes 25 MB of memory for the bounding boxes. I will need to do some work on packaging releases and uploading to a CDN (probably Cloudflare since it has a free tier) but it's not too much work.
Therefore, following this thread, I propose so far:
- Package up the data as a `:dets` file, store it on a CDN and have it dynamically loaded at app start or on demand. This is very close to what happens today, except instead of the raw data, do it for the `:dets` file.
- Be more resilient in dev when the `:dets` file isn't closed properly.
- Be clear that the recommended backend is `:dets_with_index_cache`.
- Remove the compiled-in default data path. I probably should be using `:code.priv_dir(:tz_world)`, not the relative path. Bit sloppy of me really. And of course don't do that as a module attribute - make it runtime (sketched below).
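For that last point, the difference is roughly this (module name invented for the example):

```elixir
defmodule TzWorld.DataDir do
  # Compile time (the problematic pattern): a module attribute is evaluated
  # when the library is compiled, so the path gets baked into the .beam file.
  # @data_dir to_string(:code.priv_dir(:tz_world))

  # Runtime: resolve the priv dir whenever it's needed, so it is correct
  # inside releases and umbrella apps alike.
  def default_data_dir do
    :tz_world |> :code.priv_dir() |> to_string()
  end
end
```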
I'll work on this over the weekend - please adjust the plan as required!
@kipcole9 so you mean that after all is done, and the dets table with the index cache is built, all you need to do the lookups is a 25 MB dets table (on disk?)? If that's the case then for my use case I will definitely just build the dets table and put it in my priv. There's still the issue of it getting corrupted, but one solution with that file size would be to make that "original" dets file a "backup" that won't be touched (like appending `_backup` or `_source` to the filename), and on instantiation copy that file, which should be pretty fast, and use the copy as the dets table. If it gets corrupted somehow, then all that's needed is to overwrite the corrupted copy with the backup. This would make it much more feasible to host it outside or even pack it in the release/container/repo.
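Roughly like this (paths and table name are placeholders):

```elixir
# Keep a pristine copy and restore from it whenever the working dets file
# turns out to be corrupt.
backup = "priv/timezones-geodata_backup.dets"
working = "priv/timezones-geodata.dets"

unless File.exists?(working), do: File.cp!(backup, working)

case :dets.open_file(:timezones, file: String.to_charlist(working)) do
  {:ok, table} ->
    {:ok, table}

  {:error, _reason} ->
    # e.g. a "not a dets file" error: overwrite with the backup and retry
    File.cp!(backup, working)
    :dets.open_file(:timezones, file: String.to_charlist(working))
end
```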
We would still need configurable paths nonetheless.
But then I don't understand why I got such a big dets file, because if I'm not mistaken I only used the `dets_with_index_cache` backend.
I don't have much time to dig into it right now, but I will in about 2 weeks, and I'm keen on working on it as it's been really useful (I'm getting addresses with geopoints and then I need to get their timezones).
Because this is what I see after it has finished starting tz_world
> so you mean that after all is done, and the dets table with the index cache is built, that all you need to do the lookups is a 25MB dets table (on disk?)
Yes, memory usage is 25 MB - this is for the bounding box index. The on-disk `:dets` file is still ~700 MB as you observe, but that has very little memory footprint.
> We would still need configurable paths nonetheless.
The path is actually configurable already, but it's not well documented and I need to fix that. And fix the umbrella behaviour because of the baked-in default.
```elixir
def compressed_data_path do
  Application.get_env(:tz_world, :data_dir, default_data_dir())
  |> Path.join(@compressed_data_file)
  |> to_charlist()
end
```
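So pointing the library at a custom directory is just a config entry (file location in your own app; the `:data_dir` key is the one read above):

```elixir
# config/runtime.exs (or config.exs)
import Config

config :tz_world, data_dir: "/var/lib/my_app/tz_world"
```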
> would be to make that "original" dets file a "backup"
Yes, that would work I think. I'll test it out.
Yeah, so thinking about my own use case a bit more, probably the best approach would be to build the table in a step prior to the release assembly, since releases support that, copy it into a priv dir as a backup, and then build the release with the dets table inside. This works fine on platforms that do the build step, as does this particular app I'm working on right now. You could in fact download the original raw data during the build process as well and not need to include the raw data in the repo. It would still need the paths for the umbrella version, I imagine, and dets could still get corrupted.
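A possible shape for that pre-assembly step in `mix.exs` (the `tz_world.update` task name is an assumption here; use whatever task actually builds the data):

```elixir
# mix.exs - sketch of building the data as a custom release step before :assemble
def project do
  [
    app: :my_app,
    version: "0.1.0",
    releases: [
      my_app: [
        steps: [&build_tz_data/1, :assemble]
      ]
    ]
  ]
end

defp build_tz_data(release) do
  # Assumed task name; check the library's docs for the actual one
  Mix.Task.run("tz_world.update")
  release
end
```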
On the other hand, if you are building the release on a more limited machine, or if you need to distribute it, you will probably run into the same problems, since it will have a gigantic size. In those cases it's probably better to keep only the raw data and assemble the dets table somehow on startup, so I think some improvements can be made either way.
Anyway, I'll give it a go once I have a bit more time, using some patterns I've used before, and we can still discuss other points or approaches in the meantime.
Thanks for taking the time to answer, @kipcole9!
I have published tz_world version 0.5.0, which addresses most of the issues but not yet the data generation and distribution issue. The changelog entry is:
- Move compile-time configuration of the data directory to runtime and remove the hard-coded default path
- Start the `:inets` and `:ssl` applications in the downloader mix task
- Add certificate verification when downloading updates to the geo data
- Document the `:data_dir` and `:cacertfile` configuration options in the README.md file
- The backends `:dets` and `:dets_with_index_cache` now open the `:dets` file with `access: :read`, which prevents errors if the file is abnormally closed.
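For reference, the read-only open corresponds to roughly this call (a sketch, not the library's exact code):

```elixir
# Per the changelog entry above, access: :read avoids the errors previously
# seen when the file had been closed abnormally.
{:ok, table} =
  :dets.open_file(:timezones, file: ~c"priv/timezones-geodata.dets", access: :read)
```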
@kipcole9 that was quick!
Great solution for the `:dets` with `access: :read`, and adding certificate verification is a welcome improvement for sure.
I will close this, and get back to you once I have some time to play around with some ideas and check out the data itself, and we can take it from there if wanted - at least I'll be able to add some info to the README based on what you suggested and on what I end up doing in terms of building it during a release. 👍
Hi,
I've found this package through a post by @kipcole9 on elixirforum.com and it has helped me greatly in solving the problem of getting timezones from geo points, but I've run into a few issues that I think I can help solve and would like to know if you're open to a bit of a revamp on it.
The issues I identified were:
- The download in the mix task was failing (returning `nil`). This was solvable by adding explicit starts of the `:inets` and `:ssl` applications in the mix task.
- The `dets` table - it's easy to get a corrupt file, and it will bail when reloading the file.
- The unzipping of the raw data (doing `TzWorld.GeoData.load_compressed_data()` results in a process with around 420 MB memory consumption), and the way the data is iterated and inserted adds about 530 MB, because when doing a reload of the table it spikes to 950-something MB. After finishing it goes down to 15 MB or so, but the spike is enough to get it killed on some platforms that monitor memory consumption, so you need to provision the memory. I think this can probably be improved even if at a slight cost in the speed of filling (a rough way to measure this is sketched at the end of this comment).

I would like to know if you would be open to some work on these issues? If so, especially for the loading part, I would love to have some help understanding the form of the data we're reading and what we're taking from that raw data, the passes and the actual insertion. Perhaps we could even create an optimised version of the raw data to be used subsequently (e.g. to be included in releases and whatnot)?
I would be able to work on it in about 2 weeks.
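As an aside, a rough, library-agnostic way to see the memory spike described above is to sample total BEAM memory around the load step (the reload call in the comment below is just the one already mentioned in this thread):

```elixir
# Measure how much total BEAM memory grows while a given step runs.
measure_growth = fn fun ->
  before = :erlang.memory(:total)
  result = fun.()
  grew_mb = div(:erlang.memory(:total) - before, 1_000_000)
  IO.puts("BEAM memory grew by ~#{grew_mb} MB")
  result
end

# Example:
# measure_growth.(fn -> TzWorld.GeoData.load_compressed_data() end)
```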