[WIP] support streaming osm files through libexpat

yeesian commented 9 years ago

Just for comparison, when reading an osm file that's ~320MB (extracts available here):

it takes up about 7.6GB RAM using the tree-based parser from LightXML:

julia> @time nodes, highways, buildings, features = getOSMData(filename, nodes=true, highways=true, buildings=true, features=true);
elapsed time: 77.810447564 seconds (7885139620 bytes allocated, 70.46% gc time)

it takes up around 0.5GB RAM using the event-based parser from LibExpat:

julia> @time nodes, highways, buildings, features = getOSMData(filename, true);
elapsed time: 627.010413888 seconds (176812171176 bytes allocated, 89.00% gc time)

However, garbage collection is making it very slow, and I'm not sure how to improve the performance. Any suggestions? @garborg, @mlubin, @IainNZ, @joehuchette

Some numbers:

julia> length(nodes)
2447660

julia> length(highways)
65610

julia> length(buildings)
185823

julia> length(features)
12077

IainNZ commented 9 years ago

@yeesian try running it with http://docs.julialang.org/en/latest/stdlib/profile/#direct-analysis-of-memory-allocation to identify worst spots - also consider disabling gc selectively if its just generating a million of small objects in a tight loop.

yeesian commented 9 years ago

I've tried disabling gc selectively, but haven't yet gotten it to work. On the few occasions where it seemed promising, it blew up in memory; I'll try profiling as you suggest

(btw, if you're looking for a smaller file to play with, @tedsteiner has a small instance on dropbox, mostly for testing)

coveralls commented 9 years ago

Coverage remained the same when pulling f677b74a7a956a52bccb5718527bcb0349ef1118 on yeesian:streaming-xml-parser into 2293c618f599d0ff79d508285671301b95dee257 on tedsteiner:master.

yeesian commented 9 years ago

Sorry for the noise, i did a git revert where I really meant to do a git reset instead.

yeesian commented 9 years ago

Thought you all might be interested to know: here's the latest comparison, when reading an osm file that's ~320MB (extracts available here)

it takes up about 7.6GB RAM using the tree-based parser from LightXML:

julia> @time nodes, highways, buildings, features = getOSMData(filename, nodes=true, highways=true, buildings=true, features=true);
elapsed time: 77.810447564 seconds (7885139620 bytes allocated, 70.46% gc time)

it takes up around 0.5GB RAM using the event-based parser from LibExpat (on master):

julia> @time nodes, highways, buildings, features = getOSMData(filename, true);
elapsed time: 85.566212155 seconds (10574148240 bytes allocated, 65.05% gc time)

mlubin commented 9 years ago

:+1:

garborg commented 9 years ago

Excellent!

On Fri, Jan 2, 2015 at 6:15 PM, Miles Lubin notifications@github.com wrote:

[image: :+1:]

— Reply to this email directly or view it on GitHub https://github.com/tedsteiner/OpenStreetMap.jl/pull/55#issuecomment-68577853 .

yeesian commented 9 years ago

I'm wondering now

~~are the LightXML portions of the codebase still needed?~~ (removed, except for parseMapXML)
should I provide keyword arguments for the user to select those aspects of the map they want? Or just parse everything, and let them discard the portions they don't need

coveralls commented 9 years ago

Coverage increased (+1.15%) when pulling dc730f97aacef3bb9e88151bf96f8b5df7f5a785 on yeesian:streaming-xml-parser into 2293c618f599d0ff79d508285671301b95dee257 on tedsteiner:master.

coveralls commented 9 years ago

Coverage increased (+1.15%) when pulling 5a1d50bf4613e53d75c202260f32370749c47973 on yeesian:streaming-xml-parser into 2293c618f599d0ff79d508285671301b95dee257 on tedsteiner:master.

garborg commented 9 years ago

@yeesian I think letting users select what they want parsed should be on the roadmap. I'm not certain keyword arguments is a granular enough solution in the longterm, but I haven't thought about it much. Anyway, I'd be for implementing something for now or just leaving the design open to adding it later.

yeesian commented 9 years ago

Mm, agreed. I think it's worth having on the roadmap, but is not an immediate priority (at least to me). I'd prefer to get the other aspects of the parsing right (e.g. relations/turn-restrictions) working first, before introducing conditional filtering. That said, if you wish, feel free to have a shot!

One way to implement it would be simply to discard the irrelevant parts within getOSMData and just return the ones specified by the keyword arguments?

yeesian commented 9 years ago

Also, I'm starting to wonder if a Dict might be a better object to return. For users to check what keys/objects are available, and to make it easy for us to introduce new OSM entities

garborg commented 9 years ago

Yep, not trying to suggest someone work on it now. I care about turn restrictions more, too -- that sounds great.

garborg commented 9 years ago

Yes, that sounds good -- returning a tuple that varies based on values of the inputs seems like a bit of an anti-pattern.

coveralls commented 9 years ago

Coverage increased (+1.15%) when pulling de294408a0096e39bc11050f391eeb1e0283294b on yeesian:streaming-xml-parser into 2293c618f599d0ff79d508285671301b95dee257 on tedsteiner:master.

coveralls commented 9 years ago

Coverage increased (+1.15%) when pulling de294408a0096e39bc11050f391eeb1e0283294b on yeesian:streaming-xml-parser into 2293c618f599d0ff79d508285671301b95dee257 on tedsteiner:master.

garborg commented 9 years ago

Oh no, just thought of this -- I often only parse nodes and highways -- instead of a 10% slowdown, it will be much slower in cases like that -- I'm not sure what @tedsteiner does in his work, but he's the heaviest user, so I'll defer to him.

In the meantime I'll run a quick test and report numbers -- and if if it happens people need it a little faster before merging into master, perhaps a feature branch we all could commit to to get it to parity as soon as possible.

yeesian commented 9 years ago

If the numbers are significant, I'll have a look to see what I can do in this PR. I'm guessing that it might be a challenge though.. since (i) the number of features is usually small, and (ii) you need to parse the tags to disambiguate between a building and a highway, which comes only after parsing their nd references, and so you can't "skip over" the parsing (in an event-based system) the way you previously could (in a tree-based system).

That said, it's always nice to have benchmarks to aim for (:

garborg commented 9 years ago

I tried a couple extracts, and I'm seeing performance like you mentioned for parsing everything, and performance 2-3x slower for parsing just the nodes and highways, e.g.:

➜  ~  julia-dev -e 'include("osmp.jl"); time_fun(tree4)'
elapsed time: 38.797660288 seconds (3637203160 bytes allocated, 34.39% gc time)
➜  ~  julia-dev -e 'include("osmp.jl"); time_fun(tree4)'
elapsed time: 30.45888161 seconds (3637203160 bytes allocated, 43.91% gc time)
➜  ~
➜  ~  julia-dev -e 'include("osmp.jl"); time_fun(tree2)'
elapsed time: 11.748997784 seconds (792707816 bytes allocated, 4.55% gc time)
➜  ~  julia-dev -e 'include("osmp.jl"); time_fun(tree2)'
elapsed time: 11.620226646 seconds (792707816 bytes allocated, 4.68% gc time)
➜  ~
Switched to branch 'streaming'
➜  OpenStreetMap git:(streaming)
➜  ~
➜  ~  julia-dev -e 'include("osmp.jl"); time_fun(stream4)'
elapsed time: 31.529423398 seconds (6290493080 bytes allocated, 40.69% gc time)
➜  ~  julia-dev -e 'include("osmp.jl"); time_fun(stream4)'
elapsed time: 31.566031387 seconds (6290493080 bytes allocated, 40.51% gc time)
➜  ~
➜  ~  julia-dev -e 'include("osmp.jl"); time_fun(stream2)'
elapsed time: 31.150185125 seconds (6290493112 bytes allocated, 40.87% gc time)
➜  ~  julia-dev -e 'include("osmp.jl"); time_fun(stream2)'
elapsed time: 31.565353016 seconds (6290493112 bytes allocated, 41.32% gc time)

Notes: While this PR appears to allocate much more than master, it can handle larger files without running out of space. All the garbage created in the current parser happens in parsing buildings and features -- maybe that's avoidable. I'm not sure what to suggest between keeping both parsers around, keeping a separate branch for a little bit, and just merging it. .

yeesian commented 9 years ago

Thanks for the benchmarks! Guess I should not have collapsed the commits; I can revert back to having both parsers tmr, but --

I'd imagine it to be confusing to anyone using openstreetmap to suddenly see the memory requirement spike (when they are in fact asking for fewer OSM objects), so we'll have to be careful with the documentation.

2-3x slower does sound like quite abit though (is it 10 vs 30sec?); I guess I should look into possibilities for speeding the parser up, since I've only worked with node and highways myself so far.

garborg commented 9 years ago

@yeesian I think a note in the documentation is enough there, but...

Do you think having two parsers to keep in sync will slow down development efforts implement other parsing? If that's so, I'd be happy to take slower parsing in exchange for a better graph (as long as the new parsing functionality comes with tests, anyone who needed faster parsing would have a clear path).

@tedsteiner, what's more important to you right now?: a) Fast parsing of (nodes and highways) for extracts that the current parser doesn't brush against memory limits on? b) Parsing of larger extracts, and possibly getting more parsing functionality a bit soone

tedsteiner commented 9 years ago

Sorry about being away for a while, this looks awesome, @yeesian!

@garborg TL; DR: I'd lean towards Option B.

My general thoughts are that:

We probably only want one dependency on an XML reading package, rather than two. Less to break. If it looks like one option is going to win out long-term or is more actively developed, that might be a better option. I chose LightXML because it was easier for me to get running and seemed more stable at the time, but I did so reluctantly because I really wanted streaming XML.
As I see it, the memory bloat of parsing the entire XML tree in memory is probably unacceptable long-term, when considering all use cases. I think I selected the boundaries for my example maps of Boston (the ones linked in the Readme) according to the size of the OSM file. I shrank the boundaries until it fit in memory (that OSM file parses to about 3 GB). Anyone wanting to view maps of an entire state or country is going to run out of memory.
In general, I think a longer loading time is fine, since you only need to load each datafile once per session. And it's frustrating to have gigabytes of RAM eaten up by an XML tree you don't even have access to any more. For some reason, at least on my system, the memory stays allocated until I close Julia.
My understanding is that the OSM XML format isn't really intended for fast parsing, but rather for convenience. Applications requiring fast parsing should probably use the binary format. Somewhere on the roadmap we should maybe add the ability to read these binary files, but I don't really have a use for it myself, since the OSM files are working fine and I don't mind waiting a little while to read them.

garborg commented 9 years ago

That all sounds right to me, so I guess the one worry is that we'd be switching away from a tested package under the JuliaLang organization to an untested package that less people have their eyes on -- on the other hand, easy enough to remedy and switching and adding testing upstream sounds like something that needs to happen eventually anyway.

garborg commented 9 years ago

I opened an issue out of curiosity (not to wait on) to see if support for the streaming API was on anyone's LightXML.jl roadmap.

yeesian commented 9 years ago

FWIW, LibExpat also has some tests, even if it's not very well-documented.

I'm okay with having support for multiple parsers, but we don't yet have a parser-agnostic interface (apart from getOSMData and parseMapXML). Longer-term, I don't think tree-based parsing is the way to go for OpenStreetMap anyway (for reasons @tedsteiner has mentioned), and so it doesn't really hurt to remove the existing (tree-based) implementation for LightXML. If and when they do have support for a streaming API, we can probably revisit this discussion, although I don't think the performance difference will be all that much --

My hunch is: this package will probably profit more from understanding/modelling the different OSM entities (e.g. https://github.com/tedsteiner/OpenStreetMap.jl/issues/56), and their relationships, before we embark on the work of conditionally filtering. As it is, we're already leaving out alot of information in the OSM file, that other people might be interested in.

Moving forward, I think it might be nice (from time to time), to abstract out the backends abit, e.g.

[Plotting/Visualization] to use Gadfly/Compose, apart from Winston
[Routing/Algorithms] to use JuMP, or maybe even OSRM, apart from Graphs
[Parsing] LibExpat versus LightXML, etc

garborg commented 9 years ago

Sounds like we're all on the same page.

Good to see the package has tests -- it will be quick work to get Pkg.test("LibExpat") working, Travis set up, and PkgEval to test nightly. I'll submit a PR there for that now, and to get 0.3 support, so this PR will apply to 0.3.

Thanks again for getting this working!

garborg commented 9 years ago

Compatibility / testing PR submitted upstream: amitmurthy/LibExpat.jl#23

yeesian commented 9 years ago

Nice, thanks!

tedsteiner commented 9 years ago

@yeesian @garborg Is there a reason why string fields in OSMattributes have type UTF8String rather than just String or ASCIIString? I'm getting an error on my machine (MacOS, tried Julia 0.3.1 and Julia 0.3.5) when I try to run the tests: MethodError(convert,(UTF8String,\"Alewife\")), , 0, 0, 0" When I changed the fields in OSMattributes to String this error went away and the tests all passed again. It seems like the problem is that the xml file elements are being read as ASCIIString but there is no convert function available from ASCIIString to UTF8String (which seems surprising to me).

Could it be that the XML tags are being read as different types on different architectures? It looks like that test is passing on the Travis Linux system.

garborg commented 9 years ago

@tedsteiner I believe I've run tests and some downstream code on OSX since this was merged, but perhaps only on 0.4 -- I'll see how 0.3 is looking here on my laptop and let you know.

yeesian commented 9 years ago

Apologies, yeah, it was originally just String, but some discussion with @mlubin led me to think that it'll improve performance to have UTF8String, as a concrete type, rather than using String. ASCIIString won't work here, since some of the attributes have chinese characters. Although the bottleneck was later found to be in libexpat, I haven't yet changed back to using String.

garborg commented 9 years ago

@tedsteiner On the latest commit(367af58f07ff035de364852766b8e8e502dc4884), tests pass for me on the heads of both Julia release-0.3 and master for OSX, as they did on Travis for Linux. I'm not sure why it's giving you trouble.

By all means, we can loosen up the type if it's getting in the way of your work, but I'm generally a fan of letting the type system work its magic and generate as specialized of code as possible, even when it's not the main bottleneck.

tedsteiner commented 9 years ago

@garborg Below is the error message I'm getting on OS X. I was able to run the tests on my Linux machine without any problems last night. Does anything stand out to you?

It seems to be reading all of the tags as ASCIIStrings for me, then it is unable to convert them to UTF8Strings. Which seems silly, because why couldn't we do that conversion? But it probably just hasn't been written. When I change over to just using Strings, this error goes away and everything works. But I do agree with your statement that we should let the type system generate as specialized of code as possible.

julia> Pkg.test("OpenStreetMap") INFO: Testing OpenStreetMap testing read_data ... ERROR: "MethodError(convert,(UTF8String,\"\")), , 0, 0, 0" in convert at base.jl:13 in reset_attributes! at /Users/mirage/.julia/OpenStreetMap/src/parseMap.jl:44 in collectValues at /Users/mirage/.julia/OpenStreetMap/src/parseMap.jl:192 in streaming_end_element at /Users/mirage/.julia/LibExpat/src/streaming.jl:90 in XML_Parse at /Users/mirage/.julia/LibExpat/src/lX_common_h.jl:4 in parsefile at /Users/mirage/.julia/LibExpat/src/streaming.jl:221 in getOSMData at /Users/mirage/.julia/OpenStreetMap/src/parseMap.jl:215 in include at /Applications/Julia-0.3.5.app/Contents/Resources/julia/lib/julia/sys.dylib in include_from_node1 at /Applications/Julia-0.3.5.app/Contents/Resources/julia/lib/julia/sys.dylib in anonymous at no file:56 in include at /Applications/Julia-0.3.5.app/Contents/Resources/julia/lib/julia/sys.dylib in include_from_node1 at loading.jl:128 in process_options at /Applications/Julia-0.3.5.app/Contents/Resources/julia/lib/julia/sys.dylib in _start at /Applications/Julia-0.3.5.app/Contents/Resources/julia/lib/julia/sys.dylib while loading /Users/mirage/.julia/OpenStreetMap/test/read_data.jl, in expression starting on line 14 while loading /Users/mirage/.julia/OpenStreetMap/test/runtests.jl, in expression starting on line 11

garborg commented 9 years ago

Huh, that seems like a bug in Julia 0.3.5... Let me download the app and see if I can reproduce. Perhaps we managed to trigger that "Convertacolypse" error.

Probably unrelated, but I just tagged a new version LibExpat, so probably worth running Pkg.update to get Yeesian's speedup and dual 0.3 and 0.4 support.

tedsteiner commented 9 years ago

So here's something weird.

First, I tried loading data with streets including Chinese characters, to see if that triggered the tags to be read as UTF8Strings. I got the same error.

Then, I tried doing the following:

Pkg.test("OpenStreetMap")
using OpenStreetMap
nodes, hwys, builds, feats = getOSMData("filename")

The test fails with the error message above, but then getOSMData works as expected.

Running getOSMData in a fresh Julia session fails, and it continues to do so if you try it repeatedly. So there must be something about how the test is either spinning up libexpat or crashing it, in that it gets it to start reading as UTF8Strings after that (or, possibly, the convert function magically engages).

garborg commented 9 years ago

Shoot, I can't replicate this running the 0.3.5 app with OpenStreetMap checked out, with LibExpat on either v0.0.6 (newly tagged) or v0.0.4 (last 0.3 compatible version).. frustrating. At this point, I'd say switch to using String for now, but I just checked and it looks like that adds 25% overhead.

tedsteiner commented 9 years ago

Do you know what version of libexpat you're using? My Mac has libexpat.1.5.2.dylib in /usr/lib/, and I'm assuming it's using that. It looks like Ubuntu 14.04 uses version 2.1.0. I tried brew install expat, which gave me version 2.1, but I still got the same error. I don't have any issues on my Linux machines.

Now I'll admit I don't really understand LibExpat.jl, but here's something that does concern me a little. In streaming.jl, it looks like the elements are passed around as a String, then we are making the assumption that it is really a UTF8String and treating it as such in parse_tag(). Isn't this asking for trouble?

Since we're the ones introducing the type conflict, I think the bug is in OpenStreetMap.jl not LibExpat.jl. If we want a robust speed up, I think we need to get LibExpat.jl to be able to guarantee that its output is a UTF8String, before we can treat it as such in all cases.

I think that robustness needs to be the top priority, even though I want to parse files 25% faster. This is a pretty serious bug for anyone who shares my situation (they won't be able to read OSM data), and I don't want this to be a new user's first experience with OpenStreetMap.jl.

It seems like the options are:

Use String
Get LibExpat to use UTF8String (which will push the bug into their court)
Use UTF8String and provide (or request?) a conversion from ASCIIString

Thoughts? I'm leaning towards Option 1 until Option 2 becomes available.

yeesian commented 9 years ago

I'll lean towards using String for now too. We can optimize for speed later

garborg commented 9 years ago

That sounds good. For the record, though, there shouldn't be anything stopping us from changing the code right back once the issue disappears:

I don't think any of our code is the issue, it seems like a configuration issue or that somehow Ted's machine but not mine or Travis is triggering the tuple/convert bug that's been messing with 0.3 and 0.4 for so long.

@tedsteiner The line you linked to in LibExpat.jl is not a problem -- there's a ::String annotation, but the function returns either an ASCIIString or a UTF8String and the annotation does not throw an error or cause a conversion -- it just passes them along unchanged after checking that they are Strings, which they are.

tedsteiner commented 9 years ago

@garborg Yes, it's not a problem for LibExpat.jl, but it is a problem for us, right? Because LibExpat is happy to work with either ASCIIString or UTF8String, but we're only content when it delivers an ASCIIString. It seems risky for us to assume it's a UTF8String coming through, when it could be either.

garborg commented 9 years ago

We're forcing conversion from ASCIIString to UTF8String, ala,

type A
       x::Int
end

A(1.0) # -> A(1)

which is not ideal for files that will have only ASCIIStrings, but better than keeping it at String because of how that keeps all our downstream methods from inlining specialized methods during compilation (maybe could cause type instability, too?).

The fact that the convert function is broken for you suggests a known scary bug in Base Julia where using a library or user code that creates a pair or a tuple (in certain, but legitimate ways) can magically break convert: https://github.com/JuliaLang/julia/issues/8818 https://github.com/JuliaLang/julia/issues/8631

Here's my package status running off a clean install of OpenStreetMap -- older versions of some packages, like Color, could trigger the bug, so if we're lucky, you just have an old version of one of these packages:

julia> Pkg.status()
1 required packages:
 - OpenStreetMap                 0.7.0+             master
16 additional packages:
 - BinDeps                       0.3.7
 - Cairo                         0.2.21
 - Color                         0.3.15
 - Compat                        0.2.9
 - DataStructures                0.3.5
 - FixedPointNumbers             0.0.6
 - Graphs                        0.5.0
 - Homebrew                      0.1.13
 - IniFile                       0.2.4
 - JSON                          0.4.0
 - LibExpat                      0.0.4              pinned.3ee8e669.tmp # I also tested against 0.0.6
 - LightXML                      0.1.9
 - SHA                           0.0.3
 - Tk                            0.2.16
 - URIParser                     0.0.3
 - Winston                       0.11.7

Otherwise, I give up, and perhaps we can switch back after it fixes itself someday, or perhaps getting BinDeps up for LibExpat would be a good side project for one of us.

tedsteiner commented 9 years ago

Yes! That's it! I had Color pinned at an old version to try to fix that devil bug from a long time ago. I went through and cleaned out some other packages I don't use anymore, either, and now I can parse with UTF8String.

garborg commented 9 years ago

Awesome! That is a really nasty bug, luckily people are chipping away at it, and fingers crossed it's fixed completely soon. Glad it wasn't indicative of a LibExpat issue that we'd have to keep worrying about. Sorry it took me so long to think of the fix.

tedsteiner / OpenStreetMap.jl

[WIP] support streaming osm files through libexpat #55