tedsteiner / OpenStreetMap.jl

Julia OpenStreetMap Package
Other
52 stars 18 forks source link

Unique user-added node IDs #65

Closed tedsteiner closed 9 years ago

tedsteiner commented 9 years ago

There are some times when a user might want to add a new node for a specific location, and others when functions must add new nodes (most notably cropMap!, but maybe others, too). I originally implemented the function to handle this, addNewNode(location), to just count up from 1 until it finds an available node ID. This is obviously inefficient, but I've just never really cared that much. But another problem arises if you try to combine maps - every cropped map has a node "1." Is there an easier way to have unique node IDs?

One thought I had was to do something like hash((location.east,location.north)). I'm not sure whether we'd need to use the second term or not. The only issue I can think of is that if you want to interact with these IDs by name, they are many digits and harder to keep track of mentally.

yeesian commented 9 years ago

Hm I think here's where you'd want to tread carefully, because it'll be nice to eventually be able to re-export the nodes/maps/etc back to osm or shapefiles, and we'd want to be able to (i) distinguish between user-added versus original OSM IDs, and (ii) decide what to do about user-added information when re-exporting.

One way out (which seems reasonable), is to maintain our own set of IDs (which is why I'm still unsettled about https://github.com/tedsteiner/OpenStreetMap.jl/pull/58), running from 1 to N (where N = # of nodes). That way it is easy to add nodes (just push! to the existing list of nodes), and to query whether it was an original OSM node (just check whether there is a corresponding index in the ID->OSM vector).

I'm not sure how cropMap! and everything else works though. It might help to have a coherent approach -- e.g. thinking of it as having a dataframe of nodes (with OSM IDs as one of the columns, and maybe lat/lon as separate columns), where we're simply adding new nodes/records, with missing entries for the OSM IDs. Then cropMap etc are just filters on this dataframe.

tedsteiner commented 9 years ago

So, you probably won't like how things work currently, then. Any time a node is added, anywhere in the package, it is done by calling the addNewNode(nodes::Dict,loc::ENU) function. This function finds an open ID in nodes, puts loc there, and returns the new ID. The implementation inside is that it starts at ID 1 and counts up until it finds an opening. Simple, inefficient, but it's always worked fine.

Functions cropMap!() and interpolateRoute() need to add new nodes, so they do this by internally calling addNewNode() as needed. I'm not sure how much cropMap! is really affected by the inefficiency of adding a bunch of nodes, since it's just highways along the map edges, but I've noticed major slowdowns from interpolateRoute(), but this is only when I'm doing like 50, 50+ km routes, interpolating every 10 meters. Honestly, it hasn't really bothered me because the stuff I go on to use the route for takes a couple minutes, so this is far from my personal bottleneck.

I have changed addNewNode() on my local copy to take an optional argument start_value, which I default to hash(loc). I don't really understand hashing immutable types, but it seems to be consistent, not that it really matters. This gives a unique starting point for addNewNode, and it begins iterating up from there rather than from 1. This is much faster, and a little more unique, but the new IDs are long. It also won't change anyone's functionality, and if they set start_value=1, they get an identical result. Now we don't have the problem of every cropped map having nodes 1-50 in them.

I hear you on this idea about maintaining our own set of IDs. I don't think it's critical, since anyone who really cares can still re-load their original OSM data and check which nodes didn't exist in the original. But tracking it might be more convenient. Currently cropMap!() doesn't return anything. If it returned a list of new IDs added, as something optional you could hold on to, would that solve the problem? Then, if the user desires, they can keep their own list of non-original node IDs, and if we ever add some sort of map-merging feature, we can take in that list of non-OSM nodes and change those IDs if there are any conflicts.

yeesian commented 9 years ago

Mmm, I can see what you're saying, and I'm not as vexed about it as you might imagine -- feel free to continue adding functionality (and tests) if it helps with your work!

I'm just wondering longer-term about the way we'll want to deal with everything (by looking at the way things are done in Python/R), and offload as much of the data-mungling to DataFrames as possible. I'll roll up my sleeves one of these days to get them done, but school and research has caught up to me, so it'll take awhile.

garborg commented 9 years ago

Experimenting with using our own 1-N node indices to try using different graph types sounds interesting.

Ideas you've mentioned elsewhere about bringing in DataFrames to to simplify data analysis sound great, but it would be an extremely heavy dependency for this use case (even if DataFrames development was much farther along).

yeesian commented 9 years ago

Oh we don't have to bring in DataFrames anytime soon, haha, just suggesting it as a way of thinking about the way we organize the indices/data in OSM internally, so that

  1. we have coherent ways of resolving issues (such as these) as and when they arise, and
  2. it might make the conversion to anything else (in the future) easier

Although they are heavy dependencies, the introduction of DataFrames really shouldn't only be for this particular usecase, and should be part of a more general pattern of working with data, based on the lessons picked up from the GIS community in Python. In particular, I quote from the UrbanSim project:

We lean heavily on the PyData community to make our work easier - Pandas, IPython, and statsmodels are ubiquitous in this work. These Python libraries essentially replace the UrbanSim Dataset class, tools to read and write from other storage, and some of the statistical estimation previously implemented by UrbanSim.

This makes our task easier as we can focus on urban modeling and leave the infrastructure to the wider Python community. The Pandas library is the core of the new UrbanSim, which is an extremely popular data manipulation library with a large community providing support and a very helpful book.

Longer term, I'm thinking of OpenStreetMap not as a standalone package, but as a package for modeling and handling OSM data, as part of a larger processing pipeline for working with geospatial data in Julia:

It doesn't require over-engineering on the part of any individual package, just that we start thinking a little harder about the interoperability of julia packages -- and I think we really aren't too far away from performing operations that were traditionally only possible with a RDBMS (i.e. PostGRES + PostGIS + PgRouting) setup.

garborg commented 9 years ago

I like what you're saying about the ecosystem. Still not sold on using DataFrames for this even once we're using it elsewhere (even though as one of the DataFrames developers, I'd like to see them used all over ;)). I've got some more general ecosystem questions I'd like to hear your vision on. I'll open an issue in Turf.

tedsteiner commented 9 years ago

I went ahead an committed the change I mentioned earlier. It's pretty minor in the grand scheme of things, but I think it's better than what I had before.

And of course, I just noticed I forgot to actually rename the function...

tedsteiner commented 9 years ago

So I just noticed this, and I thought it was kind of interesting. The update I made led to 1 fewer line being covered by testing, according to Coveralls. When I looked into it, this was because addNewNode!() never needed to search for a different available node from the initial guess value. So that should be a big speedup!

Also, to clarify one of my earlier comments, there is no function interpolateRoute() in this package. That's in my own code, it didn't seem relevant enough to add it to the package. addNewNode!() is only used by two functions in OpenStreetMap.jl: cropMap! and findIntersectionClusters().

I also had to remove a couple pieces from the tests, where we were testing to see that specific added nodes had specific IDs (due to cropping). I don't think this is something we want to enforce or depend on, so I just removed them. It also seems to have disrupted some of the edge IDs in Network, so I commented out those couple lines of tests, as well. We should make sure function getEdges() works, but we shouldn't have anything actually dependent on specific IDs, so I think that test was also too specific.

These node IDs aren't guaranteed to be truly unique. I thought of maybe hashing the LatLonAlt position as another option to make them truly unique. But I think it solves the problem much better than the previous version, so I'm going to close this. If we ever need truly unique IDs we can open a new issue. The big downside there is that we then have to convert all OSM ids and maintain a mapping between the two.

I agree that the DataFrames do sound like an interesting option further down the line. Unfortunately I don't really have much to add to @yeesian's comments, as I don't have any experience with formal databases, but I'm open to learning more about them.

yeesian commented 9 years ago

Just a little sneak peek at what might possible: I think if we can move towards modelling geometries (see also: https://github.com/tedsteiner/OpenStreetMap.jl/issues/64), i.e.

there are functions in Turf (known as along) that allows you to interpolate within routes, among many other goodies. I have a preliminary port of the library here, but am catching up on research and classes in the meantime.