grote / osm2gtfs

Turn OpenStreetMap data and schedule information into GTFS
GNU General Public License v3.0
99 stars 31 forks source link

OSM query improvements #2

Closed pantierra closed 8 years ago

pantierra commented 8 years ago

Hi,

I think the current bounding box approach is not ideal. Transit networks can overlap and do not end necessarily at squared boxes.

I'd suggest to run a query to obtain relations with

(and probably maintain the bounding box optional to just make the overpass api query more performant).

Does this make sense? I'd happy to provide a pull request for this. But I also think it's best to get feedback before. Thanks!

grote commented 8 years ago

Thanks for creating an issue and discussing first before starting to code! :)

You definitively have a point here. However, in my area, the network tag is not properly maintained and would need to be added to all routes first. Do you have an issue with the current bounding box approach in your area?

Like you say, we should still keep a bounding box for performance reasons. How about allowing to (optionally) also add a network tag to the queries?

The public_transport:version should also be optional since it is not always well maintained.

pantierra commented 8 years ago

Generally, I think we should engage people towards using properly tagging and sticking to standards.

I'm fine with using the network tag optionally on queries. This works for the problem just as good as doing it the other way around.

Although the public_transport:version=2 I suggest to include. Historically, In OSM there are two data structures for public transport (http://wiki.openstreetmap.org/wiki/Key:public_transport:version) and a lot (!) of inconsistencies. The schema version two is an necessary and collaborative effort to create a standard on how to map public transport om OSM. I don't think it's a problem to include it to existing relations. Especially because JOSM needs it to create and verify public transport data, it should be on the relations, if not they are likely to have a non-standard structure and would likely cause such scripts to fail.

grote commented 8 years ago

Alright, do you want to work on adding the optional network tag to the queries?

I agree that we should aim at making public_transport:version=2 mandatory, but I doing this now would make this script stop working for me. So unless there's a way to easily batch-edit all my routes, I think it should be at least possible to exclude this from the query.

Also when people start playing with the script, they might have a similar data quality situation and want to assess its feasibility for their use-case, without editing all their routes manually before they can do that. So being able to at least turn it off is probably a good idea.

pantierra commented 8 years ago

Yes, I'm happy to provide a patch. But probably this not going to happen within the next days, because of the OSM conference marathon: State of the Map and HOT Summit next week. If anybody else works on it before, that's also good.

Don't worry, with JOSM, it's very easy to add a tag to a bunch of relations at once.

We could also do a little writeup on how people can assure to have good public transit data in OSM. That would help people to improve data quality easily and we could rely on a common ground of data structure for the script.

pantierra commented 8 years ago

I observed a hard-coded backlist in code (see #11) and this seems to me to be a good example of why we want to improve the queries, not only by selecting by bounding box.

I think we can be even more flexible and introduce in the query part of the configuration file (see #3) a dictionary of tags to respect in the query. This then could be network=SOMETHING or even some custom tagging like blacklisted=yes or whatever people use to select the routes they need.

So, I'd look like this, with the bbox being required and all tags to be optional.

{
    "query": {
        "bbox": {
            "e": "-48.2711",
            "n": "-27.2155",
            "s": "-27.9410",
            "w": "-49.0155"
        }
        "tags": {
            "public_transport:version": "2",
            "network": "BR-Floripa"
        }
    }
}
grote commented 8 years ago

I like the idea with the query tag dictionary.

pantierra commented 8 years ago

Just a note: As we already have a case with blacklisted routes (#11), which have a fixme=* tag, we should also consider on including tags to exclude. Or - to keep it simple - just exclude by default all routes that have such a tag.

grote commented 8 years ago

Let's keep it simple and just hardcode the fixme tag. What we could do though is to not exclude by fixme from the query, but include those routes in the query and print warnings for them and say that they were not included, so users of the script know which routes to fix.

jamescr commented 8 years ago

In my case I'm querying train route (that's why route_type is a parameter in some methods).

Just to be sure, adding the "route_type" key/value to the config[query][tags] should work?

pantierra commented 8 years ago

Yes, the idea is to allow all kinds of tags to use for the query and not being limited to any special use case. Once it's implemented it will allow you to query for train routes as well.

grote commented 8 years ago

Maybe it would be a good idea that you leave a message in an issue that you start working on to prevent that two people work on the same issue at the same time.

pantierra commented 8 years ago

I haven't started yet. It looks to be easy to do. I can do it later or tomorrow. If anybody else wants to start before. Just tell here. Thanks!

jamescr commented 8 years ago

I'm going to start with this in a branch called "issue-2" on my fork.

jamescr commented 8 years ago

I haven't finish with this but can't continue right now, I committed to https://github.com/jamescr/osm2gtfs/tree/issue-2 which contains:

note: the value for the query tag "routetype" is passed as param in the querying methods. I haven't changed that by now just to "have" some key/value that reduce the returned relations in case that the tags param are empty. What about if we use the "publictransport:version" : "2" key/value for this reducing results purpose?

jamescr commented 8 years ago

I have updated the methods get_route_masters and get_stops_of_route. I test it with fenix.json config file with no elements on the tags dictionary because that is how it was working before.

grote commented 8 years ago

@xamanu has added tags to all routes, so you should get the same result with these query tags:

        "tags": {
            "public_transport:version": "2",
            "network": "Sim",
            "route": "bus"
        }
jamescr commented 8 years ago

Oh really, That's right, I'll test it with those tags.

jamescr commented 8 years ago

Great, it's working nice :smile:

I'll fix/improve some minor things and create the Pull Request.

pantierra commented 8 years ago

This is related to the refactoring of OsmHelper #31

pantierra commented 8 years ago

I had a look on the work of @jamescr and he basically did two (very good) things. which I hope get into the the code soon:

grote commented 8 years ago

XML queries are deprecated? Too bad. I found them much easier to read and understand :(

pantierra commented 8 years ago

I expected them to be deprecated, but I actually see there is no formal message anywhere about this. So no, they are not deprecated! Sorry for the confusion. Probably it's just a personal preference then. For me QL is much easier to understand.

pantierra commented 8 years ago

As a next step, I'd like to think about doing better queries. Therefore I made a list of queries that are in code currently and then we can think about being more efficient on them:

  1. Get all routes (variants) from bbox with some tags
  2. Get all route masters inside bbox
  3. Get all members of found route masters
  4. Get stops of specific route by relation id
  5. Get one route for refresh

Query number 4 runs repeatedly and is responsibly for the "too many requests" exception I was facing so many times. I think we can start thinking about maybe having one query for 1-4 all together and even the 5th would probably only be a variant with another parameter. Let me get my head around it and come up with a proposal.

And then there is another query, that hits the Overpass API very often. But this is very Fenix specific and I have no real idea on how to optimize this:

grote commented 8 years ago

If you could really do 1-4 in one single query that would be awesome! Otherwise you also could do:

  1. Get all routes (variants and masters) within bbox with tags
  2. Get all stops in bbox with tags

It would probably be best to do this right within the existing fenix creators with minimal changes again.

grote commented 8 years ago

There should probably be some sort of switch for the stop name finder, so it can be turned off easily. The stops will just have the default no-name then.

pantierra commented 8 years ago

How does this look like?

You always have to click on the Run tab on top, ... then wait ... the data can be inspected through the Data and Map tab

We could also do nice things like this:

grote commented 8 years ago

Unbelievable this query magic! You are clearly a magician! :smile_cat:

Can't wait to see those implemented! When this is done and you use this language, it would be nice to have a comment that explains what each line does in case somebody else needs to maintain the query at some point.

Looks like this returns everything we need, so can even throw the OsmApi dependency away.

I don't know what's the best way to implement this. Maybe you prefer to implement the default creators with that and I'll adapt Fenix to that?

pantierra commented 8 years ago

Good question.

The more I think about it the more it appears to me to be the root of everything. We probably wouldn't be able to implement the query improvements like this without implementing an overhaul of the data structure #30 and the refactoring the OsmHelper #31. On the other side doing refactoring of data structure and OsmHelper without optimization of the query would be tired as I'm not able to test it thoroughly with the current queries. Simply using cached stuff is not a solution when working on these parts.

Doing all together would become a bigger pull request, which we generally don't want, but doing it the Fenix way and not generic would cost us a lot of time and nerves (sorry, bloody German saying here). So I'm kind of opting to do the three things in one rush, test it properly with Fenix and then merge it.

What do you think? Any other idea? Suggestions?

pantierra commented 8 years ago

ok. I talked with @jamescr today and we think we found a way to keep on doing it step by step. So we propose to do one first pull request with the following:

grote commented 8 years ago

Sounds good. If you can at all split up those 4 points further into separate PRs this would be even better.