metaodi / osmapi

Python wrapper for the OpenStreetMap API
http://osmapi.metaodi.ch/
GNU General Public License v3.0

too many elements cause 414 error #62

Open jose1711 opened 8 years ago

jose1711 commented 8 years ago

When attempting to download a large number of elements (say, using the WaysGet method), the request ends in 'Request-URI Too Long'. It would be nice if osmapi could handle this by splitting the request into chunks.

austinhartzheim commented 7 years ago

@jose1711 Could you provide an example Way ID that causes this error to occur? I'd be interested in looking into this more if @metaodi thinks we should support this feature.

jose1711 commented 7 years ago

well... it's really just a looooong list of ways to download that triggers the error, like wayid1, wayid2, ... wayid999999

metaodi commented 7 years ago

@austinhartzheim feel free to look into that. I think we need a way to stop at some point to avoid an endless loop. Maybe we can use this issue to discuss possible solutions. Do you already have an idea?

In general I think it's good to provide this kind of abstraction, so that a consumer of osmapi doesn't have to care about URL length limits. Something like a generator might come in handy here. I've seen something similar already in the OAI-PMH client implementation of pyoai. Let me know if you want to discuss this in more detail.

austinhartzheim commented 7 years ago

Excellent. I'm busy with final projects/exams at my university right now but I should have time in late December. If someone else is interested in working on this issue before then, feel free to take it.

austinhartzheim commented 7 years ago

Root Cause

I've been looking into this issue and it seems that the URI length limit is not defined in the API server software. Rather, I believe that the limit is imposed by the Apache web server itself. The length of the HTTP request line appears to be the limiting factor, and Apache limits it to 8190 bytes by default (the LimitRequestLine directive).

This is the default value, which has not been set specifically on the servers. (If we wanted to pursue having the value set explicitly on the servers rather than relying on the default, I believe this Chef file would be the location to do it).

Experimental Verification

The following code shows that a request line of 8190 bytes gives the expected result whereas a request line of 8195 bytes causes the 414 error we are addressing in this issue:

len('GET /api/0.6/waysways=') + len(','.join([str(x) for x in range(1, 1854)])) + len(' HTTP/1.1\r\n')  # 8190
len('GET /api/0.6/waysways=') + len(','.join([str(x) for x in range(1, 1855)])) + len(' HTTP/1.1\r\n')  # 8195

api.WaysGet(range(1,1854))  # 404 error - expected
api.WaysGet(range(1,1855))  # 414 error - not expected

Possible Solutions

Here are some of the most likely solutions.

Discussion

I'm personally leaning towards hardcoding a URI length limit constant, with or without trying to standardize the limit. I believe that the efficiency gains of this approach may be significant. Furthermore, I do not think it is likely that the length limit will be decreased in the future.

I'm interested in hearing your thoughts or alternate solutions.

metaodi commented 7 years ago

@austinhartzheim thank you very much for this very thorough analysis of the problem at hand.

I have a few things to add:

All these points lead me to the conclusion that I'd prefer a limit with a good default value that a user of osmapi can override (e.g. in the constructor). If the limit is reached, another request is sent to the OSM API with the remaining items. The results are then put back together and returned to the consumer as "one", so that the whole process is transparent from a user's perspective (i.e. they don't notice it).
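To make the idea concrete, here is a minimal sketch of splitting the IDs so that each request line stays under a configurable limit and then merging the partial results. The names `chunk_ids` and `get_in_chunks` are hypothetical and not part of osmapi, and the sketch assumes that `fetch` is a callable such as `api.WaysGet` that returns a dict keyed by element ID.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Assumption: Apache's default request line limit, as found earlier in this thread.
DEFAULT_MAX_REQUEST_LINE = 8190


def chunk_ids(
    ids: Iterable[int],
    prefix: str = "GET /api/0.6/ways?ways=",
    suffix: str = " HTTP/1.1\r\n",
    max_len: int = DEFAULT_MAX_REQUEST_LINE,
) -> Iterator[List[int]]:
    """Yield lists of IDs whose comma-joined form keeps the request line within max_len."""
    budget = max_len - len(prefix) - len(suffix)
    chunk: List[int] = []
    used = 0  # length of ",".join(chunk) so far
    for item in ids:
        text = str(item)
        if len(text) > budget:
            raise ValueError("a single ID does not fit into the request line")
        extra = len(text) + (1 if chunk else 0)  # +1 for the separating comma
        if used + extra > budget:
            yield chunk
            chunk, used, extra = [], 0, len(text)
        chunk.append(item)
        used += extra
    if chunk:
        yield chunk


def get_in_chunks(
    fetch: Callable[[List[int]], Dict[int, dict]],
    ids: Iterable[int],
    max_len: int = DEFAULT_MAX_REQUEST_LINE,
) -> Dict[int, dict]:
    """Call fetch once per chunk and merge the partial dicts into one result."""
    merged: Dict[int, dict] = {}
    for chunk in chunk_ids(ids, max_len=max_len):
        merged.update(fetch(chunk))
    return merged
```

With something like this, a call such as `get_in_chunks(api.WaysGet, way_ids)` would look like a single request to the consumer, and `max_len` could be exposed as a constructor option.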

austinhartzheim commented 7 years ago

I like the idea of retrying the request if we see a 414 error.

I think a good strategy would be to start at ~8000 bytes. Upon encountering a 414 error, we divide that number in half and retry. And if we encounter another 414 error, we divide it in half again to ~2000 bytes. After that, we raise an exception if the request is not successful.

The reason for starting at 8000 is that RFC 7230 recommends that servers support request lines of at least 8000 bytes.

The reason for ending at 2000 is that this is what browsers support, so almost every server (unless configured otherwise) is likely to support it.

Also, we can add a configuration option to override the default settings. I'm considering setting the number of retries to zero if that is the case (or perhaps we can make that configurable as well).
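A rough sketch of the halving strategy, to make the proposal easier to discuss. Everything here is hypothetical: `UriTooLongError` stands in for however the client ends up reporting a 414, `fetch` is assumed to return a dict keyed by ID, and the limit is treated as an approximate byte budget for the ID list only.

```python
class UriTooLongError(Exception):
    """Hypothetical stand-in for an HTTP 414 response."""


def split_ids(ids, limit):
    """Group IDs so each comma-joined group stays below roughly `limit` bytes."""
    chunks, current, size = [], [], 0
    for item in ids:
        needed = len(str(item)) + 1  # +1 for the comma separator
        if current and size + needed > limit:
            chunks.append(current)
            current, size = [], 0
        current.append(item)
        size += needed
    if current:
        chunks.append(current)
    return chunks


def fetch_with_halving(fetch, ids, start_limit=8000, retries=2):
    """Try start_limit first; on a 414, halve the limit and start over."""
    ids = list(ids)
    limit = start_limit
    for attempt in range(retries + 1):
        try:
            result = {}
            for chunk in split_ids(ids, limit):
                result.update(fetch(chunk))
            return result
        except UriTooLongError:
            if attempt == retries:
                raise  # still too long at the smallest limit, give up
            limit //= 2  # 8000 -> 4000 -> 2000 with the defaults
```

This version restarts from the first chunk after each halving, which is simple but repeats requests that already succeeded. Passing `retries=0` gives the "no retries when the user overrides the limit" behaviour.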


Also, you mentioned using a generator. Do you want the methods to return a generator instead or should we collect all the results and return them as a list?

metaodi commented 7 years ago

I stumbled upon this library for retrying, which might be handy for this use case. About the generators: I quite like the idea of returning generators when we return multiple items.
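For illustration, a generator-based variant could look roughly like the sketch below. `iter_elements` and `batch_size` are made-up names, batching is by item count rather than by URL length to keep the example short, and `fetch` is again assumed to return a dict keyed by ID.

```python
from itertools import islice


def iter_elements(fetch, ids, batch_size=500):
    """Lazily yield (id, data) pairs, requesting one batch at a time."""
    iterator = iter(ids)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        # The next request is only sent once the consumer has used up this batch.
        yield from fetch(batch).items()
```

A consumer that stops iterating early never triggers the remaining requests, which is the main difference from collecting everything into a list first.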