scrapy / w3lib

Python library of web-related functions
BSD 3-Clause "New" or "Revised" License

should `canonicalize_url` treat path parameters like query string parameters? #92

Open jvanasco opened 7 years ago

jvanasco commented 7 years ago

This is something of an edge case: very few websites use path parameters anymore, but some still do.

For those unfamiliar, they're available as `urlparse()[3]` or `urlparse().params`. The RFCs basically describe them as parameters specific to the last path segment; they can be key-value pairs or raw values.
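A minimal sketch of where the standard library puts these, using only `urllib.parse.urlparse`. The URLs are made up for illustration. Note that `urlparse` only splits parameters off the last path segment; a `;` in an earlier segment stays in `path`.

```python
from urllib.parse import urlparse

# Hypothetical URL with a path parameter on the last segment.
url = "http://example.com/catalog/item;jsessionid=ABC123?b=2&a=1"
parts = urlparse(url)

print(parts.path)    # /catalog/item
print(parts.params)  # jsessionid=ABC123
print(parts[3])      # jsessionid=ABC123 (same field, by index)
print(parts.query)   # b=2&a=1

# Parameters on an earlier segment are not split out by urlparse.
multi = urlparse("http://example.com/a;x=1/b;y=2")
print(multi.path)    # /a;x=1/b
print(multi.params)  # y=2
```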

Amazon, for example, used them from launch until the early 2000s to handle cookieless sessions and much of what now lives in query strings. A handful of Java servers also use them for sessions (e.g. JSESSIONID).

ghost commented 6 years ago

Wow, TIL there are path parameters with unspecified behaviour in URIs. Thanks! :)

So, I think the view here is that if these are not currently handled, then a PR to handle them would be great. But, as the document above points out, they are different from query parameters in a few ways. One of them is that, as they are still part of the path, it's assumed that order is important. So, I suppose it would be incorrect to create a canonicalised order for these parameters.
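To make the ordering point concrete, here is a rough sketch using only the standard library, not w3lib's actual implementation. The helper name `sort_query_keep_params` and the URL are hypothetical. It sorts the query string but passes the path parameters through untouched, on the assumption that their order is significant.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def sort_query_keep_params(url):
    """Sort query arguments; leave path parameters in their original order."""
    parts = urlparse(url)
    # Query arguments are treated as unordered, so sorting them is safe.
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    # parts.params is deliberately not touched: it is still part of the path.
    return urlunparse(parts._replace(query=urlencode(pairs)))


result = sort_query_keep_params("http://example.com/item;b=2;a=1?z=9&y=8")
print(result)  # http://example.com/item;b=2;a=1?y=8&z=9
```

This says nothing yet about escaping of the parameters, which would need its own rules.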

Also, it's worth researching what escaping/encoding rules are used for these parameters.

If this is something you're still interested in, perhaps you can propose a solution that best fits the standards / RFCs, as well as the limited real-world uses of this little-known standard?