seomoz / url-cpp

C++ bindings for url parsing and sanitization
MIT License
19 stars 11 forks source link

Handle javascript, mailto, and friends #23

Closed dlecocq closed 8 years ago

dlecocq commented 8 years ago

This brings url-cpp closer to the behavior of Python's urlparse, in the handling of three classes of schemes:

Before this PR:

>>> URL.parse('http://moz.com/').relative('javascript:console.log("hello")').utf8
'javascript://moz.com/console.log("hello")'
>>> URL.parse('javascript:console.log("hello");console.log("world")').path
'console.log("hello")'
>>> URL.parse('javascript:console.log("hello");console.log("world")').params
'console.log("world")'

With this PR (same as urlparse interpretation):

>>> URL.parse('http://moz.com/').relative('javascript:console.log("hello")').utf8
'javascript:console.log("hello")'
>>> URL.parse('javascript:console.log("hello");console.log("world")').path
'console.log("hello");console.log("world")'
>>> URL.parse('javascript:console.log("hello");console.log("world")').params
''

@b4hand @neilmb @lindseyreno

b4hand commented 8 years ago

LGTM

dlecocq commented 8 years ago

The url-py behavior is such that relative comes with an implicit abspath (in fact, it's urljoin that does this operation, so url-py had no control over it). In order to keep the expectations of relative and relative_to the same, url-py adds in an extra abspath implicit to relative_to (https://github.com/seomoz/url-py/blob/0215c7ca698c3f9bcbb7f6a5c3e290264a8bb784/url/url.pyx#L159)

However, these last few commits are to address the fact that the existing url-cpp implementation would erroneously promote path -> /path. We didn't witness this before because we've predominately focused on what the URL looks like when turned back into a string, which hides this issue. But, witness (while trying to update the url-cpp submodule in url-py):

>>> URL.parse('path').abspath().path
'/path'
>>> URL.parse('path').abspath().utf8
'/path'

Or, in the test case I was working on:

>>> URL.parse('javascript:console.log("hello")').relative_to('http://moz.com/').utf8
'javascript:/console.log("hello")'

@b4hand @neilmb -- do you want to take another look at the last four commits?

b4hand commented 8 years ago

LGTM