seomoz / url-cpp

C++ bindings for url parsing and sanitization
MIT License
19 stars 11 forks source link

BIG-3884 - Port PSL #25

Closed dlecocq closed 8 years ago

dlecocq commented 8 years ago

Hacked this up over the weekend.

The interpretation of url-py, moz-util-py and this now all differ when asking for the PLD of a TLD:

# This returns the TLD, PLD, FQDN in reversed form
>>> matcher.get_all('co.uk')
('uk.co', 'uk.co', 'uk.co')
>>> URL.parse('http://co.uk/').pld
'co.uk'
>>> URL.parse('http://co.uk/').tld
'uk'

In this implementation, however, such a request raises an exception:

PSL psl = ...;

# This doesn't raise an exception
psl.getTLD("co.uk");

# This /does/ raise an exception
psl.getPLD("co.uk");

That was loosely based on the expectations set out in the tests listed on the PSL site. Those tests would have us find null in exactly this case. I don't know what the most idiomatic thing to do in C++ for that case is, so until we figure that out, it throws an exception.

Oh, also, it works for both punycoded and unpunycoded forms without doing any transformations.

@b4hand @lindseyreno @tammybailey @neilmb

dlecocq commented 8 years ago

Reupped

b4hand commented 8 years ago

LGTM

dlecocq commented 8 years ago

I guess my main question here is -- is raising an exception when there's no PLD the right thing to do?

b4hand commented 8 years ago

To be honest, I'm not sure. It's not very exceptional, and in C++ they don't lean on exceptions like in Python. However, I'm not sure what a better interface is. I don't like returning an empty string, and changing it to return by pointer is worse as well. If only we had the maybe monad. :)

dlecocq commented 8 years ago

Sounds like this is a question for @martin-seomoz , perhaps. Maybe we have a constant Url::PSL::NotFound that we return (like std::string::npos). One situation that I'd like to avoid is:

// Neither really has a PLD, but they shouldn't have the _same_ PLD
psl.getPLD("com") == psl.getPLD("org")
b4hand commented 8 years ago

Unless you make it an entirely different type from std::string. There's no way to provide that guarantee. There's no string value such that is equivalent to NaN.