john-kurkowski / tldextract

Accurately separates a URL’s subdomain, domain, and public suffix, using the Public Suffix List (PSL).
BSD 3-Clause "New" or "Revised" License

New snapshot? #282

Open steve-mavens opened 1 year ago

steve-mavens commented 1 year ago

Apologies if I've failed to find this in the docs, but is there any official cadence for how often the PSL snapshot is updated and a new release made?

We tripped over this because of a material change under .museum, so our online and offline tests now get different results for one of our test cases, which happens to be in there.

Obviously our test case is our problem (and maybe offline tests of code that uses tldextract are not a great idea in the first place). But it would be useful to know if it's our problem for a while, or if you were due to update the snapshot fairly soon anyway.

john-kurkowski commented 1 year ago

There's no cadence. It's easy for me to update, so I just did in 6f45fed6c56f377e8a9a77ce43c50712281940d8.

$ curl https://publicsuffix.org/list/public_suffix_list.dat > tldextract/.tld_set_snapshot

john-kurkowski commented 1 year ago

Some possible solutions.

  1. This project publishes a new release whenever the upstream list is updated.
  2. Vendor your copy of the suffix list and point to it in your tests (and/or application) via the suffix_list_urls or cache_dir kwargs; avoid diverging tests online vs. offline.
  3. Test against suffixes that probably won't change in the upstream list, like example.com or example.probablyneverasuffix.
  4. Decouple from testing this library; assume it works; stub it in your tests.
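
Option 2 could be sketched roughly as follows. This assumes the `suffix_list_urls`, `cache_dir`, and `fallback_to_snapshot` keyword arguments of `tldextract.TLDExtract`; the file name, the helper names, and the three rules in the list are made up for illustration.

```python
from pathlib import Path

# A tiny, made-up suffix list in the PSL's text format ("//" starts a comment).
# The entries are illustrative; copy real rules from the PSL as your tests need.
VENDORED_RULES = """\
// Vendored for tests; not the full Public Suffix List.
com
uk
co.uk
"""


def write_vendored_list(directory: Path) -> Path:
    """Write the vendored rules to a file (hypothetical helper and file name)."""
    path = directory / "test_suffix_list.dat"
    path.write_text(VENDORED_RULES, encoding="utf-8")
    return path


def offline_extractor(suffix_file: Path):
    """Build an extractor that reads only the vendored file."""
    # Third-party dependency, imported lazily so the helper above stays stdlib-only.
    import tldextract

    return tldextract.TLDExtract(
        # A file:// URL keeps the lookup off the network.
        suffix_list_urls=[suffix_file.resolve().as_uri()],
        # No on-disk cache, so nothing left over from an earlier run can leak in.
        cache_dir=None,
        # Fail loudly instead of silently using the bundled snapshot.
        fallback_to_snapshot=False,
    )
```

In a test, `offline_extractor(write_vendored_list(tmp_path))` would then give the same answers online and offline, because the result depends only on the checked-in rules.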

steve-mavens commented 1 year ago

Thanks very much!

(1) sounds like unwarranted effort for you (and might make the changelog a bit spammy?)

(2) is probably what I should do, or a short suffix list file would cover these tests.

(3) Turns out I'm not a perfect judge of what's probable! The case was chosen as a non-ASCII second-level domain listed in the PSL, and .museum seemed stable at the time. Until now that test file was unchanged since written in 2020, so it's not volatile enough to be a real problem.

(4) Would also work, but even when I have that fully isolated test I usually want the integration test as well, so it's a question of whether I can get away with that integration test being offline, or whether I need to be online in order to test that my understanding of tldextract is correct.
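
For reference, a fully isolated test along the lines of (4) might look like the sketch below. `is_same_site` is a hypothetical function under test, and the stub only imitates the `subdomain`, `domain`, and `suffix` attributes of a tldextract result; it is not the library's own class.

```python
from types import SimpleNamespace
from unittest.mock import Mock


def is_same_site(url_a, url_b, extract):
    """Hypothetical function under test: compare registrable parts of two URLs."""
    a, b = extract(url_a), extract(url_b)
    return (a.domain, a.suffix) == (b.domain, b.suffix)


def fake_result(domain, suffix, subdomain=""):
    """Stand-in for an extraction result, carrying only the attributes used above."""
    return SimpleNamespace(subdomain=subdomain, domain=domain, suffix=suffix)


def test_same_site_with_stubbed_extractor():
    # The stub returns canned results in call order; no suffix list is consulted.
    stub = Mock(side_effect=[
        fake_result("example", "com", "www"),
        fake_result("example", "com", "blog"),
    ])
    assert is_same_site("https://www.example.com", "https://blog.example.com", stub)
    assert stub.call_count == 2
```

The catch is the one described above: this proves nothing about whether the canned results match what tldextract would really return.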

Anyway I think in some sense (2) amounts to saying, "tldextract can be its own fake". It's isolatable enough, and it can be configured with any invented cases needed. So if I do that then it's kind of a semantic argument whether I have a true unit test of my function against a fake I didn't write myself, or an integration test of my function + tldextract with a lower-level dependency (the PSL) stubbed. My team doesn't do enough formal test design for that distinction to matter.

Btw, before I used tldextract I had a checked-in copy of the PSL and my own parser. My commit "Function to identify public suffix, from Mozilla's list of rules" was on 2011-02-11. So if I'd worked on other features for another 17 days I guess I could have saved that effort and used tldextract from the start!

steve-mavens commented 1 year ago

Oh, and I think another possible solution is to run a line of code to let tldextract get a fresh PSL in between installing the test environment (which obviously is an online operation) and running the offline tests (with pytest-socket to enforce offline-ness). I suppose arguably this is just (2) again, with the cache_dir arg rather than the suffix_list_urls arg. Or I could make the PSL's URL an exception to pytest-socket.
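
A rough sketch of that warm-up step, assuming `TLDExtract.update(fetch_now=True)` and the `cache_dir` keyword argument. The `PSL_TEST_CACHE` variable and the helper names are hypothetical. As far as I understand it, the cache entry depends on the constructor arguments, so building the extractor through one factory in both phases is the cautious choice.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def psl_cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Directory shared by the warm-up step and the offline tests.

    PSL_TEST_CACHE is a made-up variable name for this sketch.
    """
    env = os.environ if env is None else env
    return Path(env.get("PSL_TEST_CACHE", ".psl-test-cache"))


def make_extractor(cache_dir: Path):
    """Single factory used by both the warm-up step and the tests."""
    # Third-party dependency, imported lazily so psl_cache_dir stays stdlib-only.
    import tldextract

    return tldextract.TLDExtract(
        cache_dir=str(cache_dir),
        # Raise if the list cannot be fetched rather than using the snapshot.
        fallback_to_snapshot=False,
    )


def warm_cache(cache_dir: Path) -> None:
    """Run once while the network is still available, before the offline tests."""
    make_extractor(cache_dir).update(fetch_now=True)
```

The environment setup would call `warm_cache(psl_cache_dir())` once, and the offline tests would call `make_extractor(psl_cache_dir())`, which should then be served from the cache without opening a socket.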

john-kurkowski commented 1 year ago

> So if I'd worked on other features for another 17 days I guess I could have saved that effort and used tldextract from the start!

😊

I like your breakdown. Yeah, there are tradeoffs in all directions, depending on how robust and formal you want your test suite to be. Your last suggestion, with the test suite continually updating the PSL, reminds me of this article on verified fakes.

steve-mavens commented 1 year ago

Yes, sounds good. I've also seen (but IIRC never implemented) a variant on that where you put an interception layer in, and generate stub responses by capturing the responses from a run of the live version of the test. So instead of the verified fake you have an "updateable stub". I say never implemented: many times I've set a breakpoint and dumped some HTTP response to disk for use as a test case, but I have never properly automated that. There's probably a framework for it.
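
A minimal stdlib-only sketch of what I mean by an "updateable stub". Everything here is hypothetical: `replayable`, the JSON store, and the `refresh` flag are illustrative names, not part of tldextract or any framework.

```python
import json
from pathlib import Path
from typing import Callable


def replayable(fetch: Callable[[str], str], store: Path,
               refresh: bool = False) -> Callable[[str], str]:
    """Wrap `fetch` so live runs record responses and offline runs replay them."""

    def wrapper(url: str) -> str:
        # Load whatever has been recorded so far (an empty mapping on first use).
        recordings = json.loads(store.read_text(encoding="utf-8")) if store.exists() else {}
        if refresh:
            # Live run: call the real fetcher and overwrite the stored response.
            recordings[url] = fetch(url)
            store.write_text(json.dumps(recordings, indent=2, sort_keys=True),
                             encoding="utf-8")
        elif url not in recordings:
            # Offline run with nothing recorded: fail rather than touch the network.
            raise LookupError(f"no recorded response for {url}; rerun with refresh=True")
        return recordings[url]

    return wrapper
```

Running the suite with `refresh=True` now and then updates the stub from the live service; the checked-in JSON file is what the offline runs see.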