harvard-lil / perma

Indelible links
411 stars 70 forks source link

Format submitted URLs with PreparedRequest.prepare_url instead of requests.utils.requote_uri #3443

Closed rebeccacremona closed 7 months ago

rebeccacremona commented 7 months ago

See ENG-493.

We have long since been running submitted URLs through requests.utils requote_uri before handing them off to the browser for capture.

When we began using Scoop and the Scoop API to to produce archives, that added another layer of URL validation: the Scoop API runs its own validation check before accepting the job.

This surfaced some interesting behavior: turns out that we have been passing URLs with [ and ] left un(percent)encoded, all this while.

That is generally recognized as invalid: rejected by curl, etc.... though surely some websites break the rules in querystrings, hashes, etc. And so, it is, quite reasonably, rejected by the validator used by the Scoop API.

I looked into: a) getting Perma's validator to reject, for parity b) getting Scoop's validator to accept, for parity c) why we are using requote_uri anyway d) why you can pass URLs with unencoded brackets into python requests and not have a problem

That investigation uncovered a more useful URL-formatting method inside requests, PreparedRequest.prepare_url. It calls requote_uri, but also does a bunch of other stuff, including work to handle unicode/punycode domains.

I compared the output of the two for 4+ million Perma Links, and examined the cases that differed. PreparedRequest.prepare_url seems unquestionably superior.

This PR arranges for Perma to attempt PreparedRequest.prepare_url, and fall back to requote_uri if that fails for any reason. I decided to try/except with a fallback to our current behavior because I don't, at this moment, want to change the way that URL validation failures look in Perma at all: that feels too invasive.

This is unlikely to materially affect any attempted captures:

But... by capturing my silly test example, http://example.com/angle-bracket=[, I came to realize that we have been setting the primary capture's URL, which is used for playback, to the vanilla submitted_url and not to ascii_safe_url, the actual URL handed to the capture engine and browser.

I believe that to be in error.... though I don't believe that, up until this point, any playbacks have been affected. (There is some Webrecorder/RWP fanciness involved that appears to have been taking care of it for us.) But, with this change, we do in fact need to be pointing playbacks at the ACTUAL url passed to Scoop, and so I updated the primary capture to:

                # create primary capture placeholder
                Capture(
                    link=link,
                    role='primary',
                    status='pending',
                    record_type='response',
                    #url=link.submitted_url,
                    url=link.ascii_safe_url,
                ).save()

I tested with a bunch of URLs, including our favorite https://jamalouki.net/موضة/آخر صيحات الموضة/نقشة الورود ستكسو إطلالتكِ بالكامل في الخريف المقبل, and didn't spot any problems.

While this could turn out to be a more extensive change than I'm currently aware of (though I really think, since I ran against comparisons against over 4 million links, it probably isn't)... I think we'll want to find that out by hearing about concrete examples that don't work from users, rather than wracking our brains trying to think of things to try.

This PR should a) stop the validation errors we are getting from Scoop, and b) possibly allow the successful capture of more URLs that users paste into Perma's input field... even if they are a little messed up.

codecov[bot] commented 7 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (7d7cbf8) 70.38% compared to head (065aa8d) 70.39%.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## develop #3443 +/- ## =========================================== + Coverage 70.38% 70.39% +0.01% =========================================== Files 52 52 Lines 6801 6804 +3 =========================================== + Hits 4787 4790 +3 Misses 2014 2014 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.