ukwa / ukwa-pywb

GNU General Public License v3.0

Playback issue with Man-on-a-beach #57

Open anjackson opened 4 years ago

anjackson commented 4 years ago

Having archived this site in WebRecorder: https://manonabeach.com

The WARCs are not quite playing back in our QA Wayback service. The WARC has URLs like:

WARC-Target-URI: https://www.youtube.com/get_video_info?html5=1&video_id=TuX6tqk9PJM&cpn=TezRkAGwLcq-9_EV&eurl=https%3A%2F%2Fmanonabeach.com%2F&el=embedded&hl=en_US&sts=18305&lact=82&c=WEB_EMBEDDED_PLAYER&cver=20200214&cplayer=UNIPLAYER&cbr=Chrome&cbrver=76.0.3809.136&cos=Windows&cosver=10.0&iv_load_policy=3&autoplay=1&width=640&height=360&ei=129KXpfNF8SoVIDVvfAP&iframe=1&embed_config=%7B%7D&co_rel=1&ancestor_origins=https%3A%2F%2Fmanonabeach.com

but proxy-mode playback 404s requesting:

https://www.youtube.com/get_video_info?html5=1&video_id=TuX6tqk9PJM&cpn=82PNlmzPAUrKN6wd&eurl=https%3A%2F%2Fmanonabeach.com%2F&el=embedded&hl=en_US&sts=18305&lact=161&c=WEB_EMBEDDED_PLAYER&cver=20200214&cplayer=UNIPLAYER&cbr=Chrome&cbrver=76.0.3809.136&cos=Windows&cosver=10.0&iv_load_policy=3&autoplay=1&width=640&height=360&ei=129KXpfNF8SoVIDVvfAP&iframe=1&embed_config=%7B%7D&co_rel=1

This might just be because our QA Wayback is a bit out of date compared with Pywb.

anjackson commented 4 years ago

Note the ancestor_origins=https%3A%2F%2Fmanonabeach.com bit. This seems to vary even between browsers.
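To see exactly which parameters differ between the archived URL and the one requested on playback, a small comparison helps. This is only a sketch using the Python standard library; `diff_query_params` is a hypothetical helper name, and the example URLs are shortened versions of the two quoted above.

```python
from urllib.parse import urlsplit, parse_qs


def diff_query_params(archived_url, requested_url):
    """Return {param: (archived_values, requested_values)} for every
    query parameter whose values differ, or that is missing on one side."""
    archived = parse_qs(urlsplit(archived_url).query, keep_blank_values=True)
    requested = parse_qs(urlsplit(requested_url).query, keep_blank_values=True)
    diff = {}
    for name in sorted(archived.keys() | requested.keys()):
        if archived.get(name) != requested.get(name):
            diff[name] = (archived.get(name), requested.get(name))
    return diff


# Shortened versions of the URLs from this issue (most parameters omitted).
archived = ("https://www.youtube.com/get_video_info?html5=1&video_id=TuX6tqk9PJM"
            "&cpn=TezRkAGwLcq-9_EV&lact=82"
            "&ancestor_origins=https%3A%2F%2Fmanonabeach.com")
requested = ("https://www.youtube.com/get_video_info?html5=1&video_id=TuX6tqk9PJM"
             "&cpn=82PNlmzPAUrKN6wd&lact=161")

for name, (in_warc, in_request) in diff_query_params(archived, requested).items():
    print(name, in_warc, in_request)
```

For the pair above this reports cpn, lact and ancestor_origins as the differing parameters, while video_id and html5 stay the same.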

ikreymer commented 4 years ago

This should be getting fuzzy matched.. but perhaps the fuzzy matching is not working with the outbackcdx configuration? This is with the latest release of outbackcdx, right?

anjackson commented 4 years ago

Yeah, latest OCDX 0.7.0 I think, but older pywb. I can probably try the newer pywb tomorrow.

anjackson commented 4 years ago

Okay, so latest ukwa-pywb:2.4.0-beta didn't help. Looking at the CDX queries, we see:

192.168.45.60 - - [28/Feb/2020:09:51:38 +0000] "GET /data-heritrix?url=https%3A//www.youtube.com/get_video_info%3Fhtml5%3D1%26video_id%3D7WqRVJjNGyY%26cpn%3DDyctYvUgxjzNF9ul%26eurl%3Dhttps%253A%252F%252Fmanonabeach.com%252F%26el%3Dembedded%26hl%3Den_US%26sts%3D18305%26lact%3D116%26c%3DWEB_EMBEDDED_PLAYER%26cver%3D20200214%26cplayer%3DUNIPLAYER%26cbr%3DChrome%26cbrver%3D76.0.3809.136%26cos%3DWindows%26cosver%3D10.0%26iv_load_policy%3D3%26autoplay%3D1%26width%3D640%26height%3D360%26ei%3DiG5KXsyiNfjmhAfD0bi4Dg%26iframe%3D1%26embed_config%3D%257B%257D%26co_rel%3D1&closest=20200217104425&sort=closest&limit=10 HTTP/1.1" 200 5 "-" "python-requests/2.22.0" "-"
192.168.45.60 - - [28/Feb/2020:09:51:38 +0000] "GET /data-heritrix?url=https%3A//www.youtube.com/get_video_info%3F&closest=&sort=closest HTTP/1.1" 200 5 "-" "python-requests/2.22.0" "-"

Presumably the second query is intended to be a prefix-based query that lists the different URLs with different query parameters? If so, it should use matchType=prefix, e.g.

http://cdx.api.wa.bl.uk/data-heritrix?matchType=prefix&url=https%3A//www.youtube.com/get_video_info%3F&closest=&sort=closest
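For illustration, here is a rough sketch of how such a query string could be assembled. The parameter names (url, closest, sort, limit, matchType) are the ones visible in the log lines and example above; `build_cdx_query` is a hypothetical helper, not pywb code, and it only builds the URL without sending any request.

```python
from urllib.parse import urlencode


def build_cdx_query(endpoint, url, prefix=False, closest="", limit=None):
    """Build a CDX query URL; add matchType=prefix when a prefix lookup is wanted."""
    params = {"url": url, "closest": closest, "sort": "closest"}
    if prefix:
        params["matchType"] = "prefix"
    if limit is not None:
        params["limit"] = str(limit)
    # urlencode percent-encodes the target URL so that its own '?' and '&'
    # are not mistaken for part of the CDX query string.
    return endpoint + "?" + urlencode(params)


print(build_cdx_query(
    "http://cdx.api.wa.bl.uk/data-heritrix",
    "https://www.youtube.com/get_video_info?",
    prefix=True,
))
```

Note that urlencode also escapes the slashes in the target URL (%2F), whereas the logged requests leave them as-is; both forms should decode to the same value.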

anjackson commented 4 years ago

Still hitting this. Page is requesting:

https://www.youtube.com/get_video_info?html5=1&video_id=4Bam5wej-ek&cpn=qNB7KTt3Nr-afJk1&eurl=https%3A%2F%2Fmanonabeach.com%2Fnorth-scotland%2Fsutherland%2Fsango-bay&el=embedded&hl=en_US&sts=18305&lact=36&c=WEB_EMBEDDED_PLAYER&cver=20200214&cplayer=UNIPLAYER&cbr=Chrome&cbrver=76.0.3809.146&cos=Windows&cosver=10.0&iv_load_policy=3&autoplay=1&width=640&height=360&ei=MAtNXvfyLoKCxgLBjIX4BA&iframe=1&embed_config=%7B%7D&co_rel=1

OutbackCDX contains:

com,youtube)/get_video_info?ancestor_origins=https://manonabeach.com&autoplay=1&c=web_embedded_player&cbr=chrome&cbrver=76.0.3809.146&co_rel=1&cos=windows&cosver=10.0&cplayer=uniplayer&cpn=qnb7ktt3nr-afjk1&cver=20200214&ei=matnxvfylokcxglbjix4ba&el=embedded&embed_config={}&eurl=https://manonabeach.com/north-scotland/sutherland/sango-bay&height=360&hl=en_us&html5=1&iframe=1&iv_load_policy=3&lact=171&sts=18305&video_id=4bam5wej-ek&width=640 20200219101727 https://www.youtube.com/get_video_info?html5=1&video_id=4Bam5wej-ek&cpn=qNB7KTt3Nr-afJk1&eurl=https%3A%2F%2Fmanonabeach.com%2Fnorth-scotland%2Fsutherland%2Fsango-bay&el=embedded&hl=en_US&sts=18305&lact=171&c=WEB_EMBEDDED_PLAYER&cver=20200214&cplayer=UNIPLAYER&cbr=Chrome&cbrver=76.0.3809.146&cos=Windows&cosver=10.0&iv_load_policy=3&autoplay=1&width=640&height=360&ei=MAtNXvfyLoKCxgLBjIX4BA&iframe=1&embed_config=%7B%7D&co_rel=1&ancestor_origins=https%3A%2F%2Fmanonabeach.com application/x-www-form-urlencoded 200 RU26TJOW7DW4VA6FDM6K6FALWTLSZBED - - 0 1920069 /1_data/npld/webrecorder/manonabeach/warcs/manonabeach-20200220092950.warc.gz

ikreymer commented 4 years ago

For many YouTube URLs, there won't be an exact match. It is fuzzy matching on several query params, in this case just video_id and html5 (https://github.com/webrecorder/pywb/blob/master/pywb/rules.yaml#L376). That means it should do a prefix query to find this, but perhaps this is not working with OCDX?
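As a rough illustration of the idea (this does not reproduce pywb's actual rule engine), the results of a prefix query could be narrowed down by comparing only the parameters that matter. `FUZZY_PARAMS`, `fuzzy_key` and `fuzzy_candidates` are hypothetical names, and the URLs are shortened versions of ones quoted in this issue.

```python
from urllib.parse import urlsplit, parse_qs

# Assumption: only these two parameters identify the video, as described above.
FUZZY_PARAMS = ("video_id", "html5")


def fuzzy_key(url, params=FUZZY_PARAMS):
    """Reduce a URL to host, path and the values of the selected parameters."""
    parts = urlsplit(url)
    query = parse_qs(parts.query, keep_blank_values=True)
    selected = tuple((name, tuple(query.get(name, []))) for name in params)
    return (parts.netloc.lower(), parts.path, selected)


def fuzzy_candidates(requested_url, candidate_urls):
    """Keep the candidates (e.g. from a prefix query) that share the fuzzy key."""
    wanted = fuzzy_key(requested_url)
    return [url for url in candidate_urls if fuzzy_key(url) == wanted]


requested = ("https://www.youtube.com/get_video_info?html5=1"
             "&video_id=4Bam5wej-ek&cpn=qNB7KTt3Nr-afJk1&lact=36")
candidates = [
    "https://www.youtube.com/get_video_info?html5=1&video_id=4Bam5wej-ek"
    "&cpn=qNB7KTt3Nr-afJk1&lact=171&ancestor_origins=https%3A%2F%2Fmanonabeach.com",
    "https://www.youtube.com/get_video_info?html5=1&video_id=7WqRVJjNGyY&lact=116",
]
print(fuzzy_candidates(requested, candidates))
```

Here only the first candidate survives: it has the same video_id and html5 values even though lact and ancestor_origins differ.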

ikreymer commented 4 years ago

Also want to mention, if prefix query is unfeasible due to size, it may be possible to do an alternate canonicalization to make video lookup more direct, eg something like https://videos.example.com/?video_id=.... Not sure if that makes sense yet though.. I think OutbackCDX includes some fuzzy matching support already..
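A very rough sketch of what such an alternate canonicalization might look like, purely to make the idea concrete. `canonical_video_url` is a hypothetical function, videos.example.com is the placeholder host from the comment above, and restricting it to /get_video_info on youtube.com is an assumption.

```python
from urllib.parse import urlsplit, parse_qs, urlencode


def canonical_video_url(url):
    """Map a YouTube get_video_info URL to a stable lookup URL keyed on video_id.

    Returns None when the URL is not a get_video_info request or has no video_id.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    is_youtube = host == "youtube.com" or host.endswith(".youtube.com")
    video_ids = parse_qs(parts.query).get("video_id")
    if not is_youtube or parts.path != "/get_video_info" or not video_ids:
        return None
    return "https://videos.example.com/?" + urlencode({"video_id": video_ids[0]})


print(canonical_video_url(
    "https://www.youtube.com/get_video_info?html5=1&video_id=TuX6tqk9PJM&cpn=TezRkAGwLcq-9_EV"
))
```

With a key like this, the lookup would be an exact match rather than a prefix scan, at the cost of indexing each capture under a second, synthetic URL.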

anjackson commented 4 years ago

The problem seems to be that pywb is not performing a prefix query (it’s not passing matchType=prefix)

anjackson commented 4 years ago

All good now - working on the DEV services. Will roll out when we can.

anjackson commented 1 year ago

This seems not to be working right, in a couple of different ways.

Firstly, in a clean install/test system, playback works fine. So it is possible!

But, accessing QA Wayback in Firefox, some of the sub-requests are not passing through the authentication header as expected. e.g.

https://www.webarchive.org.uk/act/wayback/archive/20200217104425oe_/https://r1---sn-bvvbax-ac5e.googlevideo.com/videoplayback?expire=1581957870&ei=jm5KXoKTIoeQVrLPpZAG&ip=194.66.237.31&id=o-AGQOq6OSZTzAnnnVigGVGRgQUveSMRzZIHmWcMOd5jPS&itag=18&source=youtube&requiressl=yes&mm=31%2C26&mn=sn-bvvbax-ac5e%2Csn-5hne6nlr&ms=au%2Conr&mv=m&mvi=0&nh=IgppcjAxLmxiYTAzKg4xOTQuODIuMTc0LjIxNw%2C&pl=21&initcwndbps=1851250&vprv=1&mime=video%2Fmp4&gir=yes&clen=23692960&ratebypass=yes&dur=305.272&lmt=1394606022352193&mt=1581936173&fvip=1&fexp=23842630&c=WEB_EMBEDDED_PLAYER&sparams=expire%2Cei%2Cip%2Cid%2Citag%2Csource%2Crequiressl%2Cvprv%2Cmime%2Cgir%2Cclen%2Cratebypass%2Cdur%2Clmt&sig=ALgxI2wwRAIgYTSq4l-9bsEaoXDwnjjooJudP8PrFfIo7pNTUxq-MrkCICjXWCdu0IAPaUnc2DLmxeTkpHLdYrGeuQmF03490pFs&lsparams=mm%2Cmn%2Cms%2Cmv%2Cmvi%2Cnh%2Cpl%2Cinitcwndbps&lsig=AHylml4wRQIhAMT0jY28WS4DgRzqj_f5QTtfFGG_n0V0AW7V6RGcVuvdAiAhXSiUP-lNmuGiwHy5tEo1U4AQRbg-2MI68ZgFcoU1XQ%3D%3D&cpn=oJMXib2j1u3GrJ4Y&cver=20200214&ptk=youtube_none&pltype=contentugc

gets a

401 Authorization Required

You have to be [logged into W3ACT](https://www.webarchive.org.uk/act/login) and be a member of a Legal Deposit Library to access this page.

Oddly, QA Wayback seems to work, sometimes, in Chrome at least.

The OA Wayback fails, but that's for NPLD scoping reasons. To work, we'll need access to embedded URLs like

https://www.webarchive.org.uk/wayback/archive/20200217090605if_/https://www.youtube.com/embed/7WqRVJjNGyY?autoplay=1&controls=1&wmode=opaque&rel=0&egm=0&iv_load_policy=3&hd=0

and the googlevideo.com ones like the one above. e.g. should we let through

(com,youtube,www)/embed/
(com,googlevideo,

anjackson commented 1 year ago

Tagging and building ukwa/python-w3act:2.1.2...

anjackson commented 1 year ago

And updated to run this version in production. If all goes well, this should be resolved in an hour or so.

anjackson commented 1 year ago

Darn, the allowed SURT was wrong: it included the www subdomain, which should not be part of the canonical form. Tagging 2.1.3 and switching to that.
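To show why the www variant never matched, here is a simplified sketch of a SURT-prefix allow check. It is not the canonicalizer used by pywb or W3ACT: `simple_surt` and `is_allowed` are hypothetical helpers, the key drops the query string, only a leading "www." is stripped, and the corrected prefix "(com,youtube)/embed/" is my assumption about what the fix looks like.

```python
from urllib.parse import urlsplit


def simple_surt(url):
    """Build a simplified SURT-style key: reversed host labels, then the path."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]  # canonical form omits the www subdomain
    reversed_host = ",".join(reversed(host.split(".")))
    return "(" + reversed_host + ")" + parts.path


def is_allowed(url, allowed_prefixes):
    """True if the URL's simplified SURT starts with any allowed prefix."""
    surt = simple_surt(url)
    return any(surt.startswith(prefix) for prefix in allowed_prefixes)


embed = "https://www.youtube.com/embed/7WqRVJjNGyY?autoplay=1&controls=1"
video = "https://r1---sn-bvvbax-ac5e.googlevideo.com/videoplayback?itag=18"

# The www form never matches, because www is stripped before comparison.
print(is_allowed(embed, ["(com,youtube,www)/embed/"]))
print(is_allowed(embed, ["(com,youtube)/embed/", "(com,googlevideo,"]))
print(is_allowed(video, ["(com,youtube)/embed/", "(com,googlevideo,"]))
```

The open-ended "(com,googlevideo," prefix covers any googlevideo.com subdomain, which matters because the video hosts vary per request.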

anjackson commented 1 year ago

Okay, quite a few embed/transclusion sources needed to be added, but it's working now I think.

anjackson commented 1 year ago

So, the front-page videos seem to be working. The sub-page videos are not generally working, but the one example I looked at is one we don't seem to have at all, so maybe that's the issue?