Open anjackson opened 4 years ago
Note the `ancestor_origins=https%3A%2F%2Fmanonabeach.com` bit. This seems to vary even between browsers.
This should be getting fuzzy matched, but perhaps the fuzzy matching is not working with the outbackcdx configuration? This is with the latest release of outbackcdx, right?
Yeah, latest OCDX 0.7.0 I think, but older pywb. I can probably try the newer pywb tomorrow.
Okay, so the latest `ukwa-pywb:2.4.0-beta` didn't help. Looking at the CDX queries, we see:
192.168.45.60 - - [28/Feb/2020:09:51:38 +0000] "GET /data-heritrix?url=https%3A//www.youtube.com/get_video_info%3Fhtml5%3D1%26video_id%3D7WqRVJjNGyY%26cpn%3DDyctYvUgxjzNF9ul%26eurl%3Dhttps%253A%252F%252Fmanonabeach.com%252F%26el%3Dembedded%26hl%3Den_US%26sts%3D18305%26lact%3D116%26c%3DWEB_EMBEDDED_PLAYER%26cver%3D20200214%26cplayer%3DUNIPLAYER%26cbr%3DChrome%26cbrver%3D76.0.3809.136%26cos%3DWindows%26cosver%3D10.0%26iv_load_policy%3D3%26autoplay%3D1%26width%3D640%26height%3D360%26ei%3DiG5KXsyiNfjmhAfD0bi4Dg%26iframe%3D1%26embed_config%3D%257B%257D%26co_rel%3D1&closest=20200217104425&sort=closest&limit=10 HTTP/1.1" 200 5 "-" "python-requests/2.22.0" "-"
192.168.45.60 - - [28/Feb/2020:09:51:38 +0000] "GET /data-heritrix?url=https%3A//www.youtube.com/get_video_info%3F&closest=&sort=closest HTTP/1.1" 200 5 "-" "python-requests/2.22.0" "-"
Presumably the second query is intended to be a prefix-based query that lists the different URLs with different query parameters? If so, it should use `matchType=prefix`, e.g.
http://cdx.api.wa.bl.uk/data-heritrix?matchType=prefix&url=https%3A//www.youtube.com/get_video_info%3F&closest=&sort=closest
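For illustration, here is a minimal sketch of building that kind of prefix query in Python. The helper name `build_prefix_query` is hypothetical, and this is not how pywb constructs its requests internally; it only shows the `matchType=prefix` parameter being added alongside `url` (and the `limit` parameter seen in the log above).

```python
from urllib.parse import urlencode


def build_prefix_query(endpoint, url_prefix, limit=None):
    """Build a CDX query URL that asks for every capture under url_prefix.

    `build_prefix_query` is a hypothetical helper for illustration only.
    """
    params = {"url": url_prefix, "matchType": "prefix"}
    if limit is not None:
        params["limit"] = str(limit)
    # urlencode percent-encodes the target URL so it survives as one parameter.
    return endpoint + "?" + urlencode(params)


query = build_prefix_query(
    "http://cdx.api.wa.bl.uk/data-heritrix",
    "https://www.youtube.com/get_video_info?",
    limit=10,
)
print(query)
```

Without `matchType=prefix`, the same request is treated as an exact lookup of `get_video_info?` with no parameters, which would explain the empty result for the second query in the log.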
Still hitting this. Page is requesting:
https://www.youtube.com/get_video_info?html5=1&video_id=4Bam5wej-ek&cpn=qNB7KTt3Nr-afJk1&eurl=https%3A%2F%2Fmanonabeach.com%2Fnorth-scotland%2Fsutherland%2Fsango-bay&el=embedded&hl=en_US&sts=18305&lact=36&c=WEB_EMBEDDED_PLAYER&cver=20200214&cplayer=UNIPLAYER&cbr=Chrome&cbrver=76.0.3809.146&cos=Windows&cosver=10.0&iv_load_policy=3&autoplay=1&width=640&height=360&ei=MAtNXvfyLoKCxgLBjIX4BA&iframe=1&embed_config=%7B%7D&co_rel=1
OutbackCDX contains:
com,youtube)/get_video_info?ancestor_origins=https://manonabeach.com&autoplay=1&c=web_embedded_player&cbr=chrome&cbrver=76.0.3809.146&co_rel=1&cos=windows&cosver=10.0&cplayer=uniplayer&cpn=qnb7ktt3nr-afjk1&cver=20200214&ei=matnxvfylokcxglbjix4ba&el=embedded&embed_config={}&eurl=https://manonabeach.com/north-scotland/sutherland/sango-bay&height=360&hl=en_us&html5=1&iframe=1&iv_load_policy=3&lact=171&sts=18305&video_id=4bam5wej-ek&width=640 20200219101727 https://www.youtube.com/get_video_info?html5=1&video_id=4Bam5wej-ek&cpn=qNB7KTt3Nr-afJk1&eurl=https%3A%2F%2Fmanonabeach.com%2Fnorth-scotland%2Fsutherland%2Fsango-bay&el=embedded&hl=en_US&sts=18305&lact=171&c=WEB_EMBEDDED_PLAYER&cver=20200214&cplayer=UNIPLAYER&cbr=Chrome&cbrver=76.0.3809.146&cos=Windows&cosver=10.0&iv_load_policy=3&autoplay=1&width=640&height=360&ei=MAtNXvfyLoKCxgLBjIX4BA&iframe=1&embed_config=%7B%7D&co_rel=1&ancestor_origins=https%3A%2F%2Fmanonabeach.com application/x-www-form-urlencoded 200 RU26TJOW7DW4VA6FDM6K6FALWTLSZBED - - 0 1920069 /1_data/npld/webrecorder/manonabeach/warcs/manonabeach-20200220092950.warc.gz
For many YouTube URLs, there won't be an exact match. It is fuzzy matching on several query params, in this case just `video_id` and `html5` (https://github.com/webrecorder/pywb/blob/master/pywb/rules.yaml#L376). That means it should do a prefix query to find this, but perhaps this is not working with OCDX?
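To make the idea concrete, here is a rough sketch of comparing two URLs on only the `video_id` and `html5` parameters. This is a simplified illustration, not pywb's actual rule engine; `fuzzy_key` is a hypothetical name, and the example URLs are shortened versions of the ones in this thread.

```python
from urllib.parse import urlsplit, parse_qsl


def fuzzy_key(url, keep=("video_id", "html5")):
    """Reduce a URL to host, path and only the query params listed in `keep`.

    Hypothetical helper: a simplified stand-in for rule-based fuzzy matching.
    """
    parts = urlsplit(url)
    kept = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k in keep
    )
    return (parts.netloc.lower(), parts.path, tuple(kept))


# Shortened versions of the requested and archived URLs from this issue.
requested = "https://www.youtube.com/get_video_info?html5=1&video_id=4Bam5wej-ek&lact=36"
archived = (
    "https://www.youtube.com/get_video_info?html5=1&video_id=4Bam5wej-ek"
    "&lact=171&ancestor_origins=https%3A%2F%2Fmanonabeach.com"
)

# Volatile params such as lact and ancestor_origins are ignored.
print(fuzzy_key(requested) == fuzzy_key(archived))
```

The candidates to compare against still have to come from somewhere, which is why the lookup depends on a working prefix query against the index.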
Also want to mention, if a prefix query is unfeasible due to size, it may be possible to do an alternate canonicalization to make video lookup more direct, e.g. something like `https://videos.example.com/?video_id=...`. Not sure if that makes sense yet though. I think OutbackCDX includes some fuzzy matching support already.
The problem seems to be that pywb is not performing a prefix query (it's not passing `matchType=prefix`).
All good now - working on the DEV services. Will roll out when we can.
This seems not to be working right, in a couple of different ways.
Firstly, in a clean install/test system, playback works fine. So it is possible!
But, accessing QA Wayback in Firefox, some of the sub-requests are not passing through the authentication header as expected. e.g.
https://www.webarchive.org.uk/act/wayback/archive/20200217104425oe_/https://r1---sn-bvvbax-ac5e.googlevideo.com/videoplayback?expire=1581957870&ei=jm5KXoKTIoeQVrLPpZAG&ip=194.66.237.31&id=o-AGQOq6OSZTzAnnnVigGVGRgQUveSMRzZIHmWcMOd5jPS&itag=18&source=youtube&requiressl=yes&mm=31%2C26&mn=sn-bvvbax-ac5e%2Csn-5hne6nlr&ms=au%2Conr&mv=m&mvi=0&nh=IgppcjAxLmxiYTAzKg4xOTQuODIuMTc0LjIxNw%2C&pl=21&initcwndbps=1851250&vprv=1&mime=video%2Fmp4&gir=yes&clen=23692960&ratebypass=yes&dur=305.272&lmt=1394606022352193&mt=1581936173&fvip=1&fexp=23842630&c=WEB_EMBEDDED_PLAYER&sparams=expire%2Cei%2Cip%2Cid%2Citag%2Csource%2Crequiressl%2Cvprv%2Cmime%2Cgir%2Cclen%2Cratebypass%2Cdur%2Clmt&sig=ALgxI2wwRAIgYTSq4l-9bsEaoXDwnjjooJudP8PrFfIo7pNTUxq-MrkCICjXWCdu0IAPaUnc2DLmxeTkpHLdYrGeuQmF03490pFs&lsparams=mm%2Cmn%2Cms%2Cmv%2Cmvi%2Cnh%2Cpl%2Cinitcwndbps&lsig=AHylml4wRQIhAMT0jY28WS4DgRzqj_f5QTtfFGG_n0V0AW7V6RGcVuvdAiAhXSiUP-lNmuGiwHy5tEo1U4AQRbg-2MI68ZgFcoU1XQ%3D%3D&cpn=oJMXib2j1u3GrJ4Y&cver=20200214&ptk=youtube_none&pltype=contentugc
gets a
401 Authorization Required
You have to be [logged into W3ACT](https://www.webarchive.org.uk/act/login) and be a member of a Legal Deposit Library to access this page.
Oddly, QA Wayback seems to work, sometimes, in Chrome at least.
The OA Wayback fails, but that's for NPLD scoping reasons. To work, we'll need access to embedded URLs like
https://www.webarchive.org.uk/wayback/archive/20200217090605if_/https://www.youtube.com/embed/7WqRVJjNGyY?autoplay=1&controls=1&wmode=opaque&rel=0&egm=0&iv_load_policy=3&hd=0
and the `googlevideo.com` ones like the one above, e.g. should we let through:
(com,youtube,www)/embed/
(com,googlevideo,
Tagging and building ukwa/python-w3act:2.1.2
...
And updated to run this version in production. If all goes well, this should be resolved in an hour or so.
Darn, the allowed SURT was wrong as it included the `www` subdomain, but that should not be part of the canonical form. Tagging 2.1.3 and switching to that.
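As a rough illustration of why the `www` matters, here is a deliberately simplified sketch of the canonical key form seen in the OutbackCDX record above (`com,youtube)/...`). `simple_surt_key` is a hypothetical helper, not the real canonicalizer used by pywb or python-w3act, and it ignores ports, query-parameter sorting and other rules.

```python
from urllib.parse import urlsplit


def simple_surt_key(url):
    """Very simplified SURT-style key: reversed host without www, then the path.

    Hypothetical helper for illustration; real canonicalization does much more.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]  # canonical form drops the www subdomain
    reversed_host = ",".join(reversed(host.split(".")))
    return f"{reversed_host}){parts.path.lower()}"


print(simple_surt_key("https://www.youtube.com/embed/"))
```

An allow-list prefix that still contains `www` would never match keys in this form, which is consistent with the embeds staying blocked until 2.1.3.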
Okay, quite a few embed/transclusion sources needed to be added, but it's working now I think.
So, the front-page videos seem to be working. The sub-page videos are not generally working, but the one example I looked at is one we don't seem to have at all, so maybe that's the issue?
Having archived this site in WebRecorder: https://manonabeach.com
The WARCs are not quite playing back in our QA Wayback service. The WARC has URLs like:
but proxy-mode playback 404s requesting:
This might just be because our QA Wayback is a bit out of date compared with Pywb.