Open edsu opened 2 years ago
Watching the HTTP requests in Production for:
https://swap.stanford.edu/was/20220921195257/https://nodontdie.com/louie-roots
You can see that this request generates a HTTP 500 response after spinning for 2 minutes:
whereas it works fine in Stage:
It would be good to see what the actual error but it doesn't appear to be logged to the Apache error log.
I saw a bunch of these (104) in the /var/log/syslog
when the request failed:
Oct 3 18:12:36 was-pywb-prod poetry[993635]: Mon Oct 3 18:12:36 2022 - *** uWSGI listen queue of socket ":8081" (fd: 5) full !!! (101/100) ***
I'm wondering if the duplicate entries from having ingested both
If you look at the underlying CDX request for this you see this response:
{"urlkey": "com,googlevideo,rr4---sn-a5mekn6s)/videoplayback?c=web_embedded_player&cm2rm=sn-o09sd7s&cms_redirect=yes&cmsv=e&cnr=14&cpn=bik9wgahrc5jhvyi&cver=1.20220918.00.00&dur=5588.230&ei=umwry6pgdcon2lypre2i2am&expire=1663811866&fexp=24001373,24007246&id=o-adz2f4-l-smqnl8l31k7a7-cg3h0wt_ga8aikxgcxg2t&ip=137.184.177.15&itag=18&lmt=1643054906292951&lsig=ag3c_xawraigf5-rgf2xd0lozxcgzwxgoxqhqzmgwxa-vdytr57h5lccieavokgnaie3zdafqho7fbimphutntsriqa9kdgbdema&lsparams=mh,mm,mn,ms,mv,mvi,pl&mh=gi&mime=video/mp4&mm=34&mn=sn-a5mekn6s&ms=ltu&mt=1663790218&mv=m&mvi=4&n=1kzdgoh5orw1ag&ns=hgbb8j72m15xhohixiz33qsi&pl=20&pltype=contentugc&ptk=youtube_none&ratebypass=yes&redirect_counter=1&req_id=7066179c6559a3ee&requiressl=yes&sig=aoq0qj8wrqigi72ue5jztjgbj2jadhg0fmfpndh_ec1gpi7il-a95riciqd4jentwa5oqsixjisc_ijd1dscp3yxazfac6xaxhsw7w==&source=youtube&sparams=expire,ei,ip,id,itag,source,requiressl,spc,vprv,mime,ns,cnr,ratebypass,dur,lmt&spc=yr2vp5oyfrtwxtxwdiqfteeej98xy1c&txp=6218224&vprv=1", "timestamp": "20220921195748", "url": "https://rr4---sn-a5mekn6s.googlevideo.com/videoplayback?expire=1663811866&ei=umwrY6PgDcON2LYPre2I2AM&ip=137.184.177.15&id=o-ADz2f4-l-SMQNL8L31k7a7-cG3h0WT_gA8AIKxgcXG2t&itag=18&source=youtube&requiressl=yes&spc=yR2vp5OyfrTwxTXwDIqftEEeJ98XY1c&vprv=1&mime=video%2Fmp4&ns=HGBB8J72m15xHOhIxIZ33QsI&cnr=14&ratebypass=yes&dur=5588.230&lmt=1643054906292951&fexp=24001373,24007246&c=WEB_EMBEDDED_PLAYER&txp=6218224&n=1kZdGOH5ORW1ag&sparams=expire%2Cei%2Cip%2Cid%2Citag%2Csource%2Crequiressl%2Cspc%2Cvprv%2Cmime%2Cns%2Ccnr%2Cratebypass%2Cdur%2Clmt&sig=AOq0QJ8wRQIgI72UE5jZTJgBj2jadHg0fMFpnDh_EC1gpi7IL-a95RICIQD4jEnTWA5oqSIxjIsc_IJd1dsCP3yxAzfAC6xaxHSW7w%3D%3D&cpn=BIK9wGaHrC5jhvYi&cver=1.20220918.00.00&ptk=youtube_none&pltype=contentugc&redirect_counter=1&cm2rm=sn-o09sd7s&req_id=7066179c6559a3ee&cms_redirect=yes&cmsv=e&mh=gi&mm=34&mn=sn-a5mekn6s&ms=ltu&mt=1663790218&mv=m&mvi=4&pl=20&lsparams=mh,mm,mn,ms,mv,mvi,pl&lsig=AG3C_xAwRAIgf5-RGF2xd0lozXCgZwxGOxqhQZmGWxa-vDYtr57H5lcCIEAvoKgnaIe3ZDafQHO7fbimPhUTNTsRiqa9KdGBDemA", "mime": "video/mp4", "status": "200", "digest": "sha1:XWXMGAQPGUJI6FHCAWLI6ABQQ5F5Y6OB", "length": "161861539", "offset": "383819330", "filename": "qg137zm7702/sv/558/gk/6917/don-t-die-peter-chan-rec-20220921195258533131-crawl-manual-20220921195228-385b66de-86c-0.warc.gz", "source": "was:level2.cdxj", "source-coll": "was", "access": "allow"}
{"urlkey": "com,googlevideo,rr4---sn-a5mekn6s)/videoplayback?c=web_embedded_player&cm2rm=sn-o09sd7s&cms_redirect=yes&cmsv=e&cnr=14&cpn=bik9wgahrc5jhvyi&cver=1.20220918.00.00&dur=5588.230&ei=umwry6pgdcon2lypre2i2am&expire=1663811866&fexp=24001373,24007246&id=o-adz2f4-l-smqnl8l31k7a7-cg3h0wt_ga8aikxgcxg2t&ip=137.184.177.15&itag=18&lmt=1643054906292951&lsig=ag3c_xawraigf5-rgf2xd0lozxcgzwxgoxqhqzmgwxa-vdytr57h5lccieavokgnaie3zdafqho7fbimphutntsriqa9kdgbdema&lsparams=mh,mm,mn,ms,mv,mvi,pl&mh=gi&mime=video/mp4&mm=34&mn=sn-a5mekn6s&ms=ltu&mt=1663790218&mv=m&mvi=4&n=1kzdgoh5orw1ag&ns=hgbb8j72m15xhohixiz33qsi&pl=20&pltype=contentugc&ptk=youtube_none&ratebypass=yes&redirect_counter=1&req_id=7066179c6559a3ee&requiressl=yes&sig=aoq0qj8wrqigi72ue5jztjgbj2jadhg0fmfpndh_ec1gpi7il-a95riciqd4jentwa5oqsixjisc_ijd1dscp3yxazfac6xaxhsw7w==&source=youtube&sparams=expire,ei,ip,id,itag,source,requiressl,spc,vprv,mime,ns,cnr,ratebypass,dur,lmt&spc=yr2vp5oyfrtwxtxwdiqfteeej98xy1c&txp=6218224&vprv=1", "timestamp": "20220921195748", "url": "https://rr4---sn-a5mekn6s.googlevideo.com/videoplayback?expire=1663811866&ei=umwrY6PgDcON2LYPre2I2AM&ip=137.184.177.15&id=o-ADz2f4-l-SMQNL8L31k7a7-cG3h0WT_gA8AIKxgcXG2t&itag=18&source=youtube&requiressl=yes&spc=yR2vp5OyfrTwxTXwDIqftEEeJ98XY1c&vprv=1&mime=video%2Fmp4&ns=HGBB8J72m15xHOhIxIZ33QsI&cnr=14&ratebypass=yes&dur=5588.230&lmt=1643054906292951&fexp=24001373,24007246&c=WEB_EMBEDDED_PLAYER&txp=6218224&n=1kZdGOH5ORW1ag&sparams=expire%2Cei%2Cip%2Cid%2Citag%2Csource%2Crequiressl%2Cspc%2Cvprv%2Cmime%2Cns%2Ccnr%2Cratebypass%2Cdur%2Clmt&sig=AOq0QJ8wRQIgI72UE5jZTJgBj2jadHg0fMFpnDh_EC1gpi7IL-a95RICIQD4jEnTWA5oqSIxjIsc_IJd1dsCP3yxAzfAC6xaxHSW7w%3D%3D&cpn=BIK9wGaHrC5jhvYi&cver=1.20220918.00.00&ptk=youtube_none&pltype=contentugc&redirect_counter=1&cm2rm=sn-o09sd7s&req_id=7066179c6559a3ee&cms_redirect=yes&cmsv=e&mh=gi&mm=34&mn=sn-a5mekn6s&ms=ltu&mt=1663790218&mv=m&mvi=4&pl=20&lsparams=mh,mm,mn,ms,mv,mvi,pl&lsig=AG3C_xAwRAIgf5-RGF2xd0lozXCgZwxGOxqhQZmGWxa-vDYtr57H5lcCIEAvoKgnaIe3ZDafQHO7fbimPhUTNTsRiqa9KdGBDemA", "mime": "video/mp4", "status": "200", "digest": "sha1:XWXMGAQPGUJI6FHCAWLI6ABQQ5F5Y6OB", "length": "161861539", "offset": "383819330", "filename": "qg137zm7702/yh/223/zs/4674/don-t-die-peter-chan-20220921202603400-385b66de-86c-0-rec-20220921195258533131-crawl-manual-20220921195228-385b66de-86c-0.warc.gz", "source": "was:level2.cdxj", "source-coll": "was", "access": "allow"}
{"urlkey": "com,googlevideo,rr4---sn-a5mekn6s)/videoplayback?c=web_embedded_player&cm2rm=sn-o09sd7s&cms_redirect=yes&cmsv=e&cnr=14&cpn=bik9wgahrc5jhvyi&cver=1.20220918.00.00&dur=5588.230&ei=umwry6pgdcon2lypre2i2am&expire=1663811866&fexp=24001373,24007246&id=o-adz2f4-l-smqnl8l31k7a7-cg3h0wt_ga8aikxgcxg2t&ip=137.184.177.15&itag=18&lmt=1643054906292951&lsig=ag3c_xawraigf5-rgf2xd0lozxcgzwxgoxqhqzmgwxa-vdytr57h5lccieavokgnaie3zdafqho7fbimphutntsriqa9kdgbdema&lsparams=mh,mm,mn,ms,mv,mvi,pl&mh=gi&mime=video/mp4&mm=34&mn=sn-a5mekn6s&ms=ltu&mt=1663790218&mv=m&mvi=4&n=1kzdgoh5orw1ag&ns=hgbb8j72m15xhohixiz33qsi&pl=20&pltype=contentugc&ptk=youtube_none&ratebypass=yes&redirect_counter=1&req_id=7066179c6559a3ee&requiressl=yes&sig=aoq0qj8wrqigi72ue5jztjgbj2jadhg0fmfpndh_ec1gpi7il-a95riciqd4jentwa5oqsixjisc_ijd1dscp3yxazfac6xaxhsw7w==&source=youtube&sparams=expire,ei,ip,id,itag,source,requiressl,spc,vprv,mime,ns,cnr,ratebypass,dur,lmt&spc=yr2vp5oyfrtwxtxwdiqfteeej98xy1c&txp=6218224&vprv=1", "timestamp": "20220921195748", "url": "https://rr4---sn-a5mekn6s.googlevideo.com/videoplayback?expire=1663811866&ei=umwrY6PgDcON2LYPre2I2AM&ip=137.184.177.15&id=o-ADz2f4-l-SMQNL8L31k7a7-cG3h0WT_gA8AIKxgcXG2t&itag=18&source=youtube&requiressl=yes&spc=yR2vp5OyfrTwxTXwDIqftEEeJ98XY1c&vprv=1&mime=video%2Fmp4&ns=HGBB8J72m15xHOhIxIZ33QsI&cnr=14&ratebypass=yes&dur=5588.230&lmt=1643054906292951&fexp=24001373,24007246&c=WEB_EMBEDDED_PLAYER&txp=6218224&n=1kZdGOH5ORW1ag&sparams=expire%2Cei%2Cip%2Cid%2Citag%2Csource%2Crequiressl%2Cspc%2Cvprv%2Cmime%2Cns%2Ccnr%2Cratebypass%2Cdur%2Clmt&sig=AOq0QJ8wRQIgI72UE5jZTJgBj2jadHg0fMFpnDh_EC1gpi7IL-a95RICIQD4jEnTWA5oqSIxjIsc_IJd1dsCP3yxAzfAC6xaxHSW7w%3D%3D&cpn=BIK9wGaHrC5jhvYi&cver=1.20220918.00.00&ptk=youtube_none&pltype=contentugc&redirect_counter=1&cm2rm=sn-o09sd7s&req_id=7066179c6559a3ee&cms_redirect=yes&cmsv=e&mh=gi&mm=34&mn=sn-a5mekn6s&ms=ltu&mt=1663790218&mv=m&mvi=4&pl=20&lsparams=mh,mm,mn,ms,mv,mvi,pl&lsig=AG3C_xAwRAIgf5-RGF2xd0lozXCgZwxGOxqhQZmGWxa-vDYtr57H5lcCIEAvoKgnaIe3ZDafQHO7fbimPhUTNTsRiqa9KdGBDemA", "mime": "video/mp4", "status": "200", "digest": "sha1:XWXMGAQPGUJI6FHCAWLI6ABQQ5F5Y6OB", "length": "161861539", "offset": "383819330", "filename": "qg137zm7702/yx/015/cx/1366/don-t-die-peter-chan-20220921202603400-385b66de-86c-0-rec-20220921195258533131-crawl-manual-20220921195228-385b66de-86c-0.warc.gz", "source": "was:level2.cdxj", "source-coll": "was", "access": "allow"}
The last two lines have identical properties except for filename
. I wonder if this could be causing problems for fuzzy matching?
Should we try to remove the duplicated entries?
Yes, I'm going to try to copy these WARCs out of the production environment and see if I can replicate the problem in my development environment. If that works then I can experiment with removing the duplicates to see if that helps? I get nervous about experimenting with the CDX files in production!
I've been able to replicate the problem in my development environment by downloading the WARCs for sv558gk6917
, yh223zs4674
and yx015cx1366
and indexing them with cdxj-indexer. At this point playback works fine, even with the duplicate WARC content.
But then I copied the CDX entries from our production level3.cdxj
that matched ^com,googlevideo,
and added them to my development index as well (8,323,879 index entries). After this the request above that fails in production but not in stage results in a 7 minute wait for a response.
In our production environment uwsgi is giving up on this request because it is taking too long, and realistically our users won't wait 7 minutes for a response anyway.
Here is what I see in the pywb log when running it manually with the wayback
command:
127.0.0.1 - - [2022-10-07 03:11:05] "POST /test/resource/postreq?url=https%3A%2F%2Frr3---sn-o097znsl.googlevideo.com%2Fvideoplayback%3Fexpire%3D1663811866%26ei%3DumwrY6PgDcON2LYPre2I2AM%26ip%3D137.184.177.15%26id%3Do-ADz2f4-l-SMQNL8L31k7a7-cG3h0WT_gA8AIKxgcXG2t%26itag%3D18%26source%3Dyoutube%26requiressl%3Dyes%26mh%3Dgi%26mm%3D31%252C29%26mn%3Dsn-o097znsl%252Csn-n4v7snls%26ms%3Dau%252Crdu%26mv%3Dm%26mvi%3D3%26pl%3D20%26initcwndbps%3D227500%26spc%3DyR2vp5OyfrTwxTXwDIqftEEeJ98XY1c%26vprv%3D1%26mime%3Dvideo%252Fmp4%26ns%3DHGBB8J72m15xHOhIxIZ33QsI%26cnr%3D14%26ratebypass%3Dyes%26dur%3D5588.230%26lmt%3D1643054906292951%26mt%3D1663789767%26fvip%3D4%26fexp%3D24001373%252C24007246%26c%3DWEB_EMBEDDED_PLAYER%26txp%3D6218224%26n%3D1kZdGOH5ORW1ag%26sparams%3Dexpire%252Cei%252Cip%252Cid%252Citag%252Csource%252Crequiressl%252Cspc%252Cvprv%252Cmime%252Cns%252Ccnr%252Cratebypass%252Cdur%252Clmt%26sig%3DAOq0QJ8wRQIgI72UE5jZTJgBj2jadHg0fMFpnDh_EC1gpi7IL-a95RICIQD4jEnTWA5oqSIxjIsc_IJd1dsCP3yxAzfAC6xaxHSW7w%253D%253D%26lsparams%3Dmh%252Cmm%252Cmn%252Cms%252Cmv%252Cmvi%252Cpl%252Cinitcwndbps%26lsig%3DAG3C_xAwRQIgHI4spdAtiH_OC3nQts77b8KCHteX4wt3SADv23oi3NkCIQDbg4ga_E4nug6E_xhKh-bhX1K7BRlCjIhlIoxIwrr4EQ%253D%253D%26cpn%3D8lLBIK9wGaHrC5jh%26cver%3D1.20220918.00.00%26ptk%3Dyoutube_none%26pltype%3Dcontentugc&closest=20220921195743&matchType=exact HTTP/1.1" 200 7501 420.605267
127.0.0.1 - - [2022-10-07 03:11:05] "GET /test/20220921195743oe_/https://rr3---sn-o097znsl.googlevideo.com/videoplayback?expire=1663811866&ei=umwrY6PgDcON2LYPre2I2AM&ip=137.184.177.15&id=o-ADz2f4-l-SMQNL8L31k7a7-cG3h0WT_gA8AIKxgcXG2t&itag=18&source=youtube&requiressl=yes&mh=gi&mm=31%2C29&mn=sn-o097znsl%2Csn-n4v7snls&ms=au%2Crdu&mv=m&mvi=3&pl=20&initcwndbps=227500&spc=yR2vp5OyfrTwxTXwDIqftEEeJ98XY1c&vprv=1&mime=video%2Fmp4&ns=HGBB8J72m15xHOhIxIZ33QsI&cnr=14&ratebypass=yes&dur=5588.230&lmt=1643054906292951&mt=1663789767&fvip=4&fexp=24001373%2C24007246&c=WEB_EMBEDDED_PLAYER&txp=6218224&n=1kZdGOH5ORW1ag&sparams=expire%2Cei%2Cip%2Cid%2Citag%2Csource%2Crequiressl%2Cspc%2Cvprv%2Cmime%2Cns%2Ccnr%2Cratebypass%2Cdur%2Clmt&sig=AOq0QJ8wRQIgI72UE5jZTJgBj2jadHg0fMFpnDh_EC1gpi7IL-a95RICIQD4jEnTWA5oqSIxjIsc_IJd1dsCP3yxAzfAC6xaxHSW7w%3D%3D&lsparams=mh%2Cmm%2Cmn%2Cms%2Cmv%2Cmvi%2Cpl%2Cinitcwndbps&lsig=AG3C_xAwRQIgHI4spdAtiH_OC3nQts77b8KCHteX4wt3SADv23oi3NkCIQDbg4ga_E4nug6E_xhKh-bhX1K7BRlCjIhlIoxIwrr4EQ%3D%3D&cpn=8lLBIK9wGaHrC5jh&cver=1.20220918.00.00&ptk=youtube_none&pltype=contentugc HTTP/1.1" 302 7061 420.617071
I'm not sure why the one requests shows up as two (a POST and a GET) in the log. I wonder if there is something we can be doing here to speed things up. I'm kind of confused why if the CDX lookup is quick why replay is not.
What happens if you remove duplicates in the development environment? Would it work normal again?
No, the duplicates are actually not negatively impacting the replay. The problem is the amount of time pywb is spending looking through the googlevideo.com
CDX entries. The good news that now that we can replicate it outside of the production environment it'll be easier to investigate what is going on, and maybe see if there is an optimization to be made.
A possibly related pywb issue: https://github.com/webrecorder/pywb/issues/573
Another case on same issue: https://swap.stanford.edu/was/20230505172005/https://library.stanford.edu/projects/player-piano-program
Reported by @peterchanws on 2022-09-29:
The failure to play back can be seen when trying to play this video, which works in stage but not in production:
@andrewjbtw confirmed that the WACZ files for sv558gk6917 and yx015cx1366 in production and stage have the same fixity value. I also checked that the WARC files and CDX entries are identical.
However @peterchanaws also reported that another version was uploaded to SDR as well, which did not finish uploading:
This does appear to be 331 MB instead of 12.3 GB. It is possible that this truncated WACZ is somehow corrupting playback?
The CDX files as a whole are much larger as a whole in production, than in stage. There is also a previous Archive-It crawl from 2020 in Production, which is evident when comparing the index results: