Closed rebeccacremona closed 5 years ago
Example from a similar warc, with the enhanced logging on:
Jun 5 17:46:03 ip-172-31-60-73 docker-compose[32104]: app_1 | raise CDXException(msg)
Jun 5 17:46:03 ip-172-31-60-73 docker-compose[32104]: app_1 | pywb.warcserver.index.cdxobject.CDXException: unknown 27-field cdx format: [b"file:///0kwG5NyyFJ9/source/ArticleContentPage;pos=1;tile='", b'+', b'dc_tile', b'+', b"';Article=LatestNews;'", b'+', b'dcoptTag', b'+', b"'sz=160x600;ord='", b'+', b'ord', b'+', b"'", b'20150302105657', b'{"url":"file:///0kwG5NyyFJ9/source/ArticleContentPage;pos=1;tile=\'', b'+', b'dc_tile', b'+', b"';Article=LatestNews;'", b'+', b'dcoptTag', b'+', b"'sz=160x600;ord='", b'+', b'ord', b'+', b'\'","mime":"unk","digest":"S7XBFXLBHON6VPPRQAVQ4VQECS5KVD4L","length":"919","offset":"2228210","filename":"rec-20190605174603357843-20fdddd86300-C4JQFCKH.warc.gz"}']
This appears to be due to warcs with entries like 'WARC-Target-URI', 'file:///0upUSSNF7rH/source/Badger Sow and Cubs - Yellowstone National Park.jpg'
The CDXLine objects are created here: https://github.com/webrecorder/pywb/blob/master/pywb/warcserver/index/cdxobject.py#L153
So I think the splitting has already happened, and redis has too many entries.
Processing locally with old pywb produces the same exception; Perma's CDXLine database does NOT contain cdx entries for the resources with spaces in their name/path.
I don't think we can recover from this. These warcs will never be indexed properly.
Since the screenshot of https://perma.cc/0upUSSNF7rH plays back, the error is evidently NOT fatal. So, we just have to decide what we want to do.
Options:
To be discussed.
For now, trying that last idea. Let's try this, watch the warning logs, and see how often this happens. And let's discuss with the Webrecorder team.
At least in prod, that attempted fix isn't working, or isn't working completely: the logging messages print, but we still get the CDX error.
More examples:
Jun 6 11:12:49 ip-172-31-48-6 docker-compose[31206]: app_1 | pywb.warcserver.index.cdxobject.CDXException: unknown 5-field cdx format: [b'file:///TKH8-R4VY/source/Signal', b'VIP.jpg', b'20150302104602', b'{"url":"file:///TKH8-R4VY/source/Signal', b'VIP.jpg","mime":"image/jpeg","digest":"N3OHGGHEBEJUMNUQRQXGGVCAS5IGPZ2D","length":"3352","offset":"2997372","filename":"rec-20190606111249068046-c61cd61fd791-5PXR5AGA.warc.gz"}']
and
Jun 6 19:42:00 ip-172-31-48-6 docker-compose[9588]: app_1 | pywb.warcserver.index.cdxobject.CDXException: unknown 5-field cdx format: [b'file:///088s3gtTLhg/source/Markey', b'Cover_sm.jpg', b'20150302105133', b'{"url":"file:///088s3gtTLhg/source/Markey', b'Cover_sm.jpg","mime":"image/jpeg","digest":"GJCFF2FW7YLHOCALZWJRR3RATXL2MS37","length":"13228","offset":"2034546","filename":"rec-20190606194159744143-96b1e2d3ba3a-OQ7VNGI3.warc.gz"}']
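To make the field counts in these messages concrete, here is a minimal sketch of how a space-delimited parse of a CDXJ-style line ends up with 5 fields when the URL contains one space. The function naive_cdxj_fields is a hypothetical, simplified illustration, not pywb's actual code, and the JSON payload is shortened from the log line above.

```python
def naive_cdxj_fields(line: bytes) -> list:
    """Split a CDXJ-style line on spaces (simplified illustration only)."""
    fields = line.rstrip().split(b" ", 2)
    if fields[-1].startswith(b"{"):
        # Expected shape: [urlkey, timestamp, json]
        return fields
    # The third chunk is not JSON, so keep splitting it on spaces.
    return fields[:-1] + fields[-1].split(b" ")


# Same entry twice: once with the space percent-encoded, once with a raw space.
good = (b'file:///TKH8-R4VY/source/Signal%20VIP.jpg 20150302104602 '
        b'{"url":"file:///TKH8-R4VY/source/Signal%20VIP.jpg","mime":"image/jpeg"}')
bad = (b'file:///TKH8-R4VY/source/Signal VIP.jpg 20150302104602 '
       b'{"url":"file:///TKH8-R4VY/source/Signal VIP.jpg","mime":"image/jpeg"}')

print(len(naive_cdxj_fields(good)))  # 3
print(len(naive_cdxj_fields(bad)))   # 5
```

One raw space in the key shifts the timestamp out of the second position, and the space inside the JSON "url" value splits the JSON as well, which matches the shape of the 5-field lists in the log.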
todo: find out if those really are regular spaces.
bytes vs str? non-breaking? what else could it be?
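One way to answer that, as a rough sketch: list every whitespace-like character in a value along with its code point and Unicode name. The helper describe_whitespace is hypothetical, and it assumes byte values are UTF-8.

```python
import unicodedata


def describe_whitespace(value):
    """Return (index, code point, Unicode name) for each whitespace-like character."""
    if isinstance(value, bytes):
        # Assumption: bytes are UTF-8; undecodable bytes become U+FFFD instead of raising.
        value = value.decode("utf-8", errors="replace")
    found = []
    for i, ch in enumerate(value):
        if ch.isspace() or unicodedata.category(ch) == "Zs":
            found.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")))
    return found


print(describe_whitespace(b"Signal VIP.jpg"))
print(describe_whitespace("Signal\u00a0VIP.jpg"))
```

A regular space reports as U+0020 SPACE and a non-breaking space as U+00A0 NO-BREAK SPACE, so running this over the WARC-Target-URI values from an affected warc should settle the question.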
The problem seems to have been with the technique I used to patch, which didn't get all the necessary places. I still have no idea why I can't get these error messages to log locally.
Pulling the thread on how these warcs were created: these all seem to be captures produced by wget, as directories of resources, converted by this Link method, possibly on the fly. Evidently rel_path or file_name occasionally had spaces in them.
I am now attempting to see if wget ever produces filenames like that now.
Ha, it totally does:
wget --adjust-extension --span-hosts --convert-links -e robots=off --page-requisites --no-directories --no-check-certificate "https://www.yellowstonenationalpark.com/wolves.htm"
MSC02VK05ZHTDG:fun2 rcremona$ ls
Artist Point - Yellowstone National Park.jpg
Badger Sow and Cubs - Yellowstone National Park.jpg
Castle Geyser Rainbow - Yellowstone National Park.jpg
Copy (2) of wolvesbar.JPG
Daisy Geyser - Yellowstone National Park.jpg
Elk Fighting Madison River - Yellowstone National Park.jpg
Elk Velvet - Yellowstone National Park.jpg
Elk in Fog - Yellowstone National Park.jpg
Firehole River - Yellowstone National Park.jpg
Grand Prismatic Spring.jpg
Grizzly and Cub - Yellowstone National Park.jpg
Hayden Alpha Femaile Wolf - Yellowstone National Park.jpg
Morning Glory Pool - Yellowstone National Park.jpg
Old Faithful - Yellowstone National Park.jpg
Old Faithful Evening - Yellowstone National Park.jpg
Snowcoach Skiers - Yellowstone National Park.jpg
Terraces - Yellowstone National Park.jpg
butt1.gif
butt2.gif
butt3.gif
butt4.gif
camping.gif
communities.gif
conservation.gif
contact.gif
dining.gif
easy_rotator.min.js
extra.css
ezcl.webp?cb=4
flyfishing.gif
hiking.gif
lodging.gif
maps.gif
p-31iz6hfFutd16.gif?labels=Domain.yellowstonenationalpark_com,DomainId.79055
p7AP3-08.css
p7AP3scripts.js
p7ap3-columns.css
p7ap3_east_black.gif
p7ap3_page_black.gif
p7ap3_south_black.gif
p?c1=2&c2=20015427&cv=2.0&cj=1
photography.gif
rochester.js?cb=184-1&v=8
shop.gif
show_ads.js
show_ads.js.1
skiing.gif
snowcoach.gif
snowmobiling.gif
titlebg.gif
tours.gif
waterfalls.gif
wildflowers.gif
wildlife order cover.jpg
wildlife.gif
wolf.JPG
wolves.htm
wolves.mp4
wonders cover.jpg
ynpphoto.JPG
ynptitle.gif
The current version of wget, run with the --warc option, correctly percent-encodes the spaces in the URLs: WARC-Target-URI: <https://www.yellowstonenationalpark.com/images/wonders%20cover.jpg>
Which makes sense. So, we may be the only archive that has warcs with spaces in some of their WARC-Target-URIs. I'll probably send a PR to warcio anyway, because it PROBABLY makes sense to validate data at that point, which is where the incorrect <> is also removed?
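A minimal sketch of what that validation could look like, using only the standard library. The function normalize_target_uri is a hypothetical helper for illustration; it is not part of warcio, and the choice of "safe" characters is an assumption.

```python
from urllib.parse import quote


def normalize_target_uri(uri: str) -> str:
    """Hypothetical helper: strip surrounding <> and percent-encode spaces."""
    uri = uri.strip()
    if uri.startswith("<") and uri.endswith(">"):
        uri = uri[1:-1]
    # Assumption: leave URL delimiters and existing %XX escapes untouched,
    # so already-encoded URIs pass through unchanged. A stray literal "%"
    # would also be left alone.
    return quote(uri, safe=":/?#[]@!$&'()*+,;=%~")


print(normalize_target_uri("<file:///0upUSSNF7rH/source/wonders cover.jpg>"))
# file:///0upUSSNF7rH/source/wonders%20cover.jpg
```

Treating "%" as safe keeps the function idempotent for URIs that wget has already encoded, at the cost of not catching malformed escapes.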
Okay, the reason I could not reproduce the error locally is that the prod warcs, downloaded from Perma via the UI, get extra, detailed warcinfo on top:
WARC/1.0
WARC-Type: warcinfo
WARC-Record-ID: <urn:uuid:f76f08f8-8ec8-11e9-80cb-120aa1957c28>
WARC-Filename: 0upUSSNF7rH.warc.gz
WARC-Date: 2019-06-14T17:22:20Z
Content-Type: application/warc-fields
Content-Length: 188
operator: Perma.cc download
Perma-GUID: 0upUSSNF7rH
format: WARC File Format 1.0
json-metadata: {"title": "Perma Archive, Yellowstone Park Wolves", "desc": null, "type": "collection"}
WARC/1.0
WARC-Type: warcinfo
WARC-Record-ID: <urn:uuid:f76f14c4-8ec8-11e9-80cb-120aa1957c28>
WARC-Filename: 0upUSSNF7rH.warc.gz
WARC-Date: 2019-06-14T17:22:20Z
Content-Type: application/warc-fields
Content-Length: 315
operator: Perma.cc download
Perma-GUID: 0upUSSNF7rH
format: WARC File Format 1.0
json-metadata: {"title": "Perma Archive of Yellowstone Park Wolves", "pages": [{"title": "Yellowstone Park Wolves", "timestamp": "20131121063810", "url": "http://www.yellowstonenationalpark.com/wolves.htm"}], "type": "recording"}
As a result, when I use the warc locally, pages is not None here: https://github.com/webrecorder/webrecorder/blob/master/webrecorder/webrecorder/models/importer.py#L288, which avoids the code path that was throwing the error in prod.
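To illustrate the difference between the two warcinfo records, here is a rough sketch that pulls the pages list out of a warcinfo payload's json-metadata field. This is not the importer's code; pages_from_warcinfo is a hypothetical helper, and it assumes the importer's pages value comes from this field.

```python
import json


def pages_from_warcinfo(payload: str):
    """Return the "pages" list from the json-metadata field, or None if absent."""
    for line in payload.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() == "json-metadata":
            return json.loads(value).get("pages")
    return None


collection_info = (
    "operator: Perma.cc download\n"
    'json-metadata: {"title": "Perma Archive, Yellowstone Park Wolves", '
    '"desc": null, "type": "collection"}\n'
)
recording_info = (
    "operator: Perma.cc download\n"
    'json-metadata: {"title": "Perma Archive of Yellowstone Park Wolves", '
    '"pages": [{"title": "Yellowstone Park Wolves", "timestamp": "20131121063810", '
    '"url": "http://www.yellowstonenationalpark.com/wolves.htm"}], "type": "recording"}\n'
)

print(pages_from_warcinfo(collection_info))
print(len(pages_from_warcinfo(recording_info)))
```

Only the second (recording) record carries a pages entry; a warc without that warcinfo would yield None and, under the assumption above, fall into the other code path.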
Great. But now I want to understand why other code that creates CDXLine() objects isn't throwing errors.
Documenting: this is no longer a problem. Investigating why the other CDXLine code hasn't been throwing exceptions is just for my edification. Leaving this open as a reminder this is something I should learn more about, so as to better understand how WR works.
not sure what GUID this was....