unknown 17-field cdx format

rebeccacremona commented 5 years ago

not sure what GUID this was....

    May 30 17:10:23 ip-172-31-48-6 docker-compose[644]: nginx_1       | 172.69.63.9 - - [30/May/2019:17:10:23 +0000] "GET /api/v1/upload/4LG3ZDPO?user=public HTTP/1.1" 200 217 "-" "python-requests/2.20.0"
    May 30 17:10:23 ip-172-31-48-6 docker-compose[644]: app_1         | [pid: 15|app: 0|req: 46487/276388] 52.91.77.74 () {56 vars in 877 bytes} [Thu May 30 17:10:23 2019] GET /api/v1/upload/4LG3ZDPO?user=public => generated 217 bytes in 6 msecs (HTTP/1.1 200) 2 headers in 72 bytes (3 switches on core 393)
    May 30 17:10:23 ip-172-31-48-6 docker-compose[644]: app_1         | [pid: 10|app: 0|req: 62583/276389] 162.158.117.137 () {36 vars in 555 bytes} [Thu May 30 17:10:23 2019] GET /api/v1 => generated 36088 bytes in 147 msecs (HTTP/1.1 200) 2 headers in 67 bytes (3 switches on core 394)
    May 30 17:10:23 ip-172-31-48-6 docker-compose[644]: nginx_1       | 162.158.117.137 - - [30/May/2019:17:10:23 +0000] "GET /api/v1 HTTP/1.1" 200 36088 "-" "Mozilla/5.0 (compatible; Cloudflare-Traffic-Manager/1.0; +https://www.cloudflare.com/traffic-manager/; pool-id: 857383f5bcfc41d0)"
    May 30 17:10:23 ip-172-31-48-6 docker-compose[644]: app_1         | Traceback (most recent call last):
    May 30 17:10:23 ip-172-31-48-6 docker-compose[644]: app_1         |   File "./webrecorder/models/importer.py", line 244, in run_upload
    May 30 17:10:23 ip-172-31-48-6 docker-compose[644]: app_1         |     self.process_pages(info, page_id_map)
    May 30 17:10:23 ip-172-31-48-6 docker-compose[644]: app_1         |   File "./webrecorder/models/importer.py", line 289, in process_pages
    May 30 17:10:23 ip-172-31-48-6 docker-compose[644]: app_1         |     pages = self.detect_pages(info['coll'], info['rec'])
    May 30 17:10:23 ip-172-31-48-6 docker-compose[644]: app_1         |   File "./webrecorder/models/importer.py", line 455, in detect_pages
    May 30 17:10:23 ip-172-31-48-6 docker-compose[644]: app_1         |     cdxj = CDXObject(member.encode('utf-8'))
    May 30 17:10:23 ip-172-31-48-6 docker-compose[644]: app_1         |   File "/usr/local/lib/python3.7/site-packages/pywb-2.3.0.dev0-py3.7.egg/pywb/warcserver/index/cdxobject.py", line 153, in __init__
    May 30 17:10:23 ip-172-31-48-6 docker-compose[644]: app_1         |     raise CDXException(msg)
    May 30 17:10:23 ip-172-31-48-6 docker-compose[644]: app_1         | pywb.warcserver.index.cdxobject.CDXException: unknown 17-field cdx format

rebeccacremona commented 5 years ago

Example from a similar warc, with the enhanced logging on:

    Jun  5 17:46:03 ip-172-31-60-73 docker-compose[32104]: app_1         |     raise CDXException(msg)
    Jun  5 17:46:03 ip-172-31-60-73 docker-compose[32104]: app_1         | pywb.warcserver.index.cdxobject.CDXException: unknown 27-field cdx format: [b"file:///0kwG5NyyFJ9/source/ArticleContentPage;pos=1;tile='", b'+', b'dc_tile', b'+', b"';Article=LatestNews;'", b'+', b'dcoptTag', b'+', b"'sz=160x600;ord='", b'+', b'ord', b'+', b"'", b'20150302105657', b'{"url":"file:///0kwG5NyyFJ9/source/ArticleContentPage;pos=1;tile=\'', b'+', b'dc_tile', b'+', b"';Article=LatestNews;'", b'+', b'dcoptTag', b'+', b"'sz=160x600;ord='", b'+', b'ord', b'+', b'\'","mime":"unk","digest":"S7XBFXLBHON6VPPRQAVQ4VQECS5KVD4L","length":"919","offset":"2228210","filename":"rec-20190605174603357843-20fdddd86300-C4JQFCKH.warc.gz"}']

rebeccacremona commented 5 years ago

This appears to be due to warcs with entries like 'WARC-Target-URI', 'file:///0upUSSNF7rH/source/Badger Sow and Cubs - Yellowstone National Park.jpg'

rebeccacremona commented 5 years ago

The CDXLine objects are created here: https://github.com/webrecorder/pywb/blob/master/pywb/warcserver/index/cdxobject.py#L153

So I think the splitting has already happened, and redis has too many entries.

rebeccacremona commented 5 years ago

Processing locally with old pywb produces the same exception; Perma's CDXLine database does NOT contain cdx entries for the resources with spaces in their name/path.

I don't think we can recover from this. These warcs will never be indexed properly.

Since the screenshot of https://perma.cc/0upUSSNF7rH plays back, the error is evidently NOT fatal. So, we just have to decide what we want to do.

Options:

- After we watch a few more go by and satisfy ourselves that only old, possibly otherwise corrupt warcs have these bad spaces in them, we can silence all CDX indexing errors going forward.
- We can comment out or otherwise remove the problematic entries from these warcs, as the errors surface.
- We can attempt to replace the spaces with %20 or something else, as the errors surface, so that indexing can complete; playback will likely not improve, but the errors will stop.
- Perhaps I can adjust https://github.com/webrecorder/warcio/blob/c64c4394805e13256695f51af072c95389397ee9/warcio/recordloader.py#L217 to disallow spaces.

To be discussed.
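
For the "replace the spaces with %20" option, a minimal sketch of the substitution might look like the following. The helper name `escape_spaces` is hypothetical and is not part of warcio or pywb; it only illustrates the string transformation, not where in the indexing pipeline it would need to be applied.

```python
def escape_spaces(target_uri: str) -> str:
    """Percent-encode literal spaces, leaving every other character as-is."""
    return target_uri.replace(" ", "%20")


# Target URI taken from one of the affected warcs mentioned in this thread.
uri = "file:///0upUSSNF7rH/source/Badger Sow and Cubs - Yellowstone National Park.jpg"
print(escape_spaces(uri))
```

A plain `replace` is used here rather than a full URL-quoting routine so that characters that are already percent-encoded are not encoded a second time.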

rebeccacremona commented 5 years ago

For now, trying that last idea. Let's try this, watch the warning logs, and see how often this happens. And let's discuss with the Webrecorder team.

rebeccacremona commented 5 years ago

At least in prod, that attempted fix isn't working, or isn't working completely: the logging messages print, but we still get the CDX error.

More examples:

    Jun  6 11:12:49 ip-172-31-48-6 docker-compose[31206]: app_1         | pywb.warcserver.index.cdxobject.CDXException: unknown 5-field cdx format: [b'file:///TKH8-R4VY/source/Signal', b'VIP.jpg', b'20150302104602', b'{"url":"file:///TKH8-R4VY/source/Signal', b'VIP.jpg","mime":"image/jpeg","digest":"N3OHGGHEBEJUMNUQRQXGGVCAS5IGPZ2D","length":"3352","offset":"2997372","filename":"rec-20190606111249068046-c61cd61fd791-5PXR5AGA.warc.gz"}']

and

    Jun  6 19:42:00 ip-172-31-48-6 docker-compose[9588]: app_1         | pywb.warcserver.index.cdxobject.CDXException: unknown 5-field cdx format: [b'file:///088s3gtTLhg/source/Markey', b'Cover_sm.jpg', b'20150302105133', b'{"url":"file:///088s3gtTLhg/source/Markey', b'Cover_sm.jpg","mime":"image/jpeg","digest":"GJCFF2FW7YLHOCALZWJRR3RATXL2MS37","length":"13228","offset":"2034546","filename":"rec-20190606194159744143-96b1e2d3ba3a-OQ7VNGI3.warc.gz"}']
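
The field counts in these errors are consistent with a space-delimited index line being split on the spaces inside the URL. The sketch below is a simplified model of that behavior, written as an assumption for illustration; it is not pywb's actual code, and `count_cdx_fields` is a hypothetical name. The JSON portion of the sample lines is shortened.

```python
def count_cdx_fields(line: bytes) -> int:
    """Count fields the way a space-delimited CDX/CDXJ reader might.

    Simplified model (an assumption, not pywb's code): the first two
    space-separated tokens are the key and the timestamp. If the remainder
    is a JSON object the line has three fields; otherwise the remainder is
    split on every space as well.
    """
    fields = line.rstrip().split(b" ", 2)
    if fields[-1].startswith(b"{"):
        return len(fields)
    return len(fields[:-1]) + len(fields[-1].split(b" "))


# Space percent-encoded: key, timestamp, JSON.
good = b'file:///TKH8-R4VY/source/Signal%20VIP.jpg 20150302104602 {"url":"file:///TKH8-R4VY/source/Signal%20VIP.jpg","mime":"image/jpeg"}'
# Literal space in the file name: the key and the JSON each break in two.
bad = b'file:///TKH8-R4VY/source/Signal VIP.jpg 20150302104602 {"url":"file:///TKH8-R4VY/source/Signal VIP.jpg","mime":"image/jpeg"}'

print(count_cdx_fields(good))  # 3
print(count_cdx_fields(bad))   # 5
```

Under this model one space in the file name yields the 5 fields reported above, and a URL with twelve spaces yields 13 + 1 + 13 = 27, matching the earlier 27-field error.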

rebeccacremona commented 5 years ago

todo: find out if those really are regular spaces.

rebeccacremona commented 5 years ago

bytes vs str? non-breaking? what else could it be?
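
One way to answer this is to list every whitespace character in a suspect name together with its code point. This is a generic sketch using only the standard library; `describe_whitespace` is a hypothetical helper name.

```python
import unicodedata


def describe_whitespace(text: str) -> list:
    """List (index, code point, Unicode name) for each whitespace character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED"))
        for i, ch in enumerate(text)
        if ch.isspace()
    ]


print(describe_whitespace("Signal VIP.jpg"))
print(describe_whitespace("Signal\u00a0VIP.jpg"))
```

A regular space reports as `U+0020 SPACE` and a non-breaking space as `U+00A0 NO-BREAK SPACE`. If the value is `bytes`, it needs to be decoded (for example with `.decode('utf-8')`) before being passed in.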

rebeccacremona commented 5 years ago

The problem seems to have been with the technique I used to patch, which didn't get all the necessary places. I still have no idea why I can't get these error messages to log locally.

Pulling the thread on how these warcs were created.... these all seem to be captures produced by wget, as directories of resources, converted by this Link method, possibly on the fly. Evidently rel_path or file_name occasionally had spaces in them.

I am now attempting to see if wget ever produces filenames like that now.

rebeccacremona commented 5 years ago

Ha, it totally does:

    wget --adjust-extension --span-hosts --convert-links -e robots=off --page-requisites --no-directories --no-check-certificate "https://www.yellowstonenationalpark.com/wolves.htm"

    MSC02VK05ZHTDG:fun2 rcremona$ ls
    Artist Point - Yellowstone National Park.jpg
    Badger Sow and Cubs - Yellowstone National Park.jpg
    Castle Geyser Rainbow - Yellowstone National Park.jpg
    Copy (2) of wolvesbar.JPG
    Daisy Geyser - Yellowstone National Park.jpg
    Elk Fighting Madison River - Yellowstone National Park.jpg
    Elk Velvet - Yellowstone National Park.jpg
    Elk in Fog - Yellowstone National Park.jpg
    Firehole River - Yellowstone National Park.jpg
    Grand Prismatic Spring.jpg
    Grizzly and Cub - Yellowstone National Park.jpg
    Hayden Alpha Femaile Wolf - Yellowstone National Park.jpg
    Morning Glory Pool - Yellowstone National Park.jpg
    Old Faithful - Yellowstone National Park.jpg
    Old Faithful Evening - Yellowstone National Park.jpg
    Snowcoach Skiers - Yellowstone National Park.jpg
    Terraces - Yellowstone National Park.jpg
    butt1.gif
    butt2.gif
    butt3.gif
    butt4.gif
    camping.gif
    communities.gif
    conservation.gif
    contact.gif
    dining.gif
    easy_rotator.min.js
    extra.css
    ezcl.webp?cb=4
    flyfishing.gif
    hiking.gif
    lodging.gif
    maps.gif
    p-31iz6hfFutd16.gif?labels=Domain.yellowstonenationalpark_com,DomainId.79055
    p7AP3-08.css
    p7AP3scripts.js
    p7ap3-columns.css
    p7ap3_east_black.gif
    p7ap3_page_black.gif
    p7ap3_south_black.gif
    p?c1=2&c2=20015427&cv=2.0&cj=1
    photography.gif
    rochester.js?cb=184-1&v=8
    shop.gif
    show_ads.js
    show_ads.js.1
    skiing.gif
    snowcoach.gif
    snowmobiling.gif
    titlebg.gif
    tours.gif
    waterfalls.gif
    wildflowers.gif
    wildlife order cover.jpg
    wildlife.gif
    wolf.JPG
    wolves.htm
    wolves.mp4
    wonders cover.jpg
    ynpphoto.JPG
    ynptitle.gif

rebeccacremona commented 5 years ago

The current version of wget, run with the --warc option, correctly percent-encodes the spaces in the URLs, e.g. WARC-Target-URI: <https://www.yellowstonenationalpark.com/images/wonders%20cover.jpg>

Which makes sense. So, we may be the only archive that has warcs with spaces in some of their WARC-Target-URIs. I'll probably send a PR to warcio anyway, because it PROBABLY makes sense to validate data at that point, which is where the incorrect <> is also removed?
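
To check how widespread the problem is, the WARC headers can be scanned for target URIs that still contain a literal space. The sketch below works on already-decoded header lines and uses only the standard library; `target_uris_with_spaces` is a hypothetical helper, not a warcio function, and reading and decompressing the warc is left out.

```python
def target_uris_with_spaces(header_lines):
    """Yield WARC-Target-URI values that contain a literal space.

    `header_lines` is any iterable of decoded WARC header lines. Surrounding
    angle brackets, which some writers add, are stripped before the check.
    """
    prefix = "warc-target-uri:"
    for line in header_lines:
        if line.lower().startswith(prefix):
            uri = line[len(prefix):].strip().strip("<>")
            if " " in uri:
                yield uri


sample = [
    "WARC-Type: resource",
    "WARC-Target-URI: file:///0upUSSNF7rH/source/Badger Sow and Cubs - Yellowstone National Park.jpg",
    "WARC-Target-URI: <https://www.yellowstonenationalpark.com/images/wonders%20cover.jpg>",
]
print(list(target_uris_with_spaces(sample)))
```

Only the first `file:///` URI is reported; the percent-encoded wget URI passes.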

rebeccacremona commented 5 years ago

Okay, the reason I could not reproduce the error locally is that the prod warcs, downloaded from Perma via the UI, get extra, detailed warcinfo on top:

    WARC/1.0
    WARC-Type: warcinfo
    WARC-Record-ID: <urn:uuid:f76f08f8-8ec8-11e9-80cb-120aa1957c28>
    WARC-Filename: 0upUSSNF7rH.warc.gz
    WARC-Date: 2019-06-14T17:22:20Z
    Content-Type: application/warc-fields
    Content-Length: 188

    operator: Perma.cc download
    Perma-GUID: 0upUSSNF7rH
    format: WARC File Format 1.0
    json-metadata: {"title": "Perma Archive, Yellowstone Park Wolves", "desc": null, "type": "collection"}

    WARC/1.0
    WARC-Type: warcinfo
    WARC-Record-ID: <urn:uuid:f76f14c4-8ec8-11e9-80cb-120aa1957c28>
    WARC-Filename: 0upUSSNF7rH.warc.gz
    WARC-Date: 2019-06-14T17:22:20Z
    Content-Type: application/warc-fields
    Content-Length: 315

    operator: Perma.cc download
    Perma-GUID: 0upUSSNF7rH
    format: WARC File Format 1.0
    json-metadata: {"title": "Perma Archive of Yellowstone Park Wolves", "pages": [{"title": "Yellowstone Park Wolves", "timestamp": "20131121063810", "url": "http://www.yellowstonenationalpark.com/wolves.htm"}], "type": "recording"}

As a result, when I use the warc locally, pages is not None here (https://github.com/webrecorder/webrecorder/blob/master/webrecorder/webrecorder/models/importer.py#L288), so we avoid the code path that was throwing the error in prod.
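
To tell in advance which path a given warc is likely to take, the `json-metadata` field of each warcinfo payload can be checked for a `pages` entry. This is a standalone sketch under the assumption that the importer takes its pages from that field; `warcinfo_pages` is a hypothetical helper, not Webrecorder code.

```python
import json


def warcinfo_pages(payload: str):
    """Return the "pages" list from a warcinfo payload, or None if absent."""
    for line in payload.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "json-metadata":
            return json.loads(value).get("pages")
    return None


collection_info = (
    "operator: Perma.cc download\n"
    "Perma-GUID: 0upUSSNF7rH\n"
    'json-metadata: {"title": "Perma Archive, Yellowstone Park Wolves", "desc": null, "type": "collection"}\n'
)
recording_info = (
    "operator: Perma.cc download\n"
    "Perma-GUID: 0upUSSNF7rH\n"
    'json-metadata: {"title": "Perma Archive of Yellowstone Park Wolves", "pages": [{"title": "Yellowstone Park Wolves", '
    '"timestamp": "20131121063810", "url": "http://www.yellowstonenationalpark.com/wolves.htm"}], "type": "recording"}\n'
)

print(warcinfo_pages(collection_info))
print(warcinfo_pages(recording_info))
```

The collection-level record has no `pages` key, while the recording-level record carries the single page shown above.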

rebeccacremona commented 5 years ago

Great. But now I want to understand why other code that creates CDXLine() objects isn't throwing errors.

rebeccacremona commented 5 years ago

Documenting: this is no longer a problem. Investigating why the other CDXLine code hasn't been throwing exceptions is just for my edification. Leaving this open as a reminder this is something I should learn more about, so as to better understand how WR works.

harvard-lil / perma

unknown 17-field cdx format #2605