internetarchive / liveweb

Liveweb proxy of the Wayback Machine project
https://web.archive.org/

Verify the ARC files created by the liveweb2.0 #16

Open anandology opened 12 years ago

anandology commented 12 years ago

Compare them with the ARC files written by Liveweb 1.0 and make sure CDX-writer works on them.
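Each record dumped in this thread starts with a one-line header of five space-separated fields: URL, IP address, 14-digit timestamp, content type, and payload length. A minimal sketch of a strict parser for that line follows; `ArcHeader` and `parse_arc_header` are hypothetical names used for illustration and are not part of liveweb or CDX-writer.

```python
from typing import NamedTuple


class ArcHeader(NamedTuple):
    url: str
    ip: str
    timestamp: str
    content_type: str
    length: int


def parse_arc_header(line: str) -> ArcHeader:
    """Parse one record header line, assuming exactly five single-space-separated fields."""
    # Splitting on a single space (not arbitrary whitespace) means that
    # padding between fields shows up as extra empty fields and is rejected.
    fields = line.rstrip("\n").split(" ")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    url, ip, timestamp, content_type, length = fields
    return ArcHeader(url, ip, timestamp, content_type, int(length))


header = parse_arc_header(
    "http://httpbin.org/user-agent 107.21.118.103 20120413113256 application/json 202\n"
)
print(header.content_type, header.length)  # application/json 202
```

Whether a given CDX tool is this strict about field separators is an assumption; a lenient reader could use `line.split()` instead.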

anandology commented 12 years ago
1 - Response from accessing http://httpbin.org/user-agent

Liveweb 1.0

http://httpbin.org/user-agent 107.21.118.103 20120413113256 application/json 202\n
HTTP/1.1 200 OK\r\n
Content-Type: application/json\r\n
Date: Fri, 13 Apr 2012 11:32:56 GMT\r\n
Server: gunicorn/0.13.4\r\n
Content-Length: 45\r\n
Connection: keep-alive\r\n
\r\n
{\n
  "user-agent": "ia_archiver(OS-Wayback)"\n
}\n

Liveweb 2.0

http://httpbin.org/user-agent 204.236.238.79 20120413113046 application/json      202\n
HTTP/1.1 200 OK\r\n
Content-Type: application/json\r\n
Date: Fri, 13 Apr 2012 11:30:46 GMT\r\n
Server: gunicorn/0.13.4\r\n
Content-Length: 45\r\n
Connection: keep-alive\r\n
\r\n
{\n
  "user-agent": "ia_archiver(OS-Wayback)"\n
}

Differences:

anandology commented 12 years ago

2 - Accessing a website that doesn't exist. Tried with http://nosite/

Liveweb 1.0

http://nosite/ 0.0.0.0 20120413113604 unk 22\n
HTTP 502 Bad Gateway\n
\n
\n

Liveweb 2.0

http://nohost/ 0.0.0.0 20120413122918 unk 22\n
HTTP 502 Bad Gateway\n
\n
\n

Matches exactly!

anandology commented 12 years ago

Part 1 seems to be fixed now.

http://httpbin.org/user-agent 107.21.118.103 20120413123533 application/json 202\n
HTTP/1.1 200 OK\r\n
Content-Type: application/json\r\n
Date: Fri, 13 Apr 2012 12:35:33 GMT\r\n
Server: gunicorn/0.13.4\r\n
Content-Length: 45\r\n
Connection: keep-alive\r\n
\r\n
{\n
  "user-agent": "ia_archiver(OS-Wayback)"\n
}\n

It now matches the Liveweb 1.0 response.

anandology commented 12 years ago

3 - Testing 404

Liveweb 1.0

http://httpbin.org/status/404 107.21.123.247 20120413124130 text/html 171\n
HTTP/1.1 404 NOT FOUND\r\n
Content-Type: text/html; charset=utf-8\r\n
Date: Fri, 13 Apr 2012 12:41:30 GMT\r\n
Server: gunicorn/0.13.4\r\n
Content-Length: 0\r\n
Connection: keep-alive\r\n
\r\n
\n

Liveweb 2.0

http://httpbin.org/status/404 107.21.118.103 20120413124555 text/html 171\n
HTTP/1.1 404 NOT FOUND\r\n
Content-Type: text/html; charset=utf-8\r\n
Date: Fri, 13 Apr 2012 12:45:55 GMT\r\n
Server: gunicorn/0.13.4\r\n
Content-Length: 0\r\n
Connection: keep-alive\r\n
\r\n
\n

Matched.
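Records fetched at different times always differ in IP address, timestamp, and the `Date` response header, so a by-eye comparison has to skip those. A small sketch that masks these volatile fields before comparing, using the two 404 records; `normalize` and `TEMPLATE` are hypothetical names introduced for this example:

```python
import re

TEMPLATE = (
    "http://httpbin.org/status/404 {ip} {ts} text/html 171\n"
    "HTTP/1.1 404 NOT FOUND\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "Date: {date}\r\n"
    "Server: gunicorn/0.13.4\r\n"
    "Content-Length: 0\r\n"
    "Connection: keep-alive\r\n"
    "\r\n"
    "\n"
)


def normalize(record: str) -> str:
    """Mask the IP, timestamp, and Date header so only stable content remains."""
    first, _, rest = record.partition("\n")
    url, _ip, _ts, content_type, length = first.split()
    # (?m) lets ^ match at the start of each header line.
    rest = re.sub(r"(?m)^Date: [^\r\n]*", "Date: <date>", rest)
    return f"{url} <ip> <timestamp> {content_type} {length}\n{rest}"


old = TEMPLATE.format(ip="107.21.123.247", ts="20120413124130",
                      date="Fri, 13 Apr 2012 12:41:30 GMT")
new = TEMPLATE.format(ip="107.21.118.103", ts="20120413124555",
                      date="Fri, 13 Apr 2012 12:45:55 GMT")

print(old == new)                        # False
print(normalize(old) == normalize(new))  # True
```

Fields that are not masked, such as the URL, content type, and length, still have to agree for two records to compare equal.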

anandology commented 12 years ago

4 - Testing 302

Liveweb 1.0

http://httpbin.org/status/302 107.21.123.247 20120413124439 unk 168\n
HTTP/1.1 302 FOUND\r\n
Date: Fri, 13 Apr 2012 12:44:39 GMT\r\n
Location: http://httpbin.org/redirect/1\r\n
Server: gunicorn/0.13.4\r\n
Content-Length: 0\r\n
Connection: keep-alive\r\n
\r\n
\n

Liveweb 2.0

http://httpbin.org/status/302 107.21.118.103 20120413124702 application/octet-stream 168\n
HTTP/1.1 302 FOUND\r\n
Date: Fri, 13 Apr 2012 12:47:01 GMT\r\n
Location: http://httpbin.org/redirect/1\r\n
Server: gunicorn/0.13.4\r\n
Content-Length: 0\r\n
Connection: keep-alive\r\n
\r\n
\n

Matched.

rajbot commented 12 years ago

I can now create CDX files from the ARC files produced by liveweb.

rajbot commented 12 years ago

Is this fixed now?

anandology commented 12 years ago

@nibrahim is writing test cases for this.