ukwa / ukwa-pywb

HTTP 304 being returned - should these records be dropped? #40

Open anjackson opened 5 years ago

anjackson commented 5 years ago

We have a specific Twitter URL recorded as a 304 Not Modified (captured when the browser-based renderer checked its cache).

https://www.webarchive.org.uk/wayback/archive/20190312214115/https://ton.twimg.com/tfw/js/native_bundle_v1_2c6dc837d4dedb22fff5faf9e125064ba2cbeb8a.js

It seems to be played back as a 304 rather than skipped. The CDX records look like:

...
com,twimg,ton)/tfw/js/native_bundle_v1_2c6dc837d4dedb22fff5faf9e125064ba2cbeb8a.js 20190312213613 https://ton.twimg.com/tfw/js/native_bundle_v1_2c6dc837d4dedb22fff5faf9e125064ba2cbeb8a.js application/javascript 200 NMGUH2FASUEU67TMO7EM7Y7FJ7W5ZUNI 0
com,twimg,ton)/tfw/js/native_bundle_v1_2c6dc837d4dedb22fff5faf9e125064ba2cbeb8a.js 20190312213614 https://ton.twimg.com/tfw/js/native_bundle_v1_2c6dc837d4dedb22fff5faf9e125064ba2cbeb8a.js application/http 304 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 0
com,twimg,ton)/tfw/js/native_bundle_v1_2c6dc837d4dedb22fff5faf9e125064ba2cbeb8a.js 20190312213615 https://ton.twimg.com/tfw/js/native_bundle_v1_2c6dc837d4dedb22fff5faf9e125064ba2cbeb8a.js application/http 304 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 0
com,twimg,ton)/tfw/js/native_bundle_v1_2c6dc837d4dedb22fff5faf9e125064ba2cbeb8a.js 20190312213616 https://ton.twimg.com/tfw/js/native_bundle_v1_2c6dc837d4dedb22fff5faf9e125064ba2cbeb8a.js application/http 304 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 0
...

anjackson commented 9 months ago

These should probably be filtered out during indexing and/or when querying the index (as per #30). This is not technically anything to do with PyWB itself.
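
As a rough illustration of the indexing-time option, here is a minimal sketch that drops 304 records from a list of CDX lines. It assumes the space-separated field order seen in the records above (status code as the fifth field). The function names `is_not_modified` and `drop_304_records` and the sample lines are made up for this example; this is not how PyWB or any indexer actually does it.

```python
from typing import Iterable, List


def is_not_modified(cdx_line: str, status_index: int = 4) -> bool:
    """Return True if the CDX line records an HTTP 304 response.

    Assumes space-separated fields with the status code at position
    `status_index` (fifth field in the records quoted in this issue).
    """
    fields = cdx_line.split()
    return len(fields) > status_index and fields[status_index] == "304"


def drop_304_records(cdx_lines: Iterable[str]) -> List[str]:
    """Keep every CDX line except those recording a 304 Not Modified."""
    return [line for line in cdx_lines if not is_not_modified(line)]


# Hypothetical sample lines in the same layout as the records above.
sample = [
    "com,example)/a.js 20190312213613 https://example.com/a.js application/javascript 200 AAAA 0",
    "com,example)/a.js 20190312213614 https://example.com/a.js application/http 304 BBBB 0",
    "com,example)/a.js 20190312213615 https://example.com/a.js application/http 304 BBBB 0",
]

kept = drop_304_records(sample)
for line in kept:
    print(line)
```

With the 304 lines removed, the closest remaining capture for that URL would be the 200 record, which is what playback should resolve to.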