openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0
44 stars 4 forks source link

Zimit2: HTTP return codes are not handled properly #220

Closed benoit74 closed 4 months ago

benoit74 commented 5 months ago

There is two (at least) problem to consider:

rgaudin commented 5 months ago

Indeed ; that used to be handled by the SW via the Headers I believe. Interesting discussion ahead!

kelson42 commented 5 months ago
  • redirect 301, 302, 307 and 308 should lead to the creation of an alias in most situations (only redirections leading to the same ZIM path should be ignored, e.g. http to https redirection since we drop the scheme)

Not an alias, a redirect. ZIM redirects can only be internal, so redirects to external resources should be just ignored. Clearly an important thing to fix.

benoit74 commented 5 months ago

I propose that we first implement a first enhancement which will cover 80% of the situation.

WDYT?

rgaudin commented 5 months ago

add ZIM redirects for 301, 302, 307 and 308 to an internal URL

Of course, this is a regression

all other 2xx and 3xx return codes, and 301, 302, 307 and 308 to an external URL are either ignored (with a warning in the log) or stop the scraper

203, 204 and 205 can be easily supported. Other 20x are too niche or depends on headers (206 especially).

Regarding redirects to external URI, we should not include them in the ZIM. This is the reasonable behavior: if the user is online when reading, it could work. If he's not, it will fail in the most expect-able way possible. I see no reason to stop the crawler for this. Warning has limited value but it's cheap so let's do it.

all 1xx, 4xx and 5xx are simply ignored with a warning in the log (not sure we have any 1xx or 5xx in the WARC records in fact)

There should be 5xx records in WARCs but not for the seed URL now that we use the option to fail in that case. In zimit1 we chose to not include them in the ZIM as those are considered not actionnable by end-user and should thus be considered external ; offering possibility of online access.

benoit74 commented 5 months ago

Do you mean that for 203, we can simply store the content? Same is probably true for 202 then.

I don't get what you want to do with 204 (simply ignore it? no content is ... no content ... do you intend to add an empty ZIM entry just to indicate that URL is valid?) and 205 (simply store content and forget about the request from upstream to reset requesting document?)

rgaudin commented 5 months ago

Yes, that's exactly what would be in the WARCs and there's no reason to discard them as those don't rely on headers. We're probably talking about incredibly small number or entries anyway

benoit74 commented 5 months ago

We're probably talking about incredibly small number or entries anyway

Definitely, but better to fix as many cases as possible at once when they are simple rather than come back many time to it, you are right about mentioning these cases should be handled as well.

Jaifroid commented 5 months ago

203, 204 and 205 can be easily supported. Other 20x are too niche or depends on headers (206 especially).

So, what happens with 206 partial content responses if they have been recorded in the WARC Headers and Responses? In the case of zimit1, we (in KJS) have code to set our Response headers appropriately if we receive a WARC Response Header indicating a range.

For zimit2, I still need to know what to expect in the Response body, as we have to reconstruct Response headers in our Service Worker in KJS in order to reply to requests made by the browser. Are partial responses just discarded, or are they converted to full content responses, for example?

benoit74 commented 5 months ago

For now, all WARC records with an "unmanaged" HTTP response code will be ignored by warc2zim2 (e.g. 206).

This means that if there is not other WARC record with another HTTP response code in the WARC archive, the corresponding URLs found in documents will be considered as external to the ZIM and hence pointing to the online resource.

If the user is online, the website will be able to fetch the payload. If user is offline, result will depend on how the website handles a failing call to this URL.

Jaifroid commented 5 months ago

If the user is online, the website will be able to fetch the payload.

That would be risky and probably blocked by CORS, definitely blocked by the sandbox at least in KJS, but I'm pretty sure also in Kiwix Desktop.

I think how we currently handle 206 requests that the user makes (e.g. if the user moves the slider in a video quickly to another part of the video) is that we recognize the browser has made a 206 request, and we reconstruct the response as a 206 starting at the requested starting point, and ending at the end of the requested medium (video). This minimizes requests if the user skips around.

Maybe this is sufficient also to deal with any dynamic partial content requests made by the JS player (if it does stream that way). So, let's not worry about it unless we hit such issues in the wild. 🙂

benoit74 commented 4 months ago

Fixed by https://github.com/openzim/warc2zim/pull/244