iipc / warc-specifications

Centralised repository for WARC usage specifications.
http://iipc.github.io/warc-specifications/

WARC-Resource-Type field possibilities (feedback wanted) #96

Open ikreymer opened 6 months ago

ikreymer commented 6 months ago

Browsers have different ways of reporting the 'resource type' for any resource that's being fetched. When using browser-based crawling, it is often easy to access this 'resource type' and store it in a custom WARC header.

It is possible to introduce a WARC-Resource-Type header to store this type. Unfortunately, there isn't a single standard of 'resource types' and various browser APIs expose different variations on this.

If a resource type is written to a WARC header, is there a way to make it future-proof so that it can support different vocabularies?

Some possibilities include:

One approach to make this more future-proof might be to prefix the resourceType with a namespace based on where the data is coming from and which vocabulary is used.

For example, the value might be cdp:Document or cdp:Image if using CDP; webRequest:sub_frame or webRequest:image if using webRequest; destination:image or destination:document if using destination; etc.

This allows for expanding into other vocabularies in the future, but may be harder to parse.
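To give a sense of the parsing cost, here is a minimal sketch of how a reader might split a prefixed value. The `ResourceType` tuple and `parse_resource_type` function are hypothetical names for illustration only, and the sketch assumes the first colon separates the vocabulary from the value and that unprefixed values are still accepted.

```python
from typing import NamedTuple, Optional


class ResourceType(NamedTuple):
    # Hypothetical container: vocabulary is None when the value has no prefix.
    vocabulary: Optional[str]
    value: str


def parse_resource_type(header_value: str) -> ResourceType:
    """Split 'vocabulary:value' on the first colon (an assumed convention)."""
    text = header_value.strip()
    vocabulary, sep, value = text.partition(":")
    if not sep:
        # No prefix present, e.g. a bare 'Image'.
        return ResourceType(None, text)
    return ResourceType(vocabulary, value)


print(parse_resource_type("cdp:Document"))
print(parse_resource_type("webRequest:sub_frame"))
print(parse_resource_type("Image"))
```

A reader that only understands one vocabulary could then ignore records whose `vocabulary` it does not recognise.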

Alternatively, there could be a fixed, allowed vocabulary made up of the values common to at least 2 of the above, which might be: document, image, media, script, stylesheet, font, ping, websocket, fetch and a catch-all other.

(In this case, we should specify how the more specific values are recorded, e.g. main_frame / sub_frame would be recorded as document)
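A rough sketch of what such a normalisation could look like is below. Only the main_frame / sub_frame to document mapping comes from the proposal; the `xhr` and `xmlhttprequest` entries, the case-folding, and the function name `normalize_resource_type` are assumptions made for illustration.

```python
# The fixed vocabulary suggested above.
COMMON_TYPES = {
    "document", "image", "media", "script", "stylesheet",
    "font", "ping", "websocket", "fetch", "other",
}

# Mapping from more specific values to the common vocabulary.
SPECIFIC_TO_COMMON = {
    "main_frame": "document",   # from the proposal
    "sub_frame": "document",    # from the proposal
    "xhr": "fetch",             # assumption, not part of the proposal
    "xmlhttprequest": "fetch",  # assumption, not part of the proposal
}


def normalize_resource_type(raw: str) -> str:
    """Fold case, apply the mapping, and fall back to the catch-all 'other'."""
    key = raw.strip().lower()
    key = SPECIFIC_TO_COMMON.get(key, key)
    return key if key in COMMON_TYPES else "other"


for raw in ["main_frame", "Document", "XHR", "Manifest"]:
    print(raw, "->", normalize_resource_type(raw))
```

Anything not covered by the table, such as `Manifest` here, ends up as `other`, which is lossy; that is the trade-off against the namespaced approach.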

Other thoughts / suggestions welcome!

ikreymer commented 6 months ago

I should note our initial implementation just stores the Chrome CDP value, e.g. WARC-Resource-Type: Document, WARC-Resource-Type: Image, etc., without a prefix, as that was the easiest to try. We could also just keep that, but wanted to see if there were any thoughts on the above proposals. Other tools that work directly with Chrome Debug Protocol, such as Brozzler or the Chrome Extractor for Heritrix, would actually have the same vocabulary as well, so this may not be an immediate concern. It's mostly a question of other tools / future-proofing to support vocabulary not coming from CDP, if such a header were to be standardized.

tw4l commented 6 months ago

Note that Puppeteer and Playwright use the CDP values but lowercased: https://playwright.dev/docs/api/class-request#:~:text=resourceType%E2%80%8B&text=ResourceType%20will%20be%20one%20of,%2C%20websocket%20%2C%20manifest%20%2C%20other%20

tw4l commented 6 months ago

Playwright mapping for Firefox: https://github.com/microsoft/playwright/blob/73ffaf65d75b2378168ac5a11eb37cced03ff6ea/packages/playwright-core/src/server/firefox/ffNetworkManager.ts#L161

ato commented 6 months ago

Do we have any use cases in mind for this field when reading the WARC?

I guess one might be listing all the top-level crawled documents. This can't be done accurately by Content-Type alone as XHR/Fetch requests can have text/html responses.

The main_frame/sub_frame distinction also seems interesting for that use case. It's not in the CDP resource type but if we map to one of the other vocabularies presumably it could be determined from the frameId?

I guess the hopsFromSeed metadata field could be used for listing top-level crawled documents but it's coarse-grained and doesn't make distinctions between different kinds of embedded content.

It's also possible for an image to have a text/html Content-Type and still display correctly due to MIME sniffing. So similarly if you wanted to do something with all the images in a crawl, Content-Type alone is insufficient.
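As a rough illustration of that use case, the sketch below filters already-parsed records by resource type instead of Content-Type. The `Record` class and its field names are hypothetical stand-ins for whatever a WARC reader returns, the sample data is made up, and the unprefixed CDP-style values (Document, Fetch, Image) are assumed. With CDP values, the result would include sub-frame documents as well as top-level ones.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Record:
    # Hypothetical, already-parsed view of a WARC response record.
    target_uri: str
    resource_type: str      # value of the WARC-Resource-Type header
    http_content_type: str  # Content-Type of the HTTP response


def urls_by_resource_type(records: Iterable[Record], wanted: str) -> List[str]:
    """Return target URIs whose resource type matches, ignoring case."""
    return [r.target_uri for r in records
            if r.resource_type.lower() == wanted.lower()]


# Made-up sample: all three responses claim text/html.
records = [
    Record("https://example.com/", "Document", "text/html"),
    Record("https://example.com/api/fragment", "Fetch", "text/html"),
    Record("https://example.com/logo", "Image", "text/html"),
]

print(urls_by_resource_type(records, "document"))
print(urls_by_resource_type(records, "image"))
```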

tw4l commented 6 months ago

We've added this to our WARCs in response to a user-submitted issue: https://github.com/webrecorder/browsertrix-crawler/issues/451, with the primary use case being differentiating between resources fetched by JavaScript (via fetch, xhr) versus resources loaded directly from the HTML.

edsu commented 6 months ago

This is probably off topic for this issue, but it came up recently in the context of using mailbagit that it would be useful to know if a record is for a seed URL. Or is there another common way of doing that? The motivation here is to be able to pick out URLs from the WARC data to serve as entry points during replay.

ato commented 6 months ago

> it would be useful to know if a record is for a seed URL. Or is there another common way of doing that?

For WARCs created by Heritrix a metadata record without the via and hopsFromSeed fields is indicative of a seed. If the crawler doesn't populate those fields though I don't think there's a reliable way to tell from a WARC file alone. Requests without a Referer header might also be indicative for some crawlers but not ones that obey Referrer-Policy: no-referrer.
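A minimal sketch of the Heritrix heuristic, assuming the metadata record payload has already been extracted as text with one `name: value` field per line. The function names and the sample payloads are made up for illustration, and the check is only meaningful for crawlers that populate these fields.

```python
def parse_warc_fields(payload: str) -> dict:
    """Parse 'name: value' lines into a dict, splitting on the first colon."""
    fields = {}
    for line in payload.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            fields[name.strip()] = value.strip()
    return fields


def looks_like_seed(metadata_payload: str) -> bool:
    """Heuristic: neither 'via' nor 'hopsFromSeed' is present."""
    fields = parse_warc_fields(metadata_payload)
    return "via" not in fields and "hopsFromSeed" not in fields


# Made-up payloads for illustration.
seed_payload = "fetchTimeMs: 120\n"
linked_payload = "via: https://example.com/\nhopsFromSeed: L\nfetchTimeMs: 80\n"

print(looks_like_seed(seed_payload))
print(looks_like_seed(linked_payload))
```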

WACZ defines an accompanying pages.jsonl file for entry points.
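Reading entry points from such a file could look roughly like this. The sketch assumes each line is a JSON object, that page entries carry a `url` key, and that any header line does not; the sample lines are made up and the exact layout should be checked against the WACZ specification.

```python
import json
from typing import Iterable, List


def entry_point_urls(lines: Iterable[str]) -> List[str]:
    """Collect 'url' values from JSON Lines, skipping lines without one."""
    urls = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if "url" in entry:  # assumed: header lines have no 'url' key
            urls.append(entry["url"])
    return urls


# Made-up sample content for illustration.
sample_lines = [
    '{"format": "json-pages-1.0", "id": "pages", "title": "All Pages"}',
    '{"url": "https://example.com/", "ts": "2024-01-01T00:00:00Z"}',
    '{"url": "https://example.com/about", "ts": "2024-01-01T00:00:05Z"}',
]

print(entry_point_urls(sample_lines))
```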

benoit74 commented 2 months ago

See https://github.com/webrecorder/browsertrix-crawler/issues/630 for feedback "from the trenches".