scrapfly / typescript-scrapfly

SDK for Scrapfly.io web scraping API
https://scrapfly.io/

Parsing a successful `ScrapeResult` fails if the response's content type is defined in the `Content-Type` header field instead of `content-type` #8

Closed fulopkovacs closed 4 weeks ago

fulopkovacs commented 4 weeks ago

The issue

Parsing a successful `ScrapeResult` fails if the response's content type is defined in the `Content-Type` header field instead of `content-type`.

The field names in HTTP/1.1 response headers are supposed to be case-insensitive:

"Each field line consists of a case-insensitive field name followed by a colon (":"), optional leading whitespace, the field line value, and optional trailing whitespace." (HTTP/1.1, RFC 9112, Section 5)

This is the source of the issue:

https://github.com/scrapfly/typescript-scrapfly/blob/a09d6b90266a4e75046f25f6f5e0360b285f3dd1/src/result.ts#L290
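
To illustrate the failure mode, here is a minimal sketch. It is not the SDK's actual code: `HeaderMap` and `getHeader` are hypothetical names I made up for this example. It shows why a direct lookup of the lowercase `content-type` key comes back `undefined` when the server sent `Content-Type`, and how a case-insensitive lookup avoids that:

```typescript
// Hypothetical helper for illustration only; not part of the Scrapfly SDK.
type HeaderMap = Record<string, string>;

// Look up a header by name, ignoring the case of the field name.
function getHeader(headers: HeaderMap, name: string): string | undefined {
  const wanted = name.toLowerCase();
  for (const [key, value] of Object.entries(headers)) {
    if (key.toLowerCase() === wanted) return value;
  }
  return undefined;
}

// Same casing as in the response that triggered the error.
const responseHeaders: HeaderMap = { "Content-Type": "text/html; charset=utf-8" };

// Case-sensitive property access misses the header at runtime,
// so calling `.includes(...)` on the result would throw a TypeError.
const direct = responseHeaders["content-type"];

// Case-insensitive lookup finds it regardless of how the server cased the name.
const found = getHeader(responseHeaders, "content-type");
```

A linear scan is fine here because header maps are small; lowercasing all keys once up front would be an equally reasonable choice.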

Error message

This is the error message I get (with Node, not Deno):

TypeError: Cannot read properties of undefined (reading 'includes')
        at get selector [as selector]

Steps to reproduce

The contents of the `api-response-bad.json` file (a shortened version of the response I obtained by scraping https://www.headleymedia.com/resources/your-guide-to-email-lead-nurturing with Scrapfly):

```json
{
  "context": {
    "asp": null,
    "bandwidth_consumed": 0,
    "bandwidth_images_consumed": 0,
    "cache": { "entry": null, "state": "MISS" },
    "cookies": [],
    "cost": {},
    "created_at": "2024-08-21 14:39:26.491852",
    "debug": null,
    "env": "LIVE",
    "fingerprint": "4499f87b6b0cbcc70e364752b39fefb8",
    "headers": {},
    "is_xml_http_request": false,
    "job": null,
    "lang": ["en"],
    "os": {},
    "project": "default",
    "proxy": {},
    "redirects": [],
    "retry": 0,
    "schedule": null,
    "session": null,
    "spider": null,
    "throttler": null,
    "uri": {
      "base_url": "https://www.headleymedia.com",
      "fragment": null,
      "host": "www.headleymedia.com",
      "params": null,
      "port": 443,
      "query": null,
      "root_domain": "headleymedia.com",
      "scheme": "https"
    },
    "url": "https://www.headleymedia.com/resources/your-guide-to-email-lead-nurturing",
    "webhook": null
  },
  "result": {
    "browser_data": {
      "javascript_evaluation_result": null,
      "js_scenario": null,
      "local_storage_data": {},
      "session_storage_data": {},
      "websockets": [],
      "xhr_call": []
    },
    "content": "\n\nhello world\n\n",
    "content_encoding": "utf-8",
    "content_format": "raw",
    "content_type": "text/html; charset=utf-8",
    "cookies": [],
    "data": null,
    "dns": null,
    "duration": 9.62,
    "error": null,
    "extracted_data": null,
    "format": "text",
    "iframes": [],
    "log_url": "https://scrapfly.io/dashboard/monitoring/log/01J5TP1NQ0071K6WY7HS6VM7TC",
    "reason": "OK",
    "request_headers": {
      "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
      "accept-encoding": "gzip, deflate, br, zstd",
      "accept-language": "en-US,en;q=0.9",
      "priority": "u=0, i",
      "sec-ch-ua": "\"Not)A;Brand\";v=\"99\", \"Google Chrome\";v=\"127\", \"Chromium\";v=\"127\"",
      "sec-ch-ua-mobile": "?0",
      "sec-ch-ua-platform": "\"Linux\"",
      "sec-fetch-dest": "document",
      "sec-fetch-mode": "navigate",
      "sec-fetch-site": "none",
      "sec-fetch-user": "?1",
      "upgrade-insecure-requests": "1",
      "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36"
    },
    "response_headers": {
      "Cache-Control": "private",
      "Connection": "keep-alive",
      "Content-Encoding": "gzip",
      "Content-Type": "text/html; charset=utf-8",
      "Date": "Wed, 21 Aug 2024 14:39:29 GMT",
      "Server": "nginx/1.18.0 (Ubuntu)",
      "Transfer-Encoding": "chunked"
    },
    "screenshots": {
      "debug": {
        "css_selector": null,
        "extension": "jpg",
        "format": "fullpage",
        "size": 556350,
        "url": "https://api.scrapfly.io/scrape/screenshot/01J5TP1NQ0071K6WY7HS6VM7TC/debug"
      }
    },
    "size": 0,
    "ssl": null,
    "status": "DONE",
    "status_code": 200,
    "success": true,
    "url": "https://www.headleymedia.com/resources/your-guide-to-email-lead-nurturing"
  }
}
```

```ts
// Import path assumed; adjust to how the SDK is installed (npm or JSR).
import { ScrapeResult } from 'scrapfly-sdk';

async function reproduce() {
  // see the contents of the `api-response-bad.json` file above
  const responseJsonSuccess = JSON.parse(await Deno.readTextFile('api-response-bad.json'));
  const result = new ScrapeResult(responseJsonSuccess);
  // the line below will throw an error (see above)
  const first_h1 = result.selector("h1").text();
  console.log({ first_h1 });
}
reproduce()
```

This issue currently breaks some of our features in production

I discovered this issue when I started investigating this mysterious error that keeps popping up, making one of our services that relies on Scrapfly randomly fail from time to time. For now I'll try to patch it in our code, but we'd be very grateful if this could be fixed soon. (Happy to submit a PR too, but not sure if you accept them.)
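
For anyone else hitting this before a fix lands, here is a rough sketch of the kind of patch I have in mind. It assumes the SDK looks up the lowercase `content-type` key in `result.response_headers`, as the error suggests. The `ApiResponse` type and `lowercaseResponseHeaders` helper are hypothetical names of mine, not part of the SDK. The idea is to lowercase the header names in the raw API response before handing it to `ScrapeResult`:

```typescript
// Hypothetical workaround; names below are not part of the Scrapfly SDK.
type ApiResponse = {
  result: { response_headers: Record<string, string>; [key: string]: unknown };
  [key: string]: unknown;
};

// Return a copy of the API response whose response header names are lowercased.
// The input object is left untouched.
function lowercaseResponseHeaders(response: ApiResponse): ApiResponse {
  const normalized: Record<string, string> = {};
  for (const [key, value] of Object.entries(response.result.response_headers)) {
    normalized[key.toLowerCase()] = value;
  }
  return { ...response, result: { ...response.result, response_headers: normalized } };
}

// Trimmed-down stand-in for the API response shown above.
const raw: ApiResponse = {
  result: {
    response_headers: {
      "Content-Type": "text/html; charset=utf-8",
      "Server": "nginx/1.18.0 (Ubuntu)",
    },
  },
};

const patched = lowercaseResponseHeaders(raw);
// `patched` (instead of `raw`) would then be passed to `new ScrapeResult(...)`.
```

If a response contained the same header twice with different casing, the later entry would win in this sketch, which is acceptable for `Content-Type` but worth keeping in mind.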

Granitosaurus commented 4 weeks ago

Hey @fulopkovacs thanks for the detailed report. You're correct that headers should be handled in a case-insensitive manner. Will try to replicate this and push a fix 👍

fulopkovacs commented 4 weeks ago

Woah, that was a super fast response! 🙌

Granitosaurus commented 4 weeks ago

hey @fulopkovacs I've released a fix in v0.6.5 which is available on NPM and JSR. Let me know if this bug still pops up somehow.

fulopkovacs commented 4 weeks ago

Tested locally, works like a charm! Thanks for the quick fix! ☺️