projectdiscovery / katana

A next-generation crawling and spidering framework.
MIT License

`-jsonl` option with `-headless` option results in error #611

Open ErikOwen opened 1 year ago

ErikOwen commented 1 year ago

katana version: v1.0.4

Current Behavior: Using the -jsonl and -headless options together for a katana crawl results in an error: [hybrid:RUNTIME] context deadline exceeded <- could not get dom.

Expected Behavior:

No error should occur, similar to running the same command without the -headless option.

Steps To Reproduce:

  1. Run a katana crawl against a target with the -jsonl and -headless options:
    > echo "https://projectdiscovery.io" | katana -silent -d 1 -jsonl -ob -or -headless | jq
    {
      "timestamp": "2023-10-01T10:47:56.072225-07:00",
      "request": {
        "method": "GET",
        "endpoint": "https://projectdiscovery.io"
      },
      "error": "[hybrid:RUNTIME] context deadline exceeded <- could not get dom"
    }
  2. Notice the error: [hybrid:RUNTIME] context deadline exceeded <- could not get dom
  3. Run the same command without the -headless option:
    > echo "https://projectdiscovery.io" | katana -silent -d 1 -jsonl -ob -or | jq
    {
      "timestamp": "2023-10-01T10:50:08.902251-07:00",
      "request": {
        "method": "GET",
        "endpoint": "https://projectdiscovery.io"
      },
      "response": {
        "status_code": 200,
        "headers": {
          "cache_control": "public, max-age=0, must-revalidate",
          "report_to": "{\"endpoints\":[{\"url\":\"https:\\/\\/a.nel.cloudflare.com\\/report\\/v3?s=0Kep5aODKo2qRkrQjl%2FTGOcspCMTVBMeLdtA6Gc4y9E5UkFkUW9QYffz4bUgV6TgXpxudfSsHDctxNKUH3%2B979xOe6kOqAHsxy3n4HDyPkXBK2zjDQAguH0ajFdYfUtZ9I74BUw%3D\"}],\"group\":\"cf-nel\",\"max_age\":604800}",
          "nel": "{\"success_fraction\":0,\"report_to\":\"cf-nel\",\"max_age\":604800}",
          "content_type": "text/html",
          "last_modified": "Sat, 30 Sep 2023 00:06:34 GMT",
          "server_timing": "region;desc=\"us-west-2\", cache;desc=\"cached\", fallback;desc=\"no-fallback\"",
          "vary": "Accept-Encoding",
          "cf_ray": "80f68bd47e26521a-LAX",
          "link": "<https://framerusercontent.com>; rel=\"preconnect\", <https://framerusercontent.com>; rel=\"preconnect\"; crossorigin=\"\"",
          "cf_cache_status": "DYNAMIC",
          "date": "Sun, 01 Oct 2023 17:50:08 GMT",
          "server": "cloudflare",
          "x_content_type_options": "nosniff",
          "strict_transport_security": "max-age=0; preload",
          "connection": "keep-alive"
        },
        "technologies": [
          "Cloudflare",
          "HSTS"
        ]
      }
    }
  4. Note that there are no errors when the -headless option is omitted.

Anything else:

This issue only occurs when crawling specific websites. I can consistently reproduce it when crawling https://projectdiscovery.io and https://www.discover.com, but I am unable to reproduce it when crawling other sites such as https://www.google.com.

ocervell commented 12 months ago

I ran into the same issue today.

RamanaReddy0M commented 6 months ago

@ErikOwen can you try the latest version (v1.0.5)? It seems to be working with the latest release.

ErikOwen commented 6 months ago

Hi @RamanaReddy0M, thank you for following up! I ran the same command to reproduce this error using the latest code in the dev branch. Some paths now show a successful response, while others still fail with the [hybrid:RUNTIME] context deadline exceeded <- could not get dom error. So it seems progress is being made, but the issue still persists.

Here is the output from when I ran the command: katana_logs.txt
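
For anyone triaging a similar mixed log, here is a minimal Python sketch that tallies which records failed and which succeeded. It assumes each JSONL record has the shape shown earlier in this issue: a "request" object with an "endpoint", plus either an "error" string or a "response" object. The function name `summarize` and the two sample records (abbreviated from the output above) are illustrative only and are not part of katana.

```python
import json
from collections import Counter


def summarize(lines):
    """Tally katana JSONL records into errors and successes.

    Assumption: a record with an "error" key failed; anything else
    that parses as a JSON object is counted as a success.
    """
    counts = Counter()
    failed = []  # (endpoint, error message) pairs
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            counts["unparsed"] += 1
            continue
        if not isinstance(record, dict):
            counts["unparsed"] += 1
            continue
        request = record.get("request") or {}
        endpoint = request.get("endpoint", "<unknown>")
        if "error" in record:
            counts["error"] += 1
            failed.append((endpoint, record["error"]))
        else:
            counts["ok"] += 1
    return counts, failed


# Two records abbreviated from the output in this issue.
sample = [
    '{"request": {"method": "GET", "endpoint": "https://projectdiscovery.io"}, '
    '"error": "[hybrid:RUNTIME] context deadline exceeded <- could not get dom"}',
    '{"request": {"method": "GET", "endpoint": "https://projectdiscovery.io"}, '
    '"response": {"status_code": 200}}',
]

counts, failed = summarize(sample)
print(dict(counts))
for endpoint, error in failed:
    print(f"{endpoint}\t{error}")
```

To run this against real output, pass any iterable of lines instead of `sample`, for example an open file handle for a saved log or `sys.stdin` when piping the crawl output into the script.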