internetstandards / Internet.nl

Internet standards compliance test suite
https://internet.nl

Large batch scan does not return report and never finishes #1395

stitch closed this issue 2 months ago

stitch commented 2 months ago

A scan with a large number of domains hangs on status 'generating': the report is never generated and no further information is available. Below are two examples. The first is the happy flow; the second is the flow where the batch never generates a report. Perhaps the task that creates this report can be restarted to see what happens.

Happy flow:

https://docker.batch.internet.nl/api/batch/v2/requests/f9bf210b247c4e66beee488abc1d24b8

{
  "request": {
    "name": "{\"source\": \"Web Security Map\", \"type\": \"web\"}",
    "submit_date": "2024-04-16T08:58:07.946993+00:00",
    "finished_date": null,
    "request_type": "web",
    "status": "running",
    "request_id": "f9bf210b247c4e66beee488abc1d24b8"
  },
  "api_version": "2.4.0"
}

https://docker.batch.internet.nl/api/batch/v2/requests/f9bf210b247c4e66beee488abc1d24b8

{
  "request": {
    "name": "{\"source\": \"Web Security Map\", \"type\": \"web\"}",
    "submit_date": "2024-04-16T08:58:07.946993+00:00",
    "finished_date": "2024-04-16T08:59:27.250105+00:00",
    "request_type": "web",
    "status": "generating",
    "request_id": "f9bf210b247c4e66beee488abc1d24b8"
  },
  "api_version": "2.4.0"
}

https://docker.batch.internet.nl/api/batch/v2/requests/f9bf210b247c4e66beee488abc1d24b8

{
  "request": {
    "name": "{\"source\": \"Web Security Map\", \"type\": \"web\"}",
    "submit_date": "2024-04-16T08:58:07.946993+00:00",
    "finished_date": "2024-04-16T08:59:27.250105+00:00",
    "request_type": "web",
    "status": "done",
    "request_id": "f9bf210b247c4e66beee488abc1d24b8"
  },
  "api_version": "2.4.0"
}

https://docker.batch.internet.nl/api/batch/v2/requests/f9bf210b247c4e66beee488abc1d24b8/results

{
  "api_version": "2.4.0",
  "request": {
    "name": "{\"source\": \"Web Security Map\", \"type\": \"web\"}",
    "submit_date": "2024-04-16T08:58:07.946993+00:00",
    "finished_date": "2024-04-16T08:59:27.250105+00:00",
    "request_type": "web",
    "status": "done",
    "request_id": "f9bf210b247c4e66beee488abc1d24b8"
  },
  "domains": {
    "asm.com": {
      "status": "ok",
      "report": {
        "url": "http://docker.batch.internet.nl/site/asm.com/31787/"
      },
      "scoring": {
        "percentage": 52
      },
      "results": {
        "categories": {
          "web_ipv6": {
            "verdict": "failed",
            "status": "failed"
          },
          "web_dnssec": {
            "verdict": "failed",
            "status": "failed"
...

Hanging flow:

25k scans: https://docker.batch.internet.nl/api/batch/v2/requests/c390d7b499d5411fa0b98f1b2cb68841

{
  "request": {
    "name": "{\"source\": \"Web Security Map\", \"type\": \"web\"}",
    "submit_date": "2024-04-12T16:54:46.368657+00:00",
    "finished_date": "2024-04-13T10:47:43.621597+00:00",
    "request_type": "web",
    "status": "generating",
    "request_id": "c390d7b499d5411fa0b98f1b2cb68841"
  },
  "api_version": "2.4.0"
}

https://docker.batch.internet.nl/api/batch/v2/requests/c390d7b499d5411fa0b98f1b2cb68841/results

{
  "error": {
    "label": "bad-request",
    "msg": "The request is not yet `done`."
  },
  "api_version": "2.4.0"
}
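
A client polling this endpoint has no way to tell a slow report from a stuck one, so it may help to put a deadline on the 'generating' phase. The following is a minimal sketch, not part of Internet.nl: the function names, the one-hour limit and the 30-second poll interval are assumptions, and only the status values seen above ('running', 'generating', 'done') are handled.

```python
import base64
import json
import time
import urllib.request

# URL taken from the examples above; HTTP basic auth is assumed.
BASE_URL = "https://docker.batch.internet.nl/api/batch/v2/requests"


def fetch_status(request_id, username, password):
    """Return the 'status' field of one batch request."""
    req = urllib.request.Request(f"{BASE_URL}/{request_id}")
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["request"]["status"]


def wait_until_done(get_status, max_generating_seconds=3600, poll_seconds=30,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll until 'done'; raise TimeoutError if 'generating' lasts too long.

    get_status is any zero-argument callable returning the current status.
    The limits are hypothetical defaults, not values from the project.
    """
    generating_since = None
    while True:
        status = get_status()
        if status == "done":
            return status
        if status == "generating":
            if generating_since is None:
                # First time we see 'generating': start the stopwatch.
                generating_since = clock()
            elif clock() - generating_since > max_generating_seconds:
                raise TimeoutError("report generation appears stuck")
        else:
            # Still 'running' (or another status): no deadline applies yet.
            generating_since = None
        sleep(poll_seconds)
```

It could be called as `wait_until_done(lambda: fetch_status(request_id, user, password))`, where `request_id`, `user` and `password` are supplied by the caller.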

bwbroersma commented 2 months ago

A smaller 10k list (actually 9951 domains) also gives this problem:

$ curl -sSfu "$AUTH" "https://docker.batch.internet.nl/api/batch/v2/requests" | jq
{
  "requests": [
    {
      "name": "{\"source\": \"Web Security Map\", \"type\": \"web\"}",
      "submit_date": "2024-04-16T10:47:12.551099+00:00",
      "finished_date": "2024-04-16T13:44:12.515913+00:00",
      "request_type": "web",
      "status": "generating",
      "request_id": "a13380072f6f491ca4051754d248bf52"
    }
  ],
  "api_version": "2.4.0"
}

A ~10k list is not a problem in the current non-Docker batch setup.
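
To spot such batches without checking each one by hand, the listing response above can be filtered for requests that finished scanning a while ago but are still 'generating'. This is a rough sketch: `find_stuck_requests` and the six-hour threshold are assumptions for illustration, and it only relies on the fields shown in the JSON above.

```python
from datetime import datetime, timedelta, timezone


def find_stuck_requests(listing, now, max_age=timedelta(hours=6)):
    """Return request_ids that are still 'generating' long after finished_date.

    listing is the parsed JSON of the /requests endpoint; max_age is a
    hypothetical threshold, not a value defined by the API.
    """
    stuck = []
    for request in listing["requests"]:
        if request["status"] != "generating" or request["finished_date"] is None:
            continue
        # finished_date is ISO 8601 with a UTC offset, e.g. ...+00:00
        finished = datetime.fromisoformat(request["finished_date"])
        if now - finished > max_age:
            stuck.append(request["request_id"])
    return stuck


listing = {
    "requests": [
        {
            "finished_date": "2024-04-16T13:44:12.515913+00:00",
            "status": "generating",
            "request_id": "a13380072f6f491ca4051754d248bf52",
        }
    ]
}
now = datetime(2024, 4, 17, 0, 0, tzinfo=timezone.utc)
print(find_stuck_requests(listing, now))
```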

mxsasha commented 2 months ago

@gthess I still need to dig into this, but before I do, any thoughts? Were there prior issues like this before my time?

gthess commented 2 months ago

On the non-Docker setups (and even before the current hoster) I don't remember there being a limit on the number of domains. Recommended limits were communicated when someone registered, but IIRC they were not hard-coded: I remember some pretty high values for domain lists that had been communicated in advance. Since this is stuck at the generating stage, my first guess would be memory or disk limits when generating the report.

mxsasha commented 2 months ago

I have been running some numbers, graphs and tests. Conclusions:

Guesstimated recommendation: set the worker memory limit to 12 GB and the maximum job size to 10K domains. Pending: see whether we can raise worker concurrency to 300 or 400 to make better use of the available CPU resources and improve job speed.
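
On the client side, a job size cap like this could be respected by splitting a long domain list before submitting it. A small sketch, assuming the 10K figure above as the default; `split_into_jobs` is a hypothetical helper and submitting each chunk to the batch API is left out.

```python
def split_into_jobs(domains, max_job_size=10_000):
    """Split a list of domains into consecutive chunks of at most max_job_size."""
    if max_job_size < 1:
        raise ValueError("max_job_size must be at least 1")
    return [
        domains[start:start + max_job_size]
        for start in range(0, len(domains), max_job_size)
    ]


# Example with placeholder domain names: a 25k list becomes three jobs.
domains = [f"example-{i}.nl" for i in range(25_000)]
jobs = split_into_jobs(domains)
print([len(job) for job in jobs])
```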

mxsasha commented 2 months ago

With additional testing from @bwbroersma, this seems fine now at:

WORKER_MEMORY_LIMIT=14G
WORKER_SLOW_MEMORY_LIMIT=14G
WORKER_CONCURRENCY=400

Raising worker concurrency to 400 increased CPU usage but, as expected, not memory usage.