isker / neogrok

Neogrok is a frontend for zoekt, a fast and scalable code search engine.
https://neogrok-demo-web.fly.dev
MIT License
28 stars 2 forks source link

Poor perf with a large number of matches #7

Closed isker closed 1 year ago

isker commented 1 year ago

When a query produces a large number of matches, perf is quite poor, both in the DOM and in handling the enormous HTTP response from zoekt.

I have tested out a few different mitigation strategies locally, but they were all either ineffective (e.g. lazily parsing the base64 content from zoekt into renderable data, such that we don't parse data that never reaches the DOM) or were only effective some times (e.g. debouncing live search more aggressively with shorter queries - we can only apply partially effective heuristics, like query length, to parameterize any kind of debouncing, and the subjective snappiness of the site is harmed in the process).

Ultimately the problem is that zoekt imposes this poor performance upon neogrok, and neogrok can't make up for it sufficiently in all cases.

There are two big problems:

  1. There is no facility in zoekt to put a hard limit on the number of matches returned; the various MaxMatchCount parameters are only evaluated with varying granularities that allow them to frequently be wildly exceeded; the most specific granularity seems to be on something like a per-document basis, such that a single file with thousands of matches will still blast your perf.
  2. Secondarily, the JSON serialization of zoekt's SearchResult Content is pointlessly wasteful. Zoekt returns binary content: when rendering JSON, golang serializes []bytes into base64 strings, and zoekt's match indices are also byte indices. (Bytes are an interesting choice, as zoekt categorically only indexes UTF-8 content; but JS strings aren't UTF-8 anyway, so we wouldn't necessarily be saved from this computation if the content and indices were provided in that encoding.) Converting this binary data into renderable text in browsers is quite expensive. Doing it on the nodejs backend is counterintuitively even slower, as the conversion pegs the CPU for a time and wrecks the server's ability to concurrently serve requests.

Sourcegraph mitigates these problems by putting their own golang server between the browser and zoekt. This server both puts hard limits on match counts (500 by default), and serves "rendered" chunks (where data is provided in much more immediately usable line-by-line strings) to the browser. This is (empirically) a fine approach, when the client of zoekt is itself written in go, but less performant clients are left wanting.

I will try to see about getting enhancements upstreamed to zoekt, at least for problem 1. I suspect that upstreaming a fix to problem 1 (by adding a MaxMatchDisplayCount or something like that) will be more tractable in all regards than upstreaming a fix to problem 2, and will probably eliminate problem 2 as a real factor for us anyway.

isker commented 1 year ago

Here's what a sourcegraph.com search result looks like (for a single file), as sent to the browser. This is a pretty ideal format that we have no access to as direct clients of zoekt.

  {
    "type": "content",
    "path": "ui/src/main/java/org/adoptopenjdk/jitwatch/util/UserInterfaceUtil.java",
    "pathMatches": [
      {
        "start": {
          "offset": 48,
          "line": 0,
          "column": 48
        },
        "end": {
          "offset": 65,
          "line": 0,
          "column": 65
        }
      }
    ],
    "repositoryID": 52556,
    "repository": "github.com/AdoptOpenJDK/jitwatch",
    "repoStars": 2896,
    "repoLastFetched": "2023-07-09T20:16:44.782207Z",
    "branches": [
      ""
    ],
    "commit": "53abd156464924de1ac68cb1e09afe60b0c474dc",
    "hunks": null,
    "chunkMatches": [
      {
        "content": "public final class UserInterfaceUtil",
        "contentStart": {
          "offset": 1109,
          "line": 35,
          "column": 0
        },
        "ranges": [
          {
            "start": {
              "offset": 1128,
              "line": 35,
              "column": 19
            },
            "end": {
              "offset": 1145,
              "line": 35,
              "column": 36
            }
          }
        ]
      },
      {
        "content": "\tprivate UserInterfaceUtil()",
        "contentStart": {
          "offset": 1819,
          "line": 58,
          "column": 0
        },
        "ranges": [
          {
            "start": {
              "offset": 1828,
              "line": 58,
              "column": 9
            },
            "end": {
              "offset": 1845,
              "line": 58,
              "column": 26
            }
          }
        ]
      },
      {
        "content": "\tprivate static final Logger logger = LoggerFactory.getLogger(UserInterfaceUtil.class);",
        "contentStart": {
          "offset": 1148,
          "line": 37,
          "column": 0
        },
        "ranges": [
          {
            "start": {
              "offset": 1210,
              "line": 37,
              "column": 62
            },
            "end": {
              "offset": 1227,
              "line": 37,
              "column": 79
            }
          }
        ]
      },
      {
        "content": "\t\tUserInterfaceUtil.RESOURCE_FACTORY.setResources(ResourceBundle.getBundle(UserInterfaceUtil.RESOURCE_NAME, locale));",
        "contentStart": {
          "offset": 2285,
          "line": 74,
          "column": 0
        },
        "ranges": [
          {
            "start": {
              "offset": 2287,
              "line": 74,
              "column": 2
            },
            "end": {
              "offset": 2304,
              "line": 74,
              "column": 19
            }
          },
          {
            "start": {
              "offset": 2360,
              "line": 74,
              "column": 75
            },
            "end": {
              "offset": 2377,
              "line": 74,
              "column": 92
            }
          }
        ]
      },
      {
        "content": "\t\tInputStream inputStream = UserInterfaceUtil.class.getResourceAsStream(path);",
        "contentStart": {
          "offset": 3857,
          "line": 132,
          "column": 0
        },
        "ranges": [
          {
            "start": {
              "offset": 3885,
              "line": 132,
              "column": 28
            },
            "end": {
              "offset": 3902,
              "line": 132,
              "column": 45
            }
          }
        ]
      },
      {
        "content": "\t\tImage image = UserInterfaceUtil.IMAGE_CAMERA;",
        "contentStart": {
          "offset": 4345,
          "line": 154,
          "column": 0
        },
        "ranges": [
          {
            "start": {
              "offset": 4361,
              "line": 154,
              "column": 16
            },
            "end": {
              "offset": 4378,
              "line": 154,
              "column": 33
            }
          }
        ]
      },
      {
        "content": "\t\tString styleSheet = UserInterfaceUtil.class.getResource(\"/style.css\").toExternalForm();",
        "contentStart": {
          "offset": 5848,
          "line": 204,
          "column": 0
        },
        "ranges": [
          {
            "start": {
              "offset": 5870,
              "line": 204,
              "column": 22
            },
            "end": {
              "offset": 5887,
              "line": 204,
              "column": 39
            }
          }
        ]
      }
    ]
  },
isker commented 1 year ago

I will try to see about getting enhancements upstreamed to zoekt, at least for problem 1.

https://github.com/sourcegraph/zoekt/pull/615

isker commented 1 year ago

Judging this to be fixed by https://github.com/isker/neogrok/commit/9f169fbdf63b23b22cc811898eb0cc4f7a5d8297. With gzip compression, response sizes are quite acceptable on the demo app.

I still might lower the default MaxMatchDisplayCount from 500, but it's probably not necessary.