sphinx-doc / sphinx

The Sphinx documentation generator
https://www.sphinx-doc.org/
Other
6.61k stars 2.13k forks source link

HTML search: make the search index immutable at load-time #13098

Open jayaddison opened 2 weeks ago

jayaddison commented 2 weeks ago

Is your feature request related to a problem? Please describe. When Sphinx projects are built as HTML, they generally include a search index file loaded by the included JavaScript search functionality.

Search is by-nature a feature that accepts user-controlled input. I have been considering this recently during #13096, and also as part of a related (not-yet-published) security report I've filed that fortunately I've subsequently concluded contains no actual vulnerability.

I think it could be worthwhile to make the contents of the HTML search index immutable and to remove references to the JavaScript Object prototype from them, with the objective of reducing the possibility that user-controlled input could affect the index and therefore the integrity/quality of results.

Describe the solution you'd like

Describe alternatives you've considered

Additional context I plan to offer a pull request to implement this feature alongside this feature request.

Also, I'm aware that other JavaScript on the page could potentially bypass a mechanism like this; and contrastingly, perhaps there are valid use-cases for modifying the search index at runtime. Despite those, I feel that it could be worth removing the prototypes and making the index read-only, but this could be debatable.

Edit: add a question-mark asking whether allowing Array prototype references to be retained is OK or not.

jayaddison commented 2 weeks ago

... Describe alternatives you've considered ...

  • The JSON.rawJSON function was also considered; this resolves the concerns with JSON.parse. However: it appears not to be widely-deployed yet (it is not available in Firefox or Safari, for example. It became available in Chrome 114 and Edge 114, both released ~Jun 2023).

Self-correction: JSON.rawJSON appears to be limited to primitive types; it cannot construct arrays or objects (in fact, this is highlighted by some bold text in the introduction on that page, too. I missed both of those previously).

jayaddison commented 2 weeks ago

... Describe alternatives you've considered ...

  • The JSON.rawJSON function was also considered; this resolves the concerns with JSON.parse. However: it appears not to be widely-deployed yet (it is not available in Firefox or Safari, for example. It became available in Chrome 114 and Edge 114, both released ~Jun 2023).

Self-correction: JSON.rawJSON appears to be limited to primitive types; it cannot construct arrays or objects (in fact, this is highlighted by some bold text in the introduction on that page, too. I missed both of those previously).

There is a tc39 (ECMAscript technical committee) proposal for a JSON.parseImmutable(...) function that appears relevant, though: https://github.com/tc39/proposal-json-parseimmutable

jayaddison commented 2 weeks ago

Remove prototypes from objects (with the exception of Array objects).

I'm having second thoughts about this exception case. Perhaps we should in fact remove the prototype from Array objects too (this would require some other small changes). I'll wait to gather feedback about that.

jayaddison commented 1 week ago

There is a tc39 (ECMAscript technical committee) proposal for a JSON.parseImmutable(...) function that appears relevant, though: https://github.com/tc39/proposal-json-parseimmutable

A challenge with this approach would be that a record-and-tuple based representation would no longer be a subset of JSON, so we'd require a parser to adapt our loading of a searchindex from file. I've raised a somewhat-related question/proposal at tc39/proposal-record-tuple#391 (tl;dr - would a single # character be enough to annote a many-object datastructure as immutable? if so, we could simply adjust our PREFIX value to Search.setIndex(#).

wlach commented 1 week ago

FWIW I'm skeptical about this. As you noted, any other JavaScript loaded on the page could still mess with the index (including overwriting it altogether with whatever), no matter how we "freeze" it. I don't really see the use case here?

jayaddison commented 1 week ago

Reducing the likelihood with which the search index content could be modified or provide access to other parts of the application, either accidentally or maliciously, is essentially what I'm requesting.

User input here necessarily follows code paths that access the datastructure -- and I believe that's safe as-currently-implemented -- but I'd have greater confidence in that if I knew (again, to a reasonable degree; I understand it's not guaranteed) that the search index contents were immutable and absent of prototype references.