
The Sphinx documentation generator
https://www.sphinx-doc.org/

HTML search: represent indices using JavaScript Map instead of object literals #13097

Open jayaddison opened 2 weeks ago

jayaddison commented 2 weeks ago


Edit: update description to reflect JavaScript Map-based approach.

AA-Turner commented 1 week ago

Would this work as an alternative?

    // in Search.setIndex()
    const frozen_index = JSON.parse(index, (key, value) => {
      if (typeof value === "object" && value !== null) {
        Object.freeze(value);
      }
      return value;  // the reviver must return the value to keep it
    });

If we change Search.setIndex() to take a JSON string, it seems we could use the 'reviver' function to freeze arrays and objects?

A

jayaddison commented 1 week ago

@AA-Turner although I'd recommend some performance testing, that could be useful for #13098, yep.

jayaddison commented 1 week ago

In fact, no, my mistake: JSON.parse may indeed be necessary to solve this robustly. The test appears to pass only because __proto__ happens to appear in the objects part of the searchindex.js file; the terms entry for __proto__ is not being handled correctly.
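To illustrate the hazard under discussion, here is a small sketch of my own (not Sphinx code): in an object literal, a "__proto__" key is treated as a prototype setter rather than a data property, whereas JSON.parse and Map both treat it as an ordinary string key.

```javascript
// Sketch (my illustration, not Sphinx code) of why "__proto__" misbehaves
// as a key in an object literal, and why JSON.parse or Map handles it.

// In an object literal, a "__proto__" key sets the prototype instead of
// creating a data property; a non-object value like 0 is silently ignored.
const literalTerms = { "__proto__": 0 };
console.log(Object.keys(literalTerms).length);  // 0 -- no own property

// JSON.parse defines "__proto__" as an ordinary own data property.
const parsedTerms = JSON.parse('{"__proto__": 0}');
console.log(parsedTerms["__proto__"]);  // 0

// A Map treats "__proto__" as just another string key.
const mapTerms = new Map([["__proto__", 0]]);
console.log(mapTerms.get("__proto__"));  // 0
```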

jayaddison commented 1 week ago

I've pushed the change to use JSON.parse and reviver -- a note of caution, though: when I was testing that against a JSON-ized version of the Python documentation searchindex.js file yesterday, I was encountering ~200ms duration within the setIndex function in my browser.
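For anyone wanting to reproduce that kind of measurement, here is a self-contained sketch of my own (the index shape and entry count are invented; real timings depend on the actual searchindex.js content and the browser):

```javascript
// Self-contained timing sketch (not Sphinx code): parse a large JSON
// index with a freezing reviver, as discussed above, and time it.
const terms = {};
for (let i = 0; i < 100000; i++) terms[`term${i}`] = [i % 7];
const raw = JSON.stringify({ terms });

const t0 = performance.now();
const index = JSON.parse(raw, (key, value) => {
  if (typeof value === "object" && value !== null) Object.freeze(value);
  return value;  // the reviver must return the value to keep it
});
console.log(`parse+freeze: ${(performance.now() - t0).toFixed(1)}ms`);
console.log(Object.isFrozen(index.terms));  // true
```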

wlach commented 1 week ago

Personally, I'd be inclined to merge the previous version of this and address that (if at all) in a followup. Trying to "freeze" the index here feels like scope creep to me and it would be nice to get this fix in.

jayaddison commented 1 week ago

That's reasonable, yep, and I've no strong opinions on whether we include or omit the Object.freeze aspect. However, without 462c8595937f2034f86322c8650e3f76f81de905, I don't believe this branch would genuinely fix the reported bug.

jayaddison commented 1 week ago

Without the reviver function, I recall the setIndex + JSON.parse performance overhead being on the order of ~60ms on my local machine. I'll confirm those figures in more detail within the next day or so. I suspect some of this will be coupled to browser/implementation details (Firefox), so I may also try comparing against Chrom[e|ium].

wlach commented 1 week ago

> In fact, no, my mistake: JSON.parse may indeed be necessary to solve this robustly. The test appears to pass only because __proto__ happens to appear in the objects part of the searchindex.js file; the terms entry for __proto__ is not being handled correctly.

Sorry, I missed this part. Do you know why this is the case? The solution looked correct to me?

jayaddison commented 1 week ago

> > In fact, no, my mistake: JSON.parse may indeed be necessary to solve this robustly. The test appears to pass only because __proto__ happens to appear in the objects part of the searchindex.js file; the terms entry for __proto__ is not being handled correctly.
>
> Sorry, I missed this part. Do you know why this is the case? The solution looked correct to me?

The test case in 0fff170a5193741f53b56c0304e1864b754fa81a was inaccurate -- it demonstrated that an object Object.__proto__ could be found based on a user query for __proto__.

However, there was still a bug: a document text term __proto__ -- represented by the "terms":{"__proto__":0 fragment of the searchindex.js -- was not searchable as intended.

jayaddison commented 5 days ago

> Without the reviver function, I recall the setIndex + JSON.parse performance overhead being on the order of ~60ms on my local machine. I'll confirm those figures in more detail within the next day or so. I suspect some of this will be coupled to browser/implementation details (Firefox), so I may also try comparing against Chrom[e|ium].

I haven't gotten around to this comprehensive performance testing yet, but want to share some notes:

Initial results

Initially the native Map approach (up to fae81a0a92e81618c8fc70b2db1f3f93bc0dccaa) seems similar in runtime performance to the existing/baseline approach (object literals), based on my testing using Sphinx's self-built documentation.

The page-load searchindex-evaluation time on my machine for a query for phin (~700 results) is ~30ms in Firefox, compared to a ~20ms baseline. Those are both relatively small compared to a ~0.5s duration for display/rendering of the search results, but arguably that is a pessimal query (a more-precise match would reduce the display duration).

There is a small amount of index size bloat -- not so much from the new Map( and ) text overhead, but more from the additional Array literals (e.g. square brackets) used to format their entries instead of object literals. This is approximately 5% overhead on the test dataset; 541857 bytes compared to 516691 baseline.
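The source of the per-entry overhead can be illustrated with a toy serialization (a sketch of mine; Sphinx's real serialization happens in Python, and these sizes are synthetic):

```javascript
// Sketch (mine) of where the size overhead comes from: the same mapping
// rendered as an object literal vs as a `new Map(...)` expression.
const entries = Array.from({ length: 1000 }, (_, i) => [`term${i}`, i]);

const objectLiteral = JSON.stringify(Object.fromEntries(entries));
// The Map form adds "new Map(" / ")" once, plus two brackets per entry.
const mapExpression = `new Map(${JSON.stringify(entries)})`;

console.log(objectLiteral.length, mapExpression.length);
console.log(mapExpression.length > objectLiteral.length);  // true
```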

Observations so far

There seem to be multiple factors to balance: Python complexity, JavaScript complexity, code safety (both data-structure immutability and prototype access), runtime performance, and index file size.

I'm not sure how much to explore or how to present my thoughts/findings. Generally my feeling is that the Map approach here (again, up to fae81a0a92e81618c8fc70b2db1f3f93bc0dccaa) seems good on most criteria except for Python complexity (particularly the unit tests, but also the application serialization code is more complex than it was) and immutability (which, in my opinion, it does not address). The TC39 Records-and-Tuples proposal would, I think, be near-optimal on all factors -- but we could be years away from that reaching widespread adoption.

jayaddison commented 2 days ago

> There is a small amount of index size bloat -- not so much from the new Map( and ) text overhead, but more from the additional Array literals (e.g. square brackets) used to format their entries instead of object literals. This is approximately 5% overhead on the test dataset; 541857 bytes compared to 516691 baseline.

My guess is that most of this difference (which is relatively small) would disappear with compression.

I should have thought to evaluate that at the same time -- and from a quick check with brotli, indeed, the difference becomes fairly insignificant (102833 vs 102137, less than 1%). Perhaps it helps compression that the character set becomes slightly more predictable.