FEATURE REQUEST: JS Literal Sanitising

pbower commented 10 months ago

In the linked documentation (https://core-docs.highchartspython.com/en/latest/quickstart.html), the recommended approach for sending data from the server to the client is a JS literal. However, I believe this has the potential for malicious arbitrary code injection.

Is there a recommended way around this? If not, is there a feature that could be developed within the Highcharts front-end library to sanitise the received JS literal for consistency with the API?

Benefits of this feature could include:

- Developers do not each need to create their own sanitisation based on the API (which is subject to change).
- Improved security of the library.
- Avoid overhead of using JSON in its place (converting to/from JSON).

Thanks heaps

PB

hcpchris commented 10 months ago

Hi @pbower -

Thanks for the feature request! Let me share some recommended ways to handle this first, then I'll share my (preliminary) thoughts on the feature request.

First, you are absolutely correct that passing the JS literal string from your Python environment to your JS environment creates some level of risk. The amount of risk depends on your application, and your application's overall architecture.

For example, if you do not allow your application's users to create custom callback/formatter functions, then the JS literal string will not contain custom/potentially dangerous code. This is generally a best practice: it's one thing for you, the application developer, to construct JavaScript code that will be executed; it's something else entirely to let a random user do so. So my first recommendation would be to consider whether you need to give your users that level of power/control.
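One way to picture "not letting users write callbacks" is to let them pick from a fixed set of formatters that you wrote yourself. The sketch below is only an illustration: `PREDEFINED_FORMATTERS` and `formatter_source` are hypothetical names, not part of Highcharts for Python, and the JS snippets are example formatters rather than anything the library ships.

```python
# Hypothetical, developer-authored formatter snippets. Users never supply
# JavaScript themselves; they only choose one of these keys.
PREDEFINED_FORMATTERS = {
    "percent": "function () { return this.y + '%'; }",
    "two_decimals": "function () { return this.y.toFixed(2); }",
}


def formatter_source(choice):
    """Return the JS source for a user-selected formatter name.

    Raises ValueError for anything outside the predefined set, so raw user
    input never ends up in the chart configuration as code.
    """
    if not isinstance(choice, str) or choice not in PREDEFINED_FORMATTERS:
        raise ValueError(f"unknown formatter: {choice!r}")
    return PREDEFINED_FORMATTERS[choice]
```

With this shape, the only thing that travels from the user to your server is a short key, and any attempt to pass actual JavaScript is rejected before the chart configuration is built.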

Second, if you do not have user-generated callback/formatter functions in your chart configuration, then all other properties are generally sanitized through Highcharts for Python's type validation. This means that numbers are serialized to JS literal numbers, Python bool values to Booleans, strings to strings, etc., which in turn means they should not execute as code when evaluated in your JS environment.
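To make the idea concrete, here is a minimal sketch of type-checked serialization. It is not Highcharts for Python's actual implementation; `sketch_js_literal` is a hypothetical stand-in that leans on the standard library's `json.dumps` for quoting and escaping, and it refuses anything that is not plain data.

```python
import json

# Scalar types that map directly onto JS literals.
SCALAR_TYPES = (bool, int, float, str, type(None))


def sketch_js_literal(value):
    """Serialize plain Python data to a JS literal string (illustrative only)."""
    if isinstance(value, dict):
        if not all(isinstance(key, str) for key in value):
            raise TypeError("object keys must be strings")
        items = ", ".join(
            f"{json.dumps(key)}: {sketch_js_literal(item)}"
            for key, item in value.items()
        )
        return "{" + items + "}"
    if isinstance(value, (list, tuple)):
        return "[" + ", ".join(sketch_js_literal(item) for item in value) + "]"
    if isinstance(value, SCALAR_TYPES):
        # json.dumps quotes and escapes strings, so a hostile string stays a
        # string literal instead of becoming executable code.
        return json.dumps(value)
    raise TypeError(f"unsupported type: {type(value).__name__}")
```

A string such as `"); alert(1); ("` comes out as a single quoted, escaped string literal, while a callable or other unexpected object raises `TypeError` instead of being serialized.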

Now, given all of this, there are a couple of different strategies to further mitigate risk. Which strategy makes sense for you really depends a lot on your application and architecture:

If you are using a template-driven Python web framework, like Flask or Django, you can include your JS literal string in your template file, and isolate out the elements that change based on user choices. Since you control your template, you control how much risk you expose.
If your Python code exposes a RESTful or GraphQL API that your JS front-end consumes, then you'll be returning your JS literal string as a string. This means you will need to evaluate it within your JS context. You should do that with JavaScript's new Function constructor rather than eval, which is strongly discouraged and very dangerous. Keep in mind that new Function only keeps the evaluated code away from your local scope; the code can still reach global objects, so treat it as a way of limiting exposure rather than a complete sandbox. Here, you have full control over your security. You can apply whitelisting if you need it, you can tighten or loosen the restrictions around the new Function call as needed, and you can even provide M2M encryption of your JS literal string in transit to better protect against man-in-the-middle attacks. Again, which techniques make sense will depend a lot on your use case.
When evaluating the JS literal string in your JS context, you can also apply one or more JS sanitization libraries, such as Google Caja or Sanitize.js.
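For the template-driven case described above, one common precaution is to escape the characters that could close an inline script block before the literal is written into the page. The following is a rough sketch under that assumption; `literal_for_inline_script` and `render_page` are hypothetical helpers, it only handles plain data (no callback functions), and a real Flask or Django app would do the final rendering through its template engine.

```python
import json


def literal_for_inline_script(data):
    """Serialize plain data so it is safe to place inside an inline <script>."""
    # json.dumps escapes quotes and, with the default ensure_ascii=True,
    # all non-ASCII characters as well.
    text = json.dumps(data)
    # Escape the characters that could end the script element or start HTML.
    # They only ever occur inside string literals in JSON output, where the
    # \uXXXX form is equivalent.
    return (
        text.replace("&", "\\u0026")
        .replace("<", "\\u003c")
        .replace(">", "\\u003e")
    )


def render_page(options):
    """Build a tiny page fragment; stands in for a real template."""
    literal = literal_for_inline_script(options)
    return f"<script>const chartOptions = {literal};</script>"
```

A title like `</script><script>alert(1)</script>` then reaches the browser as ordinary string data, because the only closing script tag in the fragment is the one written by the template itself.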

So that's it in terms of recommendations on this front. Now, here are some of my preliminary thoughts on building JS sanitization into Highcharts for Python:

First, this is something that I gave some thought to when initially designing HCP, and it's something I come back to and reconsider regularly. I agree that having sanitization supported in Highcharts for Python would be a net good, applying the basic principle of being "batteries included".

However, that positive needs to be balanced against the effort, complexity, and risk inherent to providing those capabilities.

In the JS ecosystem, there are a bunch of very sophisticated libraries available that provide good sanitization capabilities (Google's Caja comes to mind). However, that is JS within a JS context, and so does not help us. For one thing, the JS literal string still has to be evaluated in JS, and for another it adds an additional (possibly problematic) JS dependency. So a better approach is to sanitize when producing the JS literal string in Python.

But I have not been able to find a "JS sanitizing library in Python" that we could use. That's not really surprising, because producing JS code programmatically within Python is a pretty niche use case.

So to do that kind of pre-serialization sanitization, we'd have to write a sanitizer function that applies one or more sanitizing strategies against the JS literal (or its component parts).

But sanitizing code programmatically is non-trivial in its complexity. Should we blacklist? Whitelist? Enforce encapsulation / sandboxing? Each choice has trade-offs, and those trade-offs will have different downstream implications for different users of Highcharts for Python. A strategy that works for Customer A might be a deal-breaker for Customer B due to the differences in their use cases. Offering configurability of our putative sanitization logic only increases the complexity.
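To illustrate why even the "simple" blacklist option is harder than it looks, here is a deliberately naive sketch. The `BLACKLIST` pattern and `passes_blacklist` function are hypothetical and not something Highcharts for Python provides; the point is only to show the trade-off.

```python
import re

# A naive blacklist of identifiers that look dangerous in callback code.
BLACKLIST = re.compile(r"\b(eval|Function|document|window|fetch)\b")


def passes_blacklist(js_source):
    """Return True if none of the blacklisted identifiers appear."""
    return BLACKLIST.search(js_source) is None


# Caught: uses a blacklisted identifier directly.
obvious = "function () { eval('alert(1)'); }"

# Missed: reaches the Function constructor indirectly and assembles the
# payload from string fragments, so no blacklisted word ever appears.
evasive = "function () { return this.constructor.constructor('ale' + 'rt(1)')(); }"

# Wrongly rejected: harmless code that merely mentions a blacklisted word.
benign = "function () { return 'window: ' + this.x; }"
```

Here `passes_blacklist(obvious)` is false, `passes_blacklist(evasive)` is true, and `passes_blacklist(benign)` is false, so the check both lets an attack through and blocks legitimate code. A whitelist avoids the first problem but restricts what users can express, which is exactly the kind of trade-off that differs from one application to the next.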

And with all of that complexity comes a maintenance burden, made even more significant because now we find ourselves in an arms race with bad actors. And by making the claim that we provide sanitized JS, we may lull our users (many of whom are data scientists, students, or relatively inexperienced developers) into a false sense of security, which increases their risk profile.

So given the above, for the time being we have opted not to implement any JS literal sanitization, in the belief that our customers should be managing their own security measures, and that we don't want to force them into strategies (or dependencies) that won't work for them.

But like I said, it's a good suggestion, and one that I've wrestled with throughout Highcharts for Python's development. There are some other options I've considered as well, like adding an option to use generative AI models to evaluate the riskiness of the callback functions/formatters used (similar to how we now allow usage of generative AI to convert Python callables to JS functions).

So it's definitely worth considering, and we're very open to your (and other users') thoughts on the subject.

hcpchris commented 10 months ago

Ah - and I just re-read the original issue, and realized you had a different suggestion that may also be worth considering: adding functionality to the Highcharts Core (JS) library that accepts the Highcharts for Python-produced JS literal string and then sanitizes it.

Sorry - my brain always focuses on Highcharts for Python first, rather than the Highcharts (JS) library itself.

That also may be worth considering. It has some (but not all) of the same trade-offs outlined above, but it is definitely something worth discussing with the Highcharts (JS) team.

highcharts-for-python / highcharts-core

FEATURE REQUEST: JS Literal Sanitising #121