w3c / machine-learning-workshop

Site of W3C Workshop on Web & Machine Learning
https://www.w3.org/2020/06/machine-learning-workshop/

Protecting ML models #67

Open dontcallmedom opened 3 years ago

dontcallmedom commented 3 years ago

@jasonmayes highlights the need of some ML providers to ensure that their ML models cannot be extracted from a browser app.

This need is similar to one raised by some media providers, which led, not without controversy, to the definition of Encrypted Media Extensions.

A similar need was also expressed at the Games workshop for 3D assets last year - @tidoust, was there any conclusion there that may apply here as well?

It would be useful to understand exactly what level of protection would be needed for what use cases, since these types of considerations are known to be both technically challenging and at odds with the role of the browser acting on behalf of the end-user.

tidoust commented 3 years ago

> A similar need was also expressed at the Games workshop for 3D assets last year - @tidoust, was there any conclusion there that may apply here as well?

No definitive conclusion, but this was deemed difficult in the case of 3D assets, at least from a technical perspective: applications need to manipulate 3D assets in a number of ways for rendering, and it seems hard to split the code that touches these assets into a separate sandbox. It was also noted that 3D assets are not protected per se in native applications. The issue is particularly relevant in Web scenarios, where it is easier to extract them, simply by going through dev tools. This may suggest a mode where content is not protected per se but "hidden" from users, which could perhaps also apply to ML models. That would be at odds with the usual ability to copy and paste web content, though. For 3D assets, some hacks (such as splitting assets into pieces and re-assembling the pieces on the fly) can be used to make extraction more challenging.
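As a minimal sketch of that "split and re-assemble" hack applied to a binary asset (the chunk URLs and function name below are made up), the idea is only to raise the bar for casual extraction, not to offer real protection:

```js
// Fetch the pieces of a deliberately split binary asset and re-assemble them
// in memory on the fly. Chunk URLs are hypothetical; this is obfuscation,
// not protection.
async function loadSplitAsset(chunkUrls) {
  const buffers = await Promise.all(
    chunkUrls.map(url => fetch(url).then(res => res.arrayBuffer()))
  );

  // Concatenate the pieces back into a single contiguous buffer.
  const total = buffers.reduce((n, b) => n + b.byteLength, 0);
  const joined = new Uint8Array(total);
  let offset = 0;
  for (const buf of buffers) {
    joined.set(new Uint8Array(buf), offset);
    offset += buf.byteLength;
  }
  return joined.buffer;
}

// Usage (hypothetical file names):
// const assetBuffer = await loadSplitAsset(['part0.bin', 'part1.bin', 'part2.bin']);
```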

jasonmayes commented 3 years ago

Indeed my thoughts at least were along the lines of:

  1. The server generates a script tag whose src contains a nonce so that it can only be used once (a server implementation detail, not part of the front-end spec). Such a tag would be written into the HTML of the rendered page like this:

<script src="myMachineLearning.js?nonce=somestring" secureExecution="true"></script>

Note the secureExecution attribute (or whatever it would end up being called), indicating that the browser should retrieve this special one-time link, download the script, and store it in private memory for execution away from devtools and the web page's JS scope.

  2. The browser downloads and executes said script and any further resources it requests, all privately behind the scenes. HTTPS would be used for any further asset downloads by the script to ensure encryption, and those requests would also not be revealed to devtools networking panels etc. That said, memory usage, processor usage and so on could still be exposed to such environments for the sake of debugging, and any console.logs in such code would still print to the console if the developer chooses.

  3. Regular client-side JS could listen for when new secure scripts are loaded and ready to interact with. It could then set up a comms channel, much like web workers have today, to call remote functions and get results back (a page-side sketch of such a channel follows this list).

  4. One big caveat to be aware of is that for ML we often need WASM / WebGL / WebGPU etc. for performance reasons, so whatever environment this sandbox runs in should support these technologies so the ML can run as fast as possible. Currently web workers do not support all of these across all browsers, which is one reason we are unable to fully utilise web workers at present.
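To make point 3 more concrete, here is a purely illustrative page-side sketch: the secureload/message events and the postMessage call on the script element are hypothetical and do not exist in any browser today.

```js
// Hypothetical page-side API for talking to a script running in a protected
// context. Neither the 'secureload' event nor postMessage on a script element
// exists today; this only sketches the worker-like channel described above.
const secureScript = document.querySelector('script[secureExecution]');

secureScript.addEventListener('secureload', () => {
  // Send a request into the protected context, much like posting to a worker.
  secureScript.postMessage({ type: 'classify', pixels: getInputPixels() });
});

secureScript.addEventListener('message', (event) => {
  // Only the returned result crosses the boundary and is inspectable here;
  // the model, weights and pre/post-processing stay inside the sandbox.
  console.log('Prediction from protected context:', event.data);
});

// Hypothetical helper for the example: grab input data from a canvas.
function getInputPixels() {
  const canvas = document.querySelector('canvas');
  return canvas.getContext('2d').getImageData(0, 0, canvas.width, canvas.height);
}
```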

These are all just ideas I am riffing on right now, and I welcome further discussion from others; I am brain-dumping some initial thoughts, so this is by no means a final proposition :-) I would love to hear your thoughts on this topic though, as we get requests for this very frequently from users of TFJS, especially businesses.

Adding @pyu10055 @dsmilkov @nsthorat @annxingyuan @tafsiri @lina128 for any thoughts related to this topic too (TF.js team) and for visibility.

dontcallmedom commented 3 years ago

thank you for bringing in your more detailed thoughts.

I think your point 4 would be worthy of a separate dedicated issue (since it is overall independent of the question of protecting models) - would you mind creating one?

A potential risk of bringing in this kind of black box is that we already know that e.g. WebAssembly is being abused for cryptomining, and making this harder to detect or debug would likely not be an improvement. This I guess links to some of the discussion in #72.

Stepping back a bit from specific proposals, how is this being managed in the native space? Are there clear requirements on the level of protection that would be sought? (I assume that not all model providers have the same requirements, but a somewhat clearer picture of which kinds of model providers have which kinds of requirements would help give a sense of the needs in this space.)

dontcallmedom commented 3 years ago

some discussion on this in the ML Loader API explainer /cc @jbingham

jasonmayes commented 3 years ago

In essence, for TensorFlow.js at least, when you make an ML system you have:

  1. Preprocessing - the act of taking some data and turning it into Tensors, because ML models only understand numerical data. Sometimes this preprocessing is not trivial and can represent significant work that one may want to protect.

  2. The ML model, which contains the weights and essentially takes the numbers from step 1 and flows them through all the mathematical ops in the model graph to produce some tensor output (more numbers). For TensorFlow.js this is the model.json plus the .bin files.

  3. Postprocessing - taking the output (again, a bunch of numbers) from the model, we usually have to do some work to make it useful to real users, and this can be non-trivial and quite often would want to be protected too.

For this reason it would probably be desirable to execute all 3 parts in a secure context, which is why I was suggesting running JS in its own environment rather than just keeping the model data there. That way it is flexible: ML devs can reveal as much or as little as they are comfortable with. It also means that, since we are talking about JS, this could apply more generically to other areas too - e.g. authentication data, API keys etc. for other JS use cases, which could be kept securely away from the inspectable code of the web app.
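For readers less familiar with TF.js, here is a minimal sketch of those three stages for a hypothetical image classifier (the model URL, input size and labels are made up); each stage is the kind of code one might want to keep inside a protected context.

```js
import * as tf from '@tensorflow/tfjs';

const LABELS = ['cat', 'dog', 'other'];  // hypothetical label set

// 1. Preprocessing: turn raw pixels into a normalized input tensor.
function preprocess(imgElement) {
  return tf.tidy(() =>
    tf.browser.fromPixels(imgElement)
      .resizeBilinear([224, 224])  // hypothetical input size
      .toFloat()
      .div(255)
      .expandDims(0));
}

// 2. The model itself: the weights (model.json + .bin shards) plus the op graph.
async function loadModel() {
  return tf.loadGraphModel('https://example.com/model/model.json');  // hypothetical URL
}

// 3. Postprocessing: map the output numbers back to something meaningful.
async function postprocess(outputTensor) {
  const scores = await outputTensor.data();
  const best = scores.indexOf(Math.max(...scores));
  return { label: LABELS[best], score: scores[best] };
}

async function classify(imgElement) {
  const model = await loadModel();
  const input = preprocess(imgElement);
  const output = model.predict(input);
  const result = await postprocess(output);
  tf.dispose([input, output]);
  return result;
}
```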

pyu10055 commented 3 years ago

This could be similar to a content protection standard like DRM for video; the difference is that it needs a protected execution context (PEC):

From a developer's point of view:

tidoust commented 3 years ago

Reflecting on yesterday's live session discussion, and on top of other considerations raised already, I think it would be useful to clarify how the lack of content protection would affect ML businesses, especially for people outside of ML circles (like me) who may struggle to evaluate the impacts.

For instance, one possible way to present some of the media dimensions around content protection could be: movies may cost $300 million to produce. Most people view a movie only once. A substantial portion of the income comes from the first few days/weeks of distribution, so an early leak has a huge impact. Also, the companies that distribute media content may not be the ones that produce it, and distributors are required per contract to protect content. There are most likely other dimensions to consider.

Back to ML models, some possible questions:

wseltzer commented 3 years ago

As I noted in the discussion session, we need to consider serious privacy and security concerns here. If end-users are being asked to download and run code they can't inspect, and permit it to send data they can't control, both open wide gaps in privacy and security expectations. The scope -- and hence the potential harms -- seem much less constrained than was the case for EME CDMs and encrypted video.

jasonmayes commented 3 years ago

My thoughts:

If one is gathering custom data for an ML model, it is certainly not unheard of for costs to reach hundreds of thousands of dollars to buy the time of many humans to do very niche tasks (especially if complex) and repeat them enough times to get a suitable quantity of high-quality data.

Of course you could use something like Amazon Mechanical Turk for some tasks, but whilst this is cheaper, the quality of the results coming back may be lower, so you end up needing more repetition to weed out incorrect labelling plus a lot of data sanitization, and it sort of balances out. Obviously this can fluctuate a lot depending on the complexity of the task at hand; some tasks may be much cheaper, especially if existing datasets exist or the task is easy to understand and fast to do. This cost is just for the data collection, however.

You must then add to this the cost of training, which can take weeks if you have terabytes of data from that collection effort running on many cloud servers concurrently; the cost of hiring ML engineers to design and build the model (each developer is probably on a 6-figure salary in the main cities, and there may be several working on one project); and the ongoing cost of optimizing and refining to improve and iterate. I could see the end cost of a very robust and niche model easily hitting the millions, depending on the size of the task and the complexity of the model, which is why companies are so protective of such models.

Many production use cases for businesses right now are accessed via a cloud API, but this is mainly because model security cannot be guaranteed in the browser, so a locked-down remote API is the only option for a business to give access to its model. Of course you then lose the benefits of executing client side: offline inference, privacy, lower latency, and potentially lower cost too, since less server usage is needed - just a CDN to deliver the model, versus all the GPU/CPU/RAM you would otherwise need to run inference.

I think leaking a model is much like leaking a movie. If you have it, you can certainly distribute it, and the original owner has no control if it ends up on BitTorrent or whatever service it gets shared on, to then be downloaded by thousands of others. Of course, if the model could somehow be traced and verified as an illegal copy, legal action could be taken, but that is a slow and costly process in itself to prove, which is probably enough to put off smaller/mid-sized companies from getting into the situation in the first place; not to mention the legal system has not really caught up with the finer points of the ML industry yet either.

With regard to security, there may be some middle ground where only certain things are processed this way - e.g. the sandboxed environment has no networking ability and some features of JS are disabled, so anything that could ultimately be sent is still inspectable: whatever is passed back from this black box is inspectable, and from inside the black box the only way out is by returning the result, which the client side can then inspect after it has run through all the weights of the model and been transformed. Obviously this needs more thought as to what such a sandboxed environment would look like and whether it would still be useful enough for the various operations required by models today. Maybe @pyu10055 can chime in on what is needed to execute a model these days at the lower level, as I am not working at that level right now, and whether any of those ops would need networking ability or any features that could be deemed insecure?

One of the key points of bringing this to the client side is that data does not need to be sent to a server from device sensors, which actually increases privacy for the end user; and if anything is sent over the network, that would happen outside the sandbox environment, so it would be inspectable as normal? I may have missed something here though, so feel free to let me know, as it is rather late here :-)

dontcallmedom commented 3 years ago

thanks @jasonmayes for helping quantify the issue!

During the live session, we discussed the potential usage of "split" models, where the latency/privacy-sensitive inferencing would be done on the client, while some of the IPR-critical pieces might be kept on the server. @pyu10055 suggested there had been work in that direction - could you share relevant pointers in this space? (In my superficial exploration, I've seen work on distributed/split learning, not so much on distributed/split inferencing.)
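As a rough illustration of what split inferencing could look like with TF.js today (no new platform support assumed), here is a sketch that presumes the model has been manually partitioned into a public client half and a server-held half; the URLs and the /infer endpoint are hypothetical.

```js
// Sketch of "split" inference: a public first half runs on the client, and
// only its intermediate activations are sent to a server holding the
// IPR-critical second half. Model URLs and the /infer endpoint are made up.
import * as tf from '@tensorflow/tfjs';

async function splitInference(inputTensor) {
  // Client-side half: latency/privacy-sensitive layers, publicly downloadable.
  const clientHalf = await tf.loadGraphModel('https://example.com/client-half/model.json');
  const activations = clientHalf.predict(inputTensor);

  // Ship the intermediate activations rather than the raw sensor data.
  const payload = {
    shape: activations.shape,
    values: Array.from(await activations.data()),
  };
  activations.dispose();

  // Server-side half: the protected layers never leave the server.
  const response = await fetch('https://example.com/infer', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  });
  return response.json();
}
```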