physical-computation / sunflower-embedded-system-emulator

Sunflower Full-System Hardware Emulator and Physical System Simulator for Sensor-Driven Systems. Built-in architecture modeling of Hitachi SH (j-core), RISC-V, and more.
http://sflr.org
BSD 3-Clause "New" or "Revised" License

Explanation of taint analysis/tracking and ideas about extensions #103

Open HAM41 opened 5 years ago

HAM41 commented 5 years ago

Taint Analysis Setup (Introduction)

The original purpose of taint tracking was to follow every instruction that uses information from a particular source (e.g. a sensor), which the user marks with a taint colour. A basic form of taint tracking has been implemented for Sunflower. It is enabled or disabled at compile time by setting the SF_TAINTANALYSIS flag in config.h to 1 or 0. The taint analysis functions are defined in taint.c and taint.h and used in op-riscv.c. Other files that contain some taint analysis code (to find these use "rg -i taint") are: main.c, main.h, lex-riscv.c, sf-riscv.y, sf.h (which includes taint.h), machine-riscv.h, machine-hitachi-sh.c, pipeline-riscv.c and mfns.h.

Taint is handled as follows. The user uses the taintreg and taintmem instructions to mark a register or a piece of memory as tainted, giving a PC value at which it acts as an origin of taint and a taint colour that should then be propagated. Taint colours are uint64_t values, allowing up to 64 different sources of taint in a program.

Each register and each byte of memory has a ShadowMem struct that holds the taint information for that register or memory address. These structs are stored in the arrays taintR (taintfR for floating-point registers) and TAINTMEM, which are defined just below the arrays for the normal registers and normal memory. The ShadowMem struct contains a taint colour and a memType (an enumeration type defined in taint.h).

For every RISC-V assembly instruction in op-riscv.c, the taints of the inputs are combined with a bitwise OR and the resulting taint is written to the output. This is done by the taintretreg, ftaintretreg and taintretmems functions, which return the taint colour of a register, floating-point register or memory address respectively, and by the taintprop function, which takes two taint colours, performs a bitwise OR, and writes the result to the shadow output register. Each time taintprop is called, the taint of the inputs is also recorded in the per-instruction array of ShadowMem values (instruction_taintDistribution), so that the instruction is marked as having handled that taint.
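As a rough illustration of the OR-based propagation described above, here is a minimal sketch in C. The sketch_ names, the struct layout and the function signatures are simplified stand-ins chosen for this example; they are assumptions, not the actual definitions in taint.h or op-riscv.c.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the structures described above.
   Names and signatures are illustrative, not Sunflower's actual API. */
enum { SKETCH_NUM_REGS = 32 };

typedef struct {
    uint64_t taintCol;   /* taint colour bits held by this register */
} SketchShadow;

static SketchShadow sketch_taintR[SKETCH_NUM_REGS];

/* Return the taint colour currently held by register `reg`. */
static uint64_t sketch_taintretreg(int reg)
{
    return sketch_taintR[reg].taintCol;
}

/* OR two input taints and write the result to the shadow of register `rd`. */
static void sketch_taintprop(uint64_t t1, uint64_t t2, int rd)
{
    sketch_taintR[rd].taintCol = t1 | t2;
}

/* What a handler for `add rd, rs1, rs2` might do once the result is computed. */
static void sketch_add_taint(int rd, int rs1, int rs2)
{
    sketch_taintprop(sketch_taintretreg(rs1), sketch_taintretreg(rs2), rd);
}
```

Because the destination's shadow is assigned rather than accumulated, writing a register from untainted inputs clears its taint.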

Extension ideas

Liberal/Conservative tainting

This idea came out of a discussion with Samuel Wong.

The idea is that taint propagation falls into two categories. Direct propagation is when a tainted register is copied to another register, or is an input to a logic operation whose output is written to another register. Indirect propagation is when a tainted register is involved in control flow. For example, if register A is tainted and B, C and D are all untainted, then the code below would be an example of indirect propagation:

if ( read(A) > read(B))
{
       C++;
}
else
{
       D++;
}

The proposal from the discussion was to have two different modes of tainting. Liberal tainting would pass taint on through both direct and indirect propagation, whereas conservative tainting would pass taint on only through direct propagation.

To implement this, the State structure could hold a flag, set or cleared by the user, that selects liberal or conservative tainting. Currently (27/08/2019) this is not implemented, and all tainting is fully liberal.
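One way the mode flag could work is sketched below. Everything here is hypothetical: the liberalTaint and controlTaint fields do not exist in Sunflower's State structure, and the question of when the control taint is set (on a branch) and cleared (where the branches rejoin) is left open.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical fields; not part of Sunflower's actual State structure. */
typedef struct {
    bool     liberalTaint;   /* true: propagate through control flow as well */
    uint64_t controlTaint;   /* taint of the branch condition currently in effect */
} SketchState;

/* Taint to record for a destination whose inputs carry the taint `direct`. */
static uint64_t sketch_result_taint(const SketchState *S, uint64_t direct)
{
    if (S->liberalTaint)
        return direct | S->controlTaint;   /* direct and indirect propagation */
    return direct;                         /* direct propagation only */
}
```

In the if/else example above, the increments of C and D have no tainted inputs, so in conservative mode they stay untainted, while in liberal mode they pick up the taint of A through controlTaint.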

Taint History

The implementation of taint analysis currently (27/08/2019) being worked on has the following rule for overwriting: if a register is overwritten, its taint becomes the taint of whatever overwrote it. So if register A has no taint and register B is tainted, and B is then overwritten with the value of A, B should have no taint after the overwrite. A potentially interesting extension is taint history, that is, being able to see all of the different taint values that a particular register or memory address has held at any point in its history.
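A cheap approximation of taint history would be to keep, next to the current taint, the OR of every taint the location has ever held. The sketch below assumes a hypothetical shadow struct with two fields; it is not the existing ShadowMem layout, and it records which taints were seen but not their order.

```c
#include <stdint.h>

/* Hypothetical shadow entry with a cumulative history field. */
typedef struct {
    uint64_t current;   /* taint held now; replaced on every overwrite */
    uint64_t ever;      /* OR of every taint this location has held */
} SketchHistShadow;

/* Apply the overwrite rule and fold the new taint into the history. */
static void sketch_overwrite(SketchHistShadow *s, uint64_t newTaint)
{
    s->current = newTaint;
    s->ever |= newTaint;
}
```

A full history (every taint value with the PC at which it was written) would need a log per location instead of a single word, at a much higher memory cost.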

Immediate Taints

Currently immediate values, such as the imm0 input for the addi instruction in op-riscv.c, are always treated as entirely free of taint. In certain cases these immediate values may originally come from a tainted source, and so should lead to taint propagation. Assigning and propagating taint for immediates presents two problems. First, the taint must be propagated at the point where the addi instruction is executed, because some immediates will be free of taint and others will not. Second, there is no shadow data structure for immediate values, so a new one would have to be created. One idea is to have an array of immediate taints, which might be 10 items long. When an immediate value is loaded from somewhere, it is assigned a place in that array and given the corresponding taint, which is propagated later when the immediate is used. The immediate array would then have to be cleared straight away, to prevent taints being written incorrectly by later instructions with immediate values.

Taint Granularity

Currently each register has one taint colour, and each byte of memory has its own taint. One extension that might be useful would be the idea of varying the size of each unit of taint. For example, if you assigned one taint for the upper 16 bits of a register and a different taint for the lower 16 bits of the register, then you could be more precise in taint propagation in cases such as the LH instruction, where really only half the register being written to should be tainted.

Implementing for other processors

Currently taint analysis is only implemented for the RISC-V processor, whereas Sunflower can also emulate the Hitachi SH and the TI MSP430. Implementing taint analysis for these other processors would make sense and should be relatively easy, as all the structures from the RISC-V implementation are already in place (note that I have no knowledge of the Hitachi SH or TI MSP430, so they may in fact have significant differences that make the task more difficult).

HAM41 commented 5 years ago

Additional Extension ideas

Safety checking

Currently there is no check that the memory addresses and register indices passed to the taint propagation functions are within an acceptable range. If memory ends at address 50 and a memory address of 75 is passed to the taint analysis functions, this leads to undefined behaviour. By putting the taint propagation code at the bottom of each RISC-V assembly instruction in op-riscv.c, I assumed that any out-of-bounds values would be caught earlier and lead to an appropriate error message. This is inherently risky, and if someone has time to improve the taint analysis implementation, I would prioritise this.
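The missing checks could look something like the sketch below. The function names and parameters (membase, memsize, nregs) are made up for illustration; the real code would take these limits from the simulator state, and would need to decide how to report a failure.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* True if the nbytes starting at addr lie inside [membase, membase + memsize).
   Written with subtractions so the comparisons cannot overflow. */
static bool sketch_taint_addr_ok(uint64_t addr, size_t nbytes,
                                 uint64_t membase, uint64_t memsize)
{
    if (addr < membase)
        return false;
    uint64_t offset = addr - membase;
    if (offset > memsize)
        return false;
    return nbytes <= memsize - offset;
}

/* True if reg is a valid index into a register shadow array of nregs entries. */
static bool sketch_taint_reg_ok(int reg, int nregs)
{
    return reg >= 0 && reg < nregs;
}
```

With a 50-byte memory starting at address 0, an access at address 75 is rejected before any shadow array is indexed.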

Automate taint marking

This idea was independently conceived by Phillip Stanley-Marbell.

Currently the marking of a tainted register/fregister/memory address is done by hand using a taintreg or equivalent instruction in a '.m' file, as exemplified for a bubblesort program below:

newnode     riscv
sizemem     96000000
srecl       bubble.sr
taintmem    0x0000000008009888 0x50 1 4
run
on

This requires people to know the memory address at which data is stored, and the PC value at which it is accessed, which requires looking at debugger information. Ideally, this process would be automated so that a user could mark a variable in an input program as an origin of taint, and then the correct taint addresses and PC values would automatically be marked as tainted.

Unimplemented RISC-V instructions

There are currently 10 RISC-V instructions which have no implementation whatsoever. These can be found in the op-riscv.c file. Look for the fence, or csrrci instruction. If and when these instructions are implemented, taint analysis should be implemented where relevant. Similarly the ecall instruction, as it appears to have no relation to any tainted register, has no taint analysis implemented.

More Usage Statistics

Currently the only statistics available to the user of a program are the taints that each RISC-V assembly instruction has come into contact with at some point. These can be accessed by the dumptaintdistr Sunflower command. Other statistics that might be interesting to implement include:

- Dumping every register (/ floating point register / memory address) that has a specific taint, or any non-zero taint.
- Dumping every register which, at a certain point in program execution, has a specific taint.
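The first statistic amounts to a scan over the shadow array with a taint mask. A minimal sketch follows, assuming the taints are available as a plain array of uint64_t; sketch_find_tainted is a hypothetical helper, not an existing Sunflower function.

```c
#include <stdint.h>

/* Collect the indices of all entries whose taint shares at least one bit
   with mask. Writes the indices to out (which must have room for n entries)
   and returns how many were found. */
static int sketch_find_tainted(const uint64_t *taints, int n,
                               uint64_t mask, int *out)
{
    int count = 0;

    for (int i = 0; i < n; i++) {
        if (taints[i] & mask)
            out[count++] = i;
    }
    return count;
}
```

Passing a single colour as the mask selects entries with that specific taint, and passing UINT64_MAX selects entries with any non-zero taint.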

Endianness

Endianness has not been considered at all in the implementation of taint analysis. If any memory or register addresses are ever referred to by the little-endian version of their address, this will cause problems with the current taint analysis implementation. This was assumed to be unlikely, hence no solution was implemented.