run_punet.py gains a --inputs argument through which a safetensors file can be supplied to override the default random inputs.
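A minimal sketch of how such an --inputs override might be wired up. The helper names and shapes here are illustrative, not the actual run_punet.py code; the real script would load tensors via safetensors rather than the pure-Python fallback shown.

```python
import argparse
import random


def make_random_inputs(shapes, seed=0):
    # Fallback when --inputs is not given: deterministic "random" values.
    rng = random.Random(seed)
    return {name: [rng.random() for _ in range(n)] for name, n in shapes.items()}


def load_inputs(path, shapes):
    if path is None:
        return make_random_inputs(shapes)
    # With a real model this would be something like
    # safetensors.torch.load_file(path) (assumed dependency).
    from safetensors.torch import load_file
    return load_file(path)


parser = argparse.ArgumentParser()
parser.add_argument(
    "--inputs", default=None,
    help="safetensors file overriding the default random inputs")
args = parser.parse_args([])  # no CLI args here; demo exercises the default path
inputs = load_inputs(args.inputs, {"sample": 4, "timestep": 1})
print(sorted(inputs))
```

Supplying the same safetensors file across runs makes results reproducible and comparable between implementations.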
Added APIs for trace_tensor(golden=True), which is meant to be kept permanently at key points in the model (as opposed to a normal trace_tensor call, which is a debugging aid to be removed when done).
Added a TURBINE_LLM_DEBUG=save_goldens_path=...some/path environment variable setting. This dumps a sequenced list of safetensors files, one for every tensor traced with golden=True.
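The setting above suggests a key=value flag format inside the env var; the parser below is a hypothetical sketch of how such a variable might be interpreted (the comma-separated multi-flag format is an assumption, not documented behavior).

```python
import os


def parse_debug_flags(raw):
    # Assumed format: comma-separated key=value settings, e.g.
    # "save_goldens_path=/tmp/goldens,other_flag=1".
    flags = {}
    for part in filter(None, raw.split(",")):
        key, _, value = part.partition("=")
        flags[key.strip()] = value.strip()
    return flags


os.environ["TURBINE_LLM_DEBUG"] = "save_goldens_path=/tmp/goldens"
flags = parse_debug_flags(os.environ.get("TURBINE_LLM_DEBUG", ""))
print(flags.get("save_goldens_path"))
```

With the path known, each golden trace can be written as a counter-prefixed safetensors file so the dump stays in trace order across the run.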
Added golden traces at the input/output and down/mid/up block boundaries in the punet model.
Verified with an offline script that, with our optimized linear layer disabled, we are within the margin of error of the Stability unet when using Brevitas i8 quantization. With optimized math enabled, the layer diverges slowly, producing a large deviation at the whole-model level. This should be easy to track down with the additional instrumentation.
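Tracking the divergence down amounts to comparing two golden dumps in trace order and reporting the first boundary where the optimized path drifts past tolerance. A pure-Python sketch of that comparison (the real script would load safetensors files and compare torch tensors; the names and data below are illustrative):

```python
def first_divergence(ref, test, rtol=1e-2):
    # Walk goldens in trace order (insertion order of the dicts); report
    # the first tensor whose max relative error exceeds rtol. This
    # localizes where the optimized-math layer starts to drift.
    for name in ref:
        err = max(abs(a - b) / (abs(a) or 1.0)
                  for a, b in zip(ref[name], test[name]))
        if err > rtol:
            return name, err
    return None, 0.0


# Illustrative goldens: "down_0" matches closely, "mid" has drifted.
ref = {"down_0": [1.00, 2.00], "mid": [3.00, 4.00]}
test = {"down_0": [1.001, 2.001], "mid": [3.5, 4.0]}
print(first_divergence(ref, test))  # → ('mid', ...)
```

Because the dumps are sequenced, the first failing trace bounds the divergence to one block, instead of only observing a large error at the model output.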