Move precision and SIMD vector types from the DNN and BLAS libraries to the Snitch runtime.
Rename `snrt_l1alloc` and `snrt_l3alloc` to `snrt_l1_alloc` and `snrt_l3_alloc`, respectively, in line with all other alloc functions.
Replace the `size` field with an `end` field in `snrt_allocator_t`, for faster bound checks.
Add `alloc_v2` functions: allocator structs are now core-local rather than shared, for faster access. This imposes the constraint that all cores must update their allocator pointers to keep them aligned (trading additional computation for performance). To simplify this, we provide the convenience functions `snrt_l1_alloc_cluster_local` and `snrt_l1_alloc_compute_core_local`, which must be called by every core.
Split `snrt_global_barrier` into `snrt_inter_cluster_barrier` plus a cluster barrier.
Add `snrt_cluster_is_last_compute_core` function, which may be used to handle remainder iterations on the last compute core.
Move reusable functions in `start.c` to `start.h` for proper inlining. Move all defines to `*_start.h` and include this in `*_start.c` to ensure consistent definitions.