This commit extends the benchmarking step to also test the model's compilation statistics. I moved all the benchmarking code into an sdxl subdirectory of benchmarking so that each model can have its own conftest.py; this way we don't flood a single conftest file with the parameters needed for every model.
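To illustrate the layout described above, here is a minimal sketch of what a per-model conftest.py (e.g. under benchmarking/sdxl/) might look like. The fixture name and parameter values are illustrative assumptions, not the test suite's actual configuration.

```python
import pytest

# Model-specific parameters kept local to this model's directory,
# instead of crowding one shared conftest with every model's values.
# These names and numbers are placeholders for illustration only.
SDXL_PARAMS = {
    "model": "sdxl",
    "batch_size": 1,
    "iterations": 3,
}

@pytest.fixture(scope="session")
def benchmark_params():
    """Expose this model's parameters to tests in this subdirectory."""
    return SDXL_PARAMS
```

Tests in the sdxl subdirectory can then request `benchmark_params` directly, while other models ship their own conftest.py with their own values.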
Instead of <=, do you think I should go with == for the asserts on dispatch count and size? (== would force contributors to update the baseline whenever it changes, and it is a static value.)
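The trade-off between the two assertion styles can be sketched as follows. The stat names and baseline values here are hypothetical stand-ins, not the suite's real numbers or API.

```python
# Hypothetical baselines for illustration only.
EXPECTED_DISPATCH_COUNT = 1602
EXPECTED_MODULE_SIZE_BYTES = 750_000_000

def check_compilation_stats(stats, exact=False):
    """Compare measured compilation stats against the baseline.

    With exact=False (the <= style), a regression fails but an
    improvement silently passes, so the baseline can go stale.
    With exact=True (the == style), any change in either direction
    fails, forcing contributors to update the baseline.
    """
    if exact:
        assert stats["dispatch_count"] == EXPECTED_DISPATCH_COUNT
        assert stats["module_size"] == EXPECTED_MODULE_SIZE_BYTES
    else:
        assert stats["dispatch_count"] <= EXPECTED_DISPATCH_COUNT
        assert stats["module_size"] <= EXPECTED_MODULE_SIZE_BYTES
```

For example, a compiler improvement that drops the dispatch count passes under <= but fails under ==, which is exactly the signal that would prompt a baseline update.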
Here is what the updated job summary looks like (https://github.com/nod-ai/SHARK-TestSuite/actions/runs/9586667587):