add sqlglot benchmarks

tobymao commented 2 years ago

sqlglot is a pure python sql parser, transpiler, and optimizer

tobymao commented 2 years ago

i can add that in the future. execution is still in alpha

tobymao commented 2 years ago

@mdboom do you need anything from me in order to land these changes?

mdboom commented 2 years ago

> @mdboom do you need anything from me in order to land these changes?

This looks fine to me, but I don't have merge rights. Maybe @ericsnowcurrently can have a look.

ericsnowcurrently commented 2 years ago

FWIW, there are some extra things to consider as we work on building good benchmark suites:

- macro-benchmark vs. micro-benchmark (vs. in between)
  - we need a good representation of the community's workloads in our macro-benchmark suite
  - the micro-benchmark suite is important because it helps us identify the root cause of regressions more quickly
  - we haven't carefully considered the value of benchmarks that fall in-between yet (I'm sure there are benefits to them)
- categorizing/tagging benchmarks is important but can be hard; we haven't been diligent about it yet
- we will probably end up dropping some of the benchmarks when we eventually get around to improving the suite selections

Relative to this benchmark specifically:

- it feels like an in-between one (not quite a macro-benchmark but more complex than a micro-benchmark)
- could it be made to represent a full Python workload more closely (or integrated into such a benchmark)?
- what workloads would it represent or be a part of?
- how much coverage of those workloads is already in the pyperformance suite?
- how should this benchmark be categorized/tagged?

(I'm sure we'll merge it in regardless of the answers.)

ericsnowcurrently commented 2 years ago

Another thing to consider is that the sqlglot project should probably have this benchmark as part of its own suite (in its own repo), regardless of its inclusion in the pyperformance suite.

tobymao commented 2 years ago

@ericsnowcurrently ready for another look.

and sure, i can add these benchmarks to sqlglot's own suite

tobymao commented 2 years ago

> Relative to this benchmark specifically:
>
> - it feels like an in-between one (not quite a macro-benchmark but more complex than a micro-benchmark)
> - could it be made to represent a full Python workload more closely (or integrated into such a benchmark)?
> - what workloads would it represent or be a part of?
> - how much coverage of those workloads is already in the pyperformance suite?
> - how should this benchmark be categorized/tagged?
>
> (I'm sure we'll merge it in regardless of the answers.)

in terms of workflows, it represents a good chunk, in that people want to parse many sql queries (data engineering / analytics). the normalizer also represents mutation of queries, which is another kind of macro workflow. some companies use sqlglot to parse tens of thousands of sql queries to extract metadata.

sqlglot has a prototype engine which could represent more macro workflows, but it's not quite ready yet and not something i want to expose at this point.
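
For illustration only, the "parse many sql queries" workload described above could be timed with a sketch like the one below. It is not the benchmark from this PR: `bench`, `toy_parse`, and `QUERIES` are hypothetical names, and `toy_parse` is a trivial stand-in tokenizer so the sketch runs with the standard library alone. In a real measurement one would presumably pass a real parser (for example `sqlglot.parse_one`) as `parse` and use pyperf rather than hand-rolled timing.

```python
import time
from typing import Callable, Iterable


def bench(parse: Callable[[str], object], queries: Iterable[str], loops: int = 5) -> float:
    """Return the best wall-clock time (in seconds) to parse every query once."""
    queries = list(queries)
    best = float("inf")
    for _ in range(loops):
        start = time.perf_counter()
        for sql in queries:
            parse(sql)
        # Keep the fastest pass to reduce noise from other processes.
        best = min(best, time.perf_counter() - start)
    return best


def toy_parse(sql: str) -> list:
    # Hypothetical stand-in for a real SQL parser: split on whitespace,
    # treating commas as separate tokens.
    return sql.replace(",", " , ").split()


# Synthetic batch of queries, standing in for a metadata-extraction job.
QUERIES = [f"SELECT a, b FROM t{i} WHERE a > {i}" for i in range(10_000)]

if __name__ == "__main__":
    elapsed = bench(toy_parse, QUERIES)
    print(f"parsed {len(QUERIES)} queries in {elapsed:.4f}s (best of 5)")
```

Taking the parser as a parameter keeps the timing loop separate from the library under test, so the same harness could time parsing and normalization side by side.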

ericsnowcurrently commented 2 years ago

Thanks for the benchmark!

python / pyperformance

add sqlglot benchmarks #221