Closed: mccalluc closed 2 weeks ago
Move this to the API docs, as an example of UDF. -- Make a separate page for UDF functionality.
UDF will be called "plugins" going forward.
Get rid of faker, but ok to add pandas as a test dependency.
Goldilocks columns make sense.
To get it to run in a reasonable period of time, the size of the dataset needed to be much reduced. I also started out with 10 of each kind of column, but 2^30 is a big number. And about half of the column sets it returns are still too_diverse or too_uniform.
I feel like the optimizations needed to make this reasonable for more rows and more columns could get in the way of the demonstration of plugins.
But maybe this is fine?
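The goldilocks filtering being discussed can be sketched roughly like this (a minimal illustration only; `classify_columns` and the thresholds are mine, not the PR's actual plugin API): classify each column by its distinct-value count, keeping a band between too_uniform and too_diverse.

```python
# Hypothetical sketch, not the PR's implementation: a column is "goldilocks"
# when its number of distinct values falls between two illustrative thresholds.
def classify_columns(columns, min_groups=2, max_groups=20):
    """columns: mapping of column name -> list of values."""
    verdicts = {}
    for name, values in columns.items():
        n = len(set(values))
        if n < min_groups:
            verdicts[name] = "too_uniform"
        elif n > max_groups:
            verdicts[name] = "too_diverse"
        else:
            verdicts[name] = "goldilocks"
    return verdicts
```

With thresholds like these, a constant column lands in too_uniform, a unique ID in too_diverse, and a small categorical in the middle.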
It looks like the pandas dev dependency (and numpy in particular) is not installing well on Python 3.8:
ERROR: Ignored the following versions that require a different python version: 1.25.0 Requires-Python >=3.9; 1.25.0rc1 Requires-Python >=3.9; 1.25.1 Requires-Python >=3.9; 1.25.2 Requires-Python >=3.9; 1.26.0 Requires-Python <3.13,>=3.9; 1.26.0b1 Requires-Python <3.13,>=3.9; 1.26.0rc1 Requires-Python <3.13,>=3.9; 1.26.1 Requires-Python <3.13,>=3.9; 1.26.2 Requires-Python >=3.9; 1.26.3 Requires-Python >=3.9; 1.26.4 Requires-Python >=3.9; 2.0.0b1 Requires-Python >=3.9; 2.0.0rc1 Requires-Python >=3.9
ERROR: Could not find a version that satisfies the requirement numpy==1.26.4
Maybe running pip-compile in a 3.8 environment would produce a requirements-dev.txt that works for Python 3.8? Or pin numpy in requirements-dev.in to make sure that we get a version that works in 3.8?
I'm more excited about dropping 3.8 than about jumping through more hoops to make it happy, so I'll try that. Can always revert if that's not the right course.
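For reference, the pin route would have looked something like this (hypothetical sketch; the pip error log above shows every numpy release from 1.25.0 onward requires Python >= 3.9, so 1.24.x is the last series that installs on 3.8):

```
# requirements-dev.in (hypothetical pin, had we kept 3.8 support)
numpy<1.25
pandas
```

followed by running pip-compile on requirements-dev.in inside a 3.8 environment to regenerate requirements-dev.txt.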
3.9 seems to be the fix. At this point, I think this is good overall. I think it would be good to clarify whether we're really proposing this for real-life use, or if it's just an example.
Is there a combinators plugin example? Will look.
Can make a separate PR for other 3.8 -> 3.9 work.
> Is there a combinators plugin example? Will look.
Nope! Don't worry about it!
(Just flagged both of you as reviewers, since we seem to agree that this is going in the right direction.)
This runs locally with faker and pandas installed, but I think we'd prefer not to add those as dev dependencies?
I'm also just not sure how persuasive this example is: It feels like we're doing a lot of work to just pull one example from a very small set.
Proposal: the fake dataset should have a lot of columns, and the column data should vary systematically, so we can see that it's picking a good set of columns. My hope is that if we gave it such a dataset, it would pick the small random-integer columns for grouping. Is that the behavior we'd expect, and would it make a good example?
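A sketch of what that systematically varying dataset could look like (assumed shape; `make_fake_dataset` and the column-naming scheme are mine, not the PR's): each column draws from a range twice as large as the previous one, so cardinality sweeps from trivially small to far too diverse, using only the standard library rather than faker or pandas.

```python
import random

# Hypothetical generator: columns int_2, int_4, ..., int_1024 whose
# distinct-value counts roughly double from one column to the next.
def make_fake_dataset(n_rows=1000, n_cols=10, seed=0):
    rng = random.Random(seed)
    return {
        f"int_{2 ** k}": [rng.randrange(2 ** k) for _ in range(n_rows)]
        for k in range(1, n_cols + 1)
    }
```

A column-picking plugin run over this should favor the low-cardinality int_* columns for grouping and reject the near-unique ones.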