Closed: mccalluc closed 2 weeks ago
Move this to the API docs, as an example of UDF. -- Make a separate page for UDF functionality.
UDF will be called "plugins" going forward.
Get rid of faker, but ok to add pandas as a test dependency.
Goldilocks columns make sense.
To get it to run in a reasonable period of time, the size of the dataset needed to be much reduced. I also started out with 10 of each kind of column, but 2^30 is a big number. And about half of the column sets it returns are still too_diverse or too_uniform.
I feel like the optimizations needed to make this reasonable for more rows and more columns could get in the way of the demonstration of plugins.
But maybe this is fine?
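The goldilocks filtering being discussed can be sketched roughly like this (a minimal illustration only; `classify_columns` and the thresholds are mine, not the PR's actual plugin API): classify each column by its distinct-value count, keeping a band between too_uniform and too_diverse.

```python
# Hypothetical sketch, not the PR's implementation: a column is "goldilocks"
# when its number of distinct values falls between two illustrative thresholds.
def classify_columns(columns, min_groups=2, max_groups=20):
    """columns: mapping of column name -> list of values."""
    verdicts = {}
    for name, values in columns.items():
        n = len(set(values))
        if n < min_groups:
            verdicts[name] = "too_uniform"
        elif n > max_groups:
            verdicts[name] = "too_diverse"
        else:
            verdicts[name] = "goldilocks"
    return verdicts
```

With thresholds like these, a constant column lands in too_uniform, a unique ID in too_diverse, and a small categorical in the middle.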
It looks like the pandas dev dependency (and numpy in particular) is not installing well on Python 3.8:
ERROR: Ignored the following versions that require a different python version: 1.25.0 Requires-Python >=3.9; 1.25.0rc1 Requires-Python >=3.9; 1.25.1 Requires-Python >=3.9; 1.25.2 Requires-Python >=3.9; 1.26.0 Requires-Python <3.13,>=3.9; 1.26.0b1 Requires-Python <3.13,>=3.9; 1.26.0rc1 Requires-Python <3.13,>=3.9; 1.26.1 Requires-Python <3.13,>=3.9; 1.26.2 Requires-Python >=3.9; 1.26.3 Requires-Python >=3.9; 1.26.4 Requires-Python >=3.9; 2.0.0b1 Requires-Python >=3.9; 2.0.0rc1 Requires-Python >=3.9
ERROR: Could not find a version that satisfies the requirement numpy==1.26.4
Maybe running pip-compile in a 3.8 environment would produce a requirements-dev.txt that works for Python 3.8? Or pin numpy in requirements-dev.in to make sure that we get a version that works in 3.8?
I'm more excited about dropping 3.8 than about jumping through more hoops to make it happy, so I'll try that. Can always revert if that's not the right course.
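For reference, the pin route would have looked something like this (hypothetical sketch; the pip error log above shows every numpy release from 1.25.0 onward requires Python >= 3.9, so 1.24.x is the last series that installs on 3.8):

```
# requirements-dev.in (hypothetical pin, had we kept 3.8 support)
numpy<1.25
pandas
```

followed by running pip-compile on requirements-dev.in inside a 3.8 environment to regenerate requirements-dev.txt.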
3.9 seems to be the fix. At this point, I think this is good overall. I think it would be good to clarify whether we're really proposing this for real-life use, or if it's just an example.
Is there a combinators plugin example? Will look.
Can make a separate PR for other 3.8 -> 3.9 work.
> Is there a combinators plugin example? Will look.
Nope! Don't worry about it!
(Just flagged both of you as reviewers, since we seem to agree that this is going in the right direction.)
This runs locally with faker and pandas installed, but I think we'd prefer not to add those as dev dependencies?
I'm also just not sure how persuasive this example is: It feels like we're doing a lot of work to just pull one example from a very small set.
Proposal: the fake dataset should have a lot of columns, and the column data should vary systematically, so we can see that it's picking a good set of columns. My hope is that if we gave it such a dataset, it would pick the small random-integer columns for grouping. Is that the behavior we'd expect, and would it make a good example?
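A sketch of what that systematically varying dataset could look like (assumed shape; `make_fake_dataset` and the column-naming scheme are mine, not the PR's): each column draws from a range twice as large as the previous one, so cardinality sweeps from trivially small to far too diverse, using only the standard library rather than faker or pandas.

```python
import random

# Hypothetical generator: columns int_2, int_4, ..., int_1024 whose
# distinct-value counts roughly double from one column to the next.
def make_fake_dataset(n_rows=1000, n_cols=10, seed=0):
    rng = random.Random(seed)
    return {
        f"int_{2 ** k}": [rng.randrange(2 ** k) for _ in range(n_rows)]
        for k in range(1, n_cols + 1)
    }
```

A column-picking plugin run over this should favor the low-cardinality int_* columns for grouping and reject the near-unique ones.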