kieferk / dfply

dplyr-style piping operations for pandas dataframes
GNU General Public License v3.0

Suggestions for improvements on the 'from dfply import *' front #75

Closed TyberiusPrime closed 5 years ago

TyberiusPrime commented 5 years ago

There's a universal (and justified) dislike in the Python community for * imports. Now, I admit that dfply (great work, btw) is a pain to use without one.

But, it currently has a bunch of things in the user-importable namespace that we could possibly clean up.

A quick accounting of the 129 exports from dfply by type:

type                                                count
<class 'dfply.base.Intention'>                      1
<class 'dict'>                                      1
<class 'NoneType'>                                  1
<class '_frozen_importlib_external.SourceFileL...   1
<class 'list'>                                      1
<class '_frozen_importlib.ModuleSpec'>              1
<class 'pandas.core.frame.DataFrame'>               1
<class 'type'>                                      5
<class 'str'>                                       6
<class 'module'>                                    16
<class 'dfply.base.pipe'>                           40
<class 'function'>                                  55

Presumably, only the dfply.base.pipe objects and a subset of the functions are 'verbs'.

My suggestion would be to introduce two additional namespaces

and update the examples to use from dfply.verbs import * instead of from dfply import *

This way we would a) not break anyone's code and b) have a clean, 'non-polluting' module that users can import.

What do you think?

TyberiusPrime commented 5 years ago

To be fair, we only export 120 symbols, the other 9 start with __, which cleans up the above table a bit.

type                                    count
<class 'function'>                      55
<class 'dfply.base.pipe'>               40
<class 'module'>                        16
<class 'type'>                          5
<class 'str'>                           2
<class 'dfply.base.Intention'>          1
<class 'pandas.core.frame.DataFrame'>   1

Still, we could at least drop the modules, the strings, and possibly the diamonds dataset (plotnine has it as well) from the main export?

pchtsp commented 5 years ago

Not sure if this helps or just adds more noise. I started working with this library a few days ago (it's excellent, by the way), and what I usually do is the following:

import dfply as dp
from dfply import X

For the rest of the functions I simply use:

table >> dp.select(X.variable)
table >> dp.mutate(column = X.something)

For me it's actually quite natural since it's just the opposite symbol to what I used to use with pandas (dp instead of pd).

TyberiusPrime commented 5 years ago

I got so unhappy about the state of Python dplyr clones that I wrote my own: https://github.com/TyberiusPrime/dppd

I've also written a comparison / rosetta stone for the Python dplyr clones: https://dppd.readthedocs.io/en/latest/comparisons.html

kieferk commented 5 years ago

dppd looks cool, though I've only given it a cursory look. I'll have to go over the code in more detail but it looks like a pretty cool take on the NSE-in-python problem.

Perhaps one way to resolve the from dfply import * issue is just to explicitly list everything in the respective __init__.py files rather than having the import * statements. That way, when the user imports everything it should only pull the dfply-specific classes and functions.
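
As a rough illustration of the explicit-export idea, the sketch below builds a throwaway module and compares what `from ... import *` binds with and without an `__all__` list. The module name `fakepkg` and everything in it are hypothetical stand-ins, not dfply's actual contents; the only assumption about dfply is that its `__init__.py` files would behave like any other Python module here.

```python
import math
import sys
import types


def star_import(module_name):
    """Return the sorted names that `from <module_name> import *` binds."""
    namespace = {}
    exec(f"from {module_name} import *", namespace)
    namespace.pop("__builtins__", None)  # added by exec, not by the import
    return sorted(namespace)


# Hypothetical stand-in for a package __init__; none of these names are dfply's.
fakepkg = types.ModuleType("fakepkg")
fakepkg.math = math                      # a helper module that leaked in
fakepkg.VERSION = "0.0"                  # a stray string
fakepkg.select = lambda df, *cols: df    # the only intended "verb"
sys.modules["fakepkg"] = fakepkg

# Without __all__, every name that lacks a leading underscore is exported.
without_all = star_import("fakepkg")

# With __all__, only the listed names are exported.
fakepkg.__all__ = ["select"]
with_all = star_import("fakepkg")

print(without_all)
print(with_all)
```

Defining `__all__` leaves `import fakepkg` and attribute access such as `fakepkg.math` untouched, so existing code that reaches into the module keeps working; only the star import gets narrower.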

Is that the crux of the issue here or am I misunderstanding the complaint?

kieferk commented 5 years ago

OK @TyberiusPrime, so I made a new branch called import-fixes that you can check out, which removes the * import stuff from the __init__.py file. I'm not sure what you're using to profile the imports into the namespace, but (if you still care) let me know if this branch resolves your namespace issue and I'll merge it into master.

TyberiusPrime commented 5 years ago

Those fixes will work nicely.

My 'profiling' was along these lines...

import collections

import dfply

# Tally every name that dir(dfply) reports by the type of the object it refers to.
c = collections.Counter()
for d in dir(dfply):
    t = type(getattr(dfply, d))
    c[t] += 1
print(c)
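
A variant of the same tally that skips underscore-prefixed names, which is roughly what produced the 120-symbol table earlier in the thread, might look like the sketch below. The helper name `count_public_names_by_type` and the `demo` module are hypothetical and exist only so the example runs without dfply installed. Note that it skips any name starting with a single underscore, which matches what a star import does when no `__all__` is defined.

```python
import collections
import types


def count_public_names_by_type(module):
    """Count a module's non-underscore names, grouped by type name."""
    counts = collections.Counter()
    for name in dir(module):
        if name.startswith("_"):
            continue  # skip dunder and private names
        counts[type(getattr(module, name)).__name__] += 1
    return counts


# Hypothetical stand-in module so the sketch runs without dfply installed.
demo = types.ModuleType("demo")
demo.collections = collections  # a leaked module
demo.label = "x"                # a stray string
demo.verb = len                 # a callable standing in for a verb

counts = count_public_names_by_type(demo)
for type_name, n in counts.most_common():
    print(type_name, n)
```

With dfply installed, `count_public_names_by_type(dfply)` should give the per-type breakdown for whichever branch is checked out.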