sdvillal / whatami

Easily provide python objects with self-identification

Support for numpy arrays and pandas dataframes #5

Closed sdvillal closed 9 years ago

sdvillal commented 9 years ago

As of version 3.0.0, whatami does not properly handle numpy arrays or pandas dataframes:

>>> print(whatareyou(np.zeros(10000), add_properties=True).id())
ndarray(T=[ 0.  0.  0. ...,  0.  0.  0.],base=None,ctypes=_ctypes(data=25915040,shape=c_long_Array_1(),strides=c_long_Array_1()),data=memoryview(c_contiguous=True,contiguous=True,f_contiguous=True,format='d',itemsize=8,nbytes=80000,ndim=1,obj=[ 0.  0.  0. ...,  0.  0.  0.],readonly=False,shape=(10000),strides=(8),suboffsets=()),dtype=float64,flags=  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False,flat=flatiter(base=[ 0.  0.  0. ...,  0.  0.  0.],coords=(0),index=0),imag=[ 0.  0.  0. ...,  0.  0.  0.],itemsize=8,nbytes=80000,ndim=1,real=[ 0.  0.  0. ...,  0.  0.  0.],shape=(10000),size=10000,strides=(8))

>>> print(whatareyou(pd.DataFrame(np.zeros((2, 2)), columns=['a', 'b'])).id())
DataFrame(T=   0  1
a  0  0
b  0  0,at=_AtIndexer(axis=None,name='at',ndim=2,obj=   a  b
0  0  0
1  0  0),axes=[Int64Index([0, 1], dtype='int64'),Index(['a', 'b'], dtype='object')],blocks={float64=   a  b
0  0  0
1  0  0},columns=Index(['a', 'b'], dtype='object'),dtypes=a    float64
b    float64
dtype: object,empty=False,ftypes=a    float64:dense
b    float64:dense
dtype: object,iat=_iAtIndexer(axis=None,name='iat',ndim=2,obj=   a  b
0  0  0
1  0  0),iloc=_iLocIndexer(axis=None,name='iloc',ndim=2,obj=   a  b
0  0  0
1  0  0),index=Int64Index([0, 1], dtype='int64'),is_copy=None,ix=_IXIndexer(axis=None,name='ix',ndim=2,obj=   a  b
0  0  0
1  0  0),loc=_LocIndexer(axis=None,name='loc',ndim=2,obj=   a  b
0  0  0
1  0  0),ndim=2,shape=(2,2),size=4,values=[[ 0.  0.]
 [ 0.  0.]])

Here we could use hashes for large objects and other custom representations depending on the nature of the data, possibly also including in the string things like dtypes, column names, contiguity... For example:

>>> print(whatareyou(np.zeros(10000), add_properties=True).id())
ndarray(data='ASHA32')
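
One way the hash-based id could look is sketched below. This is only an illustration, not whatami's implementation: the function name `buffer_id`, the choice of SHA-256, the 16-character truncation and the exact output layout are all assumptions. It relies on the standard buffer protocol (`memoryview`), so it should apply to any object that exports a buffer, which includes numeric numpy arrays; the example uses the standard library `array` module to stay dependency-free.

```python
import hashlib
from array import array


def buffer_id(obj, digest_chars=16):
    """Build a short id string for any object exposing the buffer protocol.

    Hypothetical helper: the name, the hash function and the output layout
    are assumptions made for illustration only.
    """
    view = memoryview(obj)  # raises TypeError if obj exports no buffer
    # Hash the raw bytes instead of printing the values.
    digest = hashlib.sha256(view.tobytes()).hexdigest()[:digest_chars]
    shape = ','.join(str(dim) for dim in view.shape)
    return "%s(format='%s',hash='%s',shape=(%s))" % (
        type(obj).__name__, view.format, digest, shape)


# Standard-library stand-in for a large array of float64 zeros.
zeros = array('d', [0.0] * 10000)
print(buffer_id(zeros))
```

The resulting string stays short regardless of the array size, and two arrays with the same bytes, item format and shape get the same id. Objects that do not export a buffer (for example arrays of Python objects) would need a different strategy.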

For pandas, things might be a tad more complex if we also want to include information about the indices.

Since we do not want to depend on these libraries, we need to branch the code for when they are not present (plugins?).
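
A minimal sketch of such branching, assuming a plugin-like registry of per-type handlers: numpy is imported only if available, and its handler is registered only in that case. The names `optional_import`, `register_handler` and `describe` are hypothetical and are not part of whatami; the fallback to `repr` is also just a placeholder for whatever the default behaviour should be.

```python
import hashlib
import importlib

_HANDLERS = []  # (type, handler) pairs, checked in registration order


def optional_import(module_name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def register_handler(type_, handler):
    """Register a function that builds the id string for instances of type_."""
    _HANDLERS.append((type_, handler))


def describe(obj):
    """Use the first matching handler; fall back to repr (placeholder default)."""
    for type_, handler in _HANDLERS:
        if isinstance(obj, type_):
            return handler(obj)
    return repr(obj)


np = optional_import('numpy')
if np is not None:
    # Only defined and registered when numpy is actually installed.
    def _describe_ndarray(arr):
        data = np.ascontiguousarray(arr).tobytes()
        digest = hashlib.sha256(data).hexdigest()[:16]
        return "ndarray(dtype=%s,hash='%s',shape=%s)" % (
            arr.dtype, digest, arr.shape)

    register_handler(np.ndarray, _describe_ndarray)
```

With this layout the core code never imports numpy or pandas directly; a pandas handler could be registered the same way, and third-party code could add handlers for its own types.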