Closed cchrysostomou closed 4 years ago
@costas821 I cannot reproduce this (also using Windows 7, pandas 0.17.1). If you run the above code sample in a new session, you get that error?
This fails on the astype. The S1 dtype (like all fixed-size string dtypes) is not supported and should be converted to object. Kind of puzzled why it is not. So I'll mark this as a bug.
So .astype('U1') works as expected (IOW it coerces to object), but I think we need to either raise on S dtypes in PY3 or just coerce as we do for unicode, though the user is technically saying that they want to encode.
Well, I was kind of hoping that datatype could be supported. When it's represented as an object, the memory it takes up is extremely high, when all I need is for each cell to take up a single byte. Everything except for 'printing' seemed to work for me. Is there any workaround for this?
you are much better off using categoricals
# your frame
In [17]: df.memory_usage(deep=True).sum()
Out[17]: 2300072
In [18]: uniques = np.sort(pd.unique(df.values.ravel()))
# converted to categoricals (I happen to preserve the mappings, but it's actually not necessary)
In [19]: df.apply(lambda x: x.astype('category',categories=uniques)).memory_usage(deep=True).sum()
Out[19]: 84572
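For anyone reading this on a recent pandas: as far as I know, the categories= keyword to astype shown above was later removed, and the categories are passed through pd.CategoricalDtype instead. A minimal sketch of the same memory comparison, assuming a made-up frame of single-character strings (the frame, its shape, and the variable names are illustrative, not the original poster's data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the frame in this issue: one character per cell.
rng = np.random.default_rng(0)
letters = np.array(list("ACGT"))
df = pd.DataFrame(rng.choice(letters, size=(1000, 20)))

# Memory used when each cell is a Python string.
object_bytes = df.memory_usage(deep=True).sum()

# Share one set of sorted categories across all columns.
uniques = np.sort(pd.unique(df.values.ravel()))
cat_dtype = pd.CategoricalDtype(categories=uniques)
df_cat = df.astype(cat_dtype)

# Memory used when each cell is a small integer code into the categories.
cat_bytes = df_cat.memory_usage(deep=True).sum()

print(object_bytes, cat_bytes)
```

The exact numbers depend on the pandas version and platform, but the categorical frame should be much smaller because each column stores compact integer codes plus one copy of the categories.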
OK I can go that route, but now I am having some functionality issues. Some things that worked before, no longer work when I set it as a category. If you don't think this is pertinent to the issue, then should I just send you a personal message of what I am trying to do and some sample code?
# set my frame as category
uniques = np.sort(pd.unique(df.values.ravel()))
df = df.apply(lambda x: x.astype('category', categories=uniques))
# slicing and search operations
df_ints = pd.DataFrame(np.zeros((10000, 500)))
df_ints.iloc[5, 3] = 1  # set a single cell (df_ints[5,3] would add a new column instead)
# when df is a category, I cannot do the following
df[df_ints==0] = 'Z'
# this also raises an error
df_ints == 'A'
Categoricals only allow values from a fixed set, IOW the categories themselves. You can add to the categories:
In [75]: df2 = df.apply(lambda x: x.astype('category', categories=uniques.tolist() + ['Z']))
In [77]: df2.iloc[0,1] = 'Z'
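A version of the same idea for recent pandas, as a sketch under assumptions: it uses Series.cat.add_categories to extend the allowed set on categorical columns before assigning. The small frame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical small frame; every column shares the categories A, C, G, T.
base = pd.CategoricalDtype(categories=list("ACGT"))
df = pd.DataFrame({"a": list("ACGT"), "b": list("TGCA")}).astype(base)

# Extend the allowed set with 'Z' on every column, then assign it.
df2 = df.apply(lambda col: col.cat.add_categories(["Z"]))
df2.iloc[0, 1] = "Z"

print(df2)
print(df2["b"].cat.categories)
```

The original frame df is left untouched; only df2 carries the extra category.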
Whoops that was a bad example, my mistake. What I was trying to show was that I cannot use the dataframe df_ints to change values:
df[df_ints==0] = 'A' # where 'A' is already defined in set.
or find where df is 'A':
df[df=='A']
hmm, that should work, see #12861 . well good of you to test this out!
In the meantime you can do .astype('U1') to save some memory (or of course pull-requests to fix issues always welcome!)
This looks fixed on master. Could use a test.
Removing the p2/p3 compat label, as Python2 is being dropped and this issue still needs tests.
This was fixed by https://github.com/pandas-dev/pandas/pull/30327 (https://github.com/pandas-dev/pandas/commit/ccbe7be367e970dcbcd526f6b883b9db20979638 specifically I think).
I am trying to create a dataframe where each cell is represented as a single character rather than a Python object. I am able to create and work with the dataframe when using the .astype command. However, if I try to print out a larger portion of the table, then I get an error.
Code Sample, a copy-pastable example if possible
error raised
output of pd.show_versions()
commit: None
python: 3.4.4.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 61 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.17.1
nose: None
pip: 8.1.1
setuptools: 20.3
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
IPython: 4.1.1
sphinx: 1.4b1
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.6
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
Jinja2: 2.8