pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: not properly converting S1 in astype, on PY3 #12857

Closed cchrysostomou closed 4 years ago

cchrysostomou commented 8 years ago

I am trying to create a dataframe where each cell is stored as a single character rather than a Python object. I am able to create and work with the dataframe when using .astype. However, if I try to print a larger portion of the table, I get an error.

Code Sample, a copy-pastable example if possible

import random
import pandas as pd
lets = 'ACDEFGHIJKLMNOP'
slen = 50
nseqs = 1000
words = [[random.choice(lets) for x in range(slen)] for _ in range(nseqs)]
df = pd.DataFrame(words).astype('S1')
#this will print correctly:
print(df.iloc[:60, :])
#this will raise an error:
print(df.iloc[:61, :])

error raised

C:\Anaconda3\lib\site-packages\pandas\core\internals.py in _vstack(to_stack, dtype)
   4248 
   4249     # work around NumPy 1.6 bug
-> 4250     if dtype == _NS_DTYPE or dtype == _TD_DTYPE:
   4251         new_values = np.vstack([x.view('i8') for x in to_stack])
   4252         return new_values.view(dtype)
TypeError: data type "bytes8" not understood

output of pd.show_versions()

commit: None
python: 3.4.4.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 61 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.17.1
nose: None
pip: 8.1.1
setuptools: 20.3
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
IPython: 4.1.1
sphinx: 1.4b1
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.6
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
Jinja2: 2.8

jorisvandenbossche commented 8 years ago

@costas821 I cannot reproduce this (also using Windows 7, pandas 0.17.1). If you run the above code sample in a new session, do you still get that error?

jreback commented 8 years ago

This fails on the astype. dtype S1 (like all fixed-size string dtypes) is not supported and should be converted to object. Kind of puzzled why it is not, so I'll mark this as a bug.

jreback commented 8 years ago

So .astype('U1') works as expected (IOW it coerces to object), but I think we need to either raise on S dtypes in PY3 or just coerce as we do for unicode, though the user is technically saying that they want to encode.

cchrysostomou commented 8 years ago

Well, I was kind of hoping that this dtype could be supported. When it's represented as an object, the memory it takes up is extremely high, when all I need is for each cell to take up a single byte. Everything except for printing seemed to work for me. Is there any workaround for this?
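For reference, the one-byte-per-cell layout asked about here is what a plain NumPy array gives with the S1 dtype, outside of pandas. A minimal sketch, assuming the same random data as the report (the fixed seed and the names arr_s1 / arr_u1 are additions for illustration):

```python
import random

import numpy as np

random.seed(0)  # assumption: seeded only so the example is repeatable
lets = 'ACDEFGHIJKLMNOP'
slen, nseqs = 50, 1000
words = [[random.choice(lets) for _ in range(slen)] for _ in range(nseqs)]

# Fixed-width bytes: one byte per cell.
arr_s1 = np.array(words, dtype='S1')
# Fixed-width unicode: four bytes per cell.
arr_u1 = np.array(words, dtype='U1')

print(arr_s1.dtype.itemsize, arr_s1.nbytes)
print(arr_u1.dtype.itemsize, arr_u1.nbytes)
```

This only shows the storage cost of the raw arrays; it says nothing about what pandas does with these dtypes after astype, which is the subject of this issue.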

jreback commented 8 years ago

you are much better off using categoricals

# your frame
In [17]: df.memory_usage(deep=True).sum()
Out[17]: 2300072

In [18]: uniques = np.sort(pd.unique(df.values.ravel()))

# converted to categoricals (I happen to preserve the mappings, but it's actually not necessary)
In [19]: df.apply(lambda x: x.astype('category',categories=uniques)).memory_usage(deep=True).sum()
Out[19]: 84572
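
The categories= keyword to astype used above is, as far as I know, no longer accepted by recent pandas releases. A rough equivalent of the same conversion using pd.CategoricalDtype, assuming a frame built like the one in the report (the seed and the names cat_dtype / df_cat are illustrative):

```python
import random

import numpy as np
import pandas as pd

random.seed(0)  # assumption: seeded only so the example is repeatable
lets = 'ACDEFGHIJKLMNOP'
words = [[random.choice(lets) for _ in range(50)] for _ in range(1000)]
df = pd.DataFrame(words)

# Collect the distinct letters once so every column shares the same categories.
uniques = np.sort(pd.unique(df.to_numpy().ravel()))
cat_dtype = pd.CategoricalDtype(categories=uniques)

# Convert every column to the shared categorical dtype.
df_cat = df.astype(cat_dtype)

obj_bytes = df.memory_usage(deep=True).sum()
cat_bytes = df_cat.memory_usage(deep=True).sum()
print(obj_bytes, cat_bytes)
```

The exact byte counts depend on the pandas version and platform, so they will not necessarily match the numbers in the session above.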

cchrysostomou commented 8 years ago

OK I can go that route, but now I am having some functionality issues. Some things that worked before, no longer work when I set it as a category. If you don't think this is pertinent to the issue, then should I just send you a personal message of what I am trying to do and some sample code?

#  set my frame as category
uniques = np.sort(pd.unique(df.values.ravel()))
df = df.apply(lambda x: x.astype('category', categories=uniques))

# slicing and search operations
df_ints = pd.DataFrame(np.zeros((10000, 500)))
df_ints[5,3] = 1
# when df is a category, I cannot do the following
df[df_ints==0] = 'Z'  
# this also raises an error
df_ints == 'A'

jreback commented 8 years ago

Categoricals only accept values from a fixed set, IOW the categories themselves. You can add 'Z' to the categories:

In [75]: df2 = df.apply(lambda x: x.astype('category', categories=uniques.tolist() + ['Z']))

In [77]: df2.iloc[0,1] = 'Z'
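
The same point on a small, made-up Series: a value outside the categories is rejected, and it is accepted once the category has been added. This sketch uses Series.cat.add_categories rather than the astype call above, and the data is hypothetical:

```python
import pandas as pd

cat = pd.Categorical(list('ACDA'))

# 'Z' is not one of the categories, so assignment is rejected.
# (The exception type has differed between pandas versions, hence both.)
try:
    cat[0] = 'Z'
    rejected = False
except (TypeError, ValueError) as err:
    rejected = True
    print('rejected:', err)

# Add 'Z' to the allowed categories first, then assign.
s = pd.Series(cat).cat.add_categories(['Z'])
s.iloc[1] = 'Z'
print(s.tolist())
print(list(s.cat.categories))
```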

cchrysostomou commented 8 years ago

Whoops that was a bad example, my mistake. What I was trying to show was that I cannot use the dataframe df_ints to change values:

df[df_ints==0] = 'A'  # where 'A' is already defined in the set
df[df=='A']  # or find where df is 'A'

jreback commented 8 years ago

Hmm, that should work, see #12861. Well, good of you to test this out! In the meantime you can do .astype('U1') to save some memory (or of course pull requests to fix issues are always welcome!)
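
For the mask-based assignment discussed above, one possible workaround that avoids frame-level setitem is to apply Series.mask column by column. A small sketch on made-up data, assuming 'A' is already one of the categories and that df_ints has the same shape and labels as df (the name out is illustrative):

```python
import numpy as np
import pandas as pd

cat_dtype = pd.CategoricalDtype(categories=list('ACD'))
df = pd.DataFrame({0: list('CDC'), 1: list('DDC')}).astype(cat_dtype)

# Integer frame with the same shape and labels as df; one non-zero cell.
df_ints = pd.DataFrame(np.zeros(df.shape, dtype=int))
df_ints.iloc[1, 0] = 1
mask = df_ints == 0

# Replace values with 'A' wherever the mask is True, one column at a time.
out = pd.DataFrame({col: df[col].mask(mask[col], 'A') for col in df.columns})
print(out)
```

Whether the direct df[df_ints == 0] = 'A' form works depends on the pandas version, so the column-wise form is only a fallback.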

mroeschke commented 5 years ago

This looks fixed on master. Could use a test.

topper-123 commented 5 years ago

Removing the p2/p3 compat label, as Python2 is being dropped and this issue still needs tests.

TomAugspurger commented 4 years ago

This was fixed by https://github.com/pandas-dev/pandas/pull/30327 (https://github.com/pandas-dev/pandas/commit/ccbe7be367e970dcbcd526f6b883b9db20979638 specifically I think).