selik / xport

Python reader and writer for SAS XPORT data transport files.
MIT License

TypeError: data type "string" not understood #41

Status: Closed (NicolasDupuis closed this issue 4 years ago)

NicolasDupuis commented 4 years ago

Hello,

Thanks for maintaining this package, it's quite helpful.

I'm trying to run it and, while typing exactly what's in the help section, I'm getting a strange error message. I'm pretty sure it used to work. I'm using version 3.1.3 (from Anaconda).


import pandas
import xport
import xport.v56

df = pandas.DataFrame({
    'alpha': [10, 20, 30],
    'beta': ['x', 'y', 'z'],
})

...  # Analysis work ...

ds = xport.Dataset(df, name='DATA', label='Wonderful data')
for k, v in ds.items():
    v.label = k               # Use the column name as SAS label
    v.name = k.upper()[:8]    # SAS names are limited to 8 chars
    if v.dtype == 'object':
        v.format = '$CHAR20.' # Variables will parse SAS formats
    else:
        v.format = '10.2'

library = xport.Library({'DATA': ds})
# Libraries can have multiple datasets.

with open('example.xpt', 'wb') as f:
    xport.v56.dump(library, f)

Getting this log in Jupyter:


Converting column 'alpha' from int64 to float
Converting column 'beta' from object to string
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\common.py in pandas_dtype(dtype)

TypeError: data type "string" not understood

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
C:\ProgramData\Anaconda3\lib\site-packages\xport\v56.py in __bytes__(self)
    613                 try:
--> 614                     self[column] = self[column].astype(dtype)
    615                 except Exception:

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors, **kwargs)
   5690             # GH 24704: use iloc to handle duplicate column names
-> 5691             results = [
   5692                 self.iloc[:, i].astype(dtype, copy=copy)

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\managers.py in astype(self, dtype, **kwargs)
    530                     for b in blocks
--> 531                 ]
    532 

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\managers.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
    394 
--> 395         self._consolidate_inplace()
    396 

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\blocks.py in astype(self, dtype, copy, errors, values, **kwargs)
    533             return self.make_block(nv)
--> 534 
    535         # ndim > 1

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\blocks.py in _astype(self, dtype, copy, errors, values, **kwargs)
    594             return self.make_block(Categorical(self.values, dtype=dtype))
--> 595 
    596         dtype = pandas_dtype(dtype)

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\common.py in pandas_dtype(dtype)

TypeError: data type 'string' not understood

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-43-8d39bacd8d51> in <module>
     23 
     24 with open('example.xpt', 'wb') as f:
---> 25     xport.v56.dump(library, f)

C:\ProgramData\Anaconda3\lib\site-packages\xport\v56.py in dump(library, fp)
    905 
    906     """
--> 907     fp.write(dumps(library))
    908 
    909 

C:\ProgramData\Anaconda3\lib\site-packages\xport\v56.py in dumps(library)
    924 
    925     """
--> 926     return bytes(Library(library))

C:\ProgramData\Anaconda3\lib\site-packages\xport\v56.py in __bytes__(self)
    704             b'created': strftime(self.created if self.created else datetime.now()),
    705             b'modified': strftime(self.modified if self.modified else datetime.now()),
--> 706             b'members': b''.join(bytes(Member(member)) for member in self.values()),
    707         }
    708 

C:\ProgramData\Anaconda3\lib\site-packages\xport\v56.py in <genexpr>(.0)
    704             b'created': strftime(self.created if self.created else datetime.now()),
    705             b'modified': strftime(self.modified if self.modified else datetime.now()),
--> 706             b'members': b''.join(bytes(Member(member)) for member in self.values()),
    707         }
    708 

C:\ProgramData\Anaconda3\lib\site-packages\xport\v56.py in __bytes__(self)
    614                     self[column] = self[column].astype(dtype)
    615                 except Exception:
--> 616                     raise TypeError(f'Could not coerce column {column!r} to {dtype}')
    617         header = bytes(MemberHeader.from_dataset(self))
    618         observations = bytes(Observations.from_dataset(self))

TypeError: Could not coerce column 'beta' to string

Any idea what's causing this?

thanks a lot,

Kind regards, Nicolas

selik commented 4 years ago

@NicolasDupuis Mind telling me what version of Pandas you're using?

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '1.0.3'

The 'string' dtype is available from Pandas >= 1.0.

I tried to enforce the Pandas version, but pip may just write a warning instead of upgrading Pandas when you install xport. https://github.com/selik/xport/blob/f894c01b6c6ba2b060f6f31c214508bd093e671a/setup.cfg#L43
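
A quick way to confirm whether the running session's Pandas understands the 'string' dtype is to attempt the same cast that fails in the traceback. This is only a minimal sketch, not part of xport; `supports_string_dtype` is a hypothetical helper name used for illustration.

```python
import pandas as pd


def supports_string_dtype() -> bool:
    """Return True if this Pandas build understands the 'string' dtype.

    Hypothetical helper, not part of xport or Pandas.
    """
    try:
        # Same kind of cast xport attempts when writing text columns.
        pd.Series(['x', 'y', 'z']).astype('string')
    except TypeError:
        # Pandas < 1.0 raises: data type 'string' not understood
        return False
    return True


print('pandas version:', pd.__version__)
print("understands 'string' dtype:", supports_string_dtype())
```

If this prints `False`, the session is importing a Pandas older than 1.0, whatever version the package manager reports.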

NicolasDupuis commented 4 years ago

hi,

Sure, sorry, I'm using pandas 1.0.3.

thanks!

selik commented 4 years ago

This is mysterious. Pandas v1.0.3 should understand 'string' dtype, yet it's giving you TypeError: data type 'string' not understood. I couldn't reproduce the error except by downgrading Pandas below v1.

NicolasDupuis commented 4 years ago

Hi. Well, I have no idea what happened, but now it works. I first tried in Spyder, thinking the problem might be caused by Jupyter, and it ran without error. Then I tried again in Jupyter and it worked there too. I didn't update anything. I'm using Anaconda, so maybe something happened behind the scenes, I dunno. Quite weird. Anyway, thanks for your time. Bye.

selik commented 4 years ago

@NicolasDupuis no worries. Sorry you had the frustration. Dependency management is a pain. If I'm working in an IDE like Spyder, I sometimes get confused between my terminal's activated Conda environment and the IDE's selected Conda environment.
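
One way to spot that kind of mismatch is to run the same few lines in both the terminal and the IDE or notebook, then compare the output. The sketch below uses only the standard library (Python 3.8+); `describe_package` is a hypothetical helper name, and the package list is just the two packages from this thread.

```python
import importlib.util
import sys
from importlib import metadata


def describe_package(name):
    """Report whether a package is importable here, and from where.

    Hypothetical helper for illustration only.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return {'name': name, 'installed': False}
    try:
        version = metadata.version(name)
    except metadata.PackageNotFoundError:
        # Importable, but no distribution metadata (e.g. a stdlib module).
        version = None
    return {
        'name': name,
        'installed': True,
        'version': version,
        'location': spec.origin,
    }


# The interpreter path shows which environment is actually running.
print('interpreter:', sys.executable)
for pkg in ('pandas', 'xport'):
    print(describe_package(pkg))
```

If the interpreter path or the Pandas location differs between the two sessions, they are running in different environments, which could explain an error that appears in one and not the other.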