pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Unable to import Stata 13 database files with read_stata() #7360

Closed refp16 closed 10 years ago

refp16 commented 10 years ago

pandas v0.14.0 (May 31, 2014) seems unable to import Stata 13 datasets, although according to http://pandas.pydata.org/pandas-docs/stable/whatsnew.html it should. Stata 12 files can be imported without problems.

The output of running this

import pandas
pandas.show_versions()
dta = pandas.io.stata.read_stata('D:\\Datos\\rferrer\\Desktop\\myauto.dta')

follows:

%run D:/Datos/RFERRER/Desktop/import_stata13.py

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 45 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.14.0
nose: 1.3.0
Cython: 0.19.2
numpy: 1.8.0
scipy: 0.14.0
statsmodels: 0.5.0
IPython: 1.2.1
sphinx: 1.2.2
patsy: 0.2.0
scikits.timeseries: 0.91.3
dateutil: 2.2
pytz: 2013.8
bottleneck: None
tables: 2.4.0
numexpr: 2.2.2
matplotlib: 1.3.1
openpyxl: 1.8.5
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: None
lxml: 3.2.3
bs4: None
html5lib: 0.95-dev
bq: None
apiclient: None
rpy2: None
sqlalchemy: 0.8.3
pymysql: None
psycopg2: None
C:\Users\rferrer\AppData\Local\Enthought\Canopy\User\lib\site-packages\openpyxl\__init__.py:31: UserWarning: The installed version of lxml is too old to be used with openpyxl
  warnings.warn("The installed version of lxml is too old to be used with openpyxl")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
C:\Users\rferrer\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.4.0.1938.win-x86_64\lib\site-packages\IPython\utils\py3compat.pyc in execfile(fname, glob, loc)
    195             else:
    196                 filename = fname
--> 197             exec compile(scripttext, filename, 'exec') in glob, loc
    198     else:
    199         def execfile(fname, *where):

D:\Datos\RFERRER\Desktop\import_stata13.py in <module>()
      3 pandas.show_versions()
      4 
----> 5 dta = pandas.io.stata.read_stata('D:\\Datos\\rferrer\\Desktop\\myauto.dta')

C:\Users\rferrer\AppData\Local\Enthought\Canopy\User\lib\site-packages\pandas\io\stata.pyc in read_stata(filepath_or_buffer, convert_dates, convert_categoricals, encoding, index)
     45         identifier of column that should be used as index of the DataFrame
     46     """
---> 47     reader = StataReader(filepath_or_buffer, encoding)
     48 
     49     return reader.data(convert_dates, convert_categoricals, index)

C:\Users\rferrer\AppData\Local\Enthought\Canopy\User\lib\site-packages\pandas\io\stata.pyc in __init__(self, path_or_buf, encoding)
    455             self.path_or_buf = path_or_buf
    456 
--> 457         self._read_header()
    458 
    459     def _read_header(self):

C:\Users\rferrer\AppData\Local\Enthought\Canopy\User\lib\site-packages\pandas\io\stata.pyc in _read_header(self)
    657 
    658         """Calculate size of a data record."""
--> 659         self.col_sizes = lmap(lambda x: self._calcsize(x), self.typlist)
    660 
    661     def _calcsize(self, fmt):

C:\Users\rferrer\AppData\Local\Enthought\Canopy\User\lib\site-packages\pandas\io\stata.pyc in <lambda>(x)
    657 
    658         """Calculate size of a data record."""
--> 659         self.col_sizes = lmap(lambda x: self._calcsize(x), self.typlist)
    660 
    661     def _calcsize(self, fmt):

C:\Users\rferrer\AppData\Local\Enthought\Canopy\User\lib\site-packages\pandas\io\stata.pyc in _calcsize(self, fmt)
    661     def _calcsize(self, fmt):
    662         return (type(fmt) is int and fmt
--> 663                 or struct.calcsize(self.byteorder + fmt))
    664 
    665     def _col_size(self, k=None):

TypeError: cannot concatenate 'str' and 'NoneType' objects
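
For readers following the traceback, the failing expression can be reproduced outside pandas. The sketch below is a standalone copy of the _calcsize expression shown above, not the pandas source itself. That the offending typlist entry is None is an inference from the error message, and the exact wording of the TypeError differs between Python 2 and Python 3.

```python
import struct


def calcsize(byteorder, fmt):
    # Same shape as the expression in the traceback: integer entries are
    # returned as-is, anything else is treated as a struct format character.
    return (type(fmt) is int and fmt) or struct.calcsize(byteorder + fmt)


print(calcsize("<", "d"))  # 8: a little-endian double
print(calcsize("<", 7))    # 7: an integer entry is its own width

try:
    # Assumption: a typlist entry that was never mapped ends up as None.
    calcsize("<", None)
except TypeError as exc:
    print("TypeError:", exc)
```

If that inference is right, the traceback points at an unmapped column type rather than at the byte order.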

The dataset myauto.dta is just the auto dataset, which is made available by running sysuse auto within Stata.

The problem is originally documented here: http://stackoverflow.com/questions/24053652/pandas-and-stata-13-files.

My Python is set up with Enthought Canopy 1.4.0 (64 bit).
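
To confirm which dta format a given file uses, the first bytes of the file can be inspected. This is a minimal sketch, assuming the layout as I understand it: format 117 files (written by Stata 13) open with an XML-like header that contains the release number, while older formats store the release number in the first byte. The function name dta_release is hypothetical.

```python
import re


def dta_release(head):
    """Return the dta format number, given the first bytes of a file."""
    # Assumed layout of the newer formats: the file starts with
    # <stata_dta><header><release>NNN</release>
    match = re.match(rb"<stata_dta><header><release>(\d{3})</release>", head)
    if match:
        return int(match.group(1))
    if not head:
        raise ValueError("empty header")
    # Assumed layout of the older formats: the first byte is the release.
    return head[0]


print(dta_release(b"<stata_dta><header><release>117</release>"))  # 117
print(dta_release(bytes([115, 2, 1, 0])))                         # 115
```

For a real file, pass in something like the first 64 bytes read in binary mode.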

jreback commented 10 years ago

cc @bashtage

bashtage commented 10 years ago

Will try to look into this over the weekend. I don't have Stata 13, so I can't easily generate any data files, which makes debugging hard. Ideally there would be Stata 13 versions of the same dta files that are in the test data.

jreback commented 10 years ago

@refp16 where do you see this in the whatsnew?

cpcloud commented 10 years ago

I think he's referring to https://github.com/pydata/pandas/issues/4291 and #4662

refp16 commented 10 years ago

@jreback Just search for "GH4291" in http://pandas.pydata.org/pandas-docs/stable/whatsnew.html.

kdiether commented 10 years ago

This may be clear from the error message, but it looks like it happens with a Stata 13 data file when there is a string column:

  ___  ____  ____  ____  ____ (R)
 /__    /   ____/   /   ____/
___/   /   /___/   /   /___/   13.0   Copyright 1985-2013 StataCorp LP
  Statistics/Data Analysis            StataCorp
                                      4905 Lakeway Drive
     MP - Parallel Edition            College Station, Texas 77845 USA
                                      800-STATA-PC        http://www.stata.com
                                      979-696-4600        stata@stata.com
                                      979-696-4601 (fax)

5-user 6-core Stata network perpetual license:

. insheet using foo.csv
(3 vars, 5 obs)

. list

     +---------------+
     | x1   x2    x3 |
     |---------------|
  1. |  1    5   'a' |
  2. |  2    4   'b' |
  3. |  3    3   'c' |
  4. |  4    2   'd' |
  5. |  5    1   'e' |
     +---------------+

. save all.dta, replace
(note: file all.dta not found)
file all.dta saved

. drop x3

. save just_numeric.dta, replace
(note: file just_numeric.dta not found)
file just_numeric.dta saved

Now reading them into pandas 0.14:


In [6]: df = pd.read_stata('just_numeric.dta')

In [7]: df
Out[7]: 
   x1  x2
0   1   5
1   2   4
2   3   3
3   4   2
4   5   1

In [8]: df = pd.read_stata('all.dta')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-8c472969f1f6> in <module>()
----> 1 df = pd.read_stata('all.dta')

/usr/lib64/python2.7/site-packages/pandas/io/stata.pyc in read_stata(filepath_or_buffer, convert_dates, convert_categoricals, encoding, index)
     45         identifier of column that should be used as index of the DataFrame
     46     """
---> 47     reader = StataReader(filepath_or_buffer, encoding)
     48 
     49     return reader.data(convert_dates, convert_categoricals, index)

/usr/lib64/python2.7/site-packages/pandas/io/stata.pyc in __init__(self, path_or_buf, encoding)
    455             self.path_or_buf = path_or_buf
    456 
--> 457         self._read_header()
    458 
    459     def _read_header(self):

/usr/lib64/python2.7/site-packages/pandas/io/stata.pyc in _read_header(self)
    657 
    658         """Calculate size of a data record."""
--> 659         self.col_sizes = lmap(lambda x: self._calcsize(x), self.typlist)
    660 
    661     def _calcsize(self, fmt):

/usr/lib64/python2.7/site-packages/pandas/io/stata.pyc in <lambda>(x)
    657 
    658         """Calculate size of a data record."""
--> 659         self.col_sizes = lmap(lambda x: self._calcsize(x), self.typlist)
    660 
    661     def _calcsize(self, fmt):

/usr/lib64/python2.7/site-packages/pandas/io/stata.pyc in _calcsize(self, fmt)
    661     def _calcsize(self, fmt):
    662         return (type(fmt) is int and fmt
--> 663                 or struct.calcsize(self.byteorder + fmt))
    664 
    665     def _col_size(self, k=None):

TypeError: cannot concatenate 'str' and 'NoneType' objects

bashtage commented 10 years ago

Can you post the problematic dta somewhere? Dropbox link, or email me?

refp16 commented 10 years ago

@bashtage Try here:

http://www.evernote.com/shard/s11/sh/1a28ae6d-512c-4a21-8353-32bb7d5956d3/ab5b783d9711c5e3f800afe5e3ff3a5f

Those are two files. One problematic (with string variable), the other one fine.

bashtage commented 10 years ago

Thanks, got it.

refp16 commented 10 years ago

Just a note:

Long strings (strL) were introduced in Stata 13. Note, however, that the string variable does not have to be of type strL (which holds up to 2000000000 bytes) to trigger the problem. The problematic file from my previous comment holds a simple str1 variable (which holds 1 byte).
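
The difference between str1 and strL can be made concrete with the variable type codes stored in the file. The sketch below assumes the format 117 codes as I understand them (1 to 2045 for fixed-width strings, 32768 for strL, 65526 to 65530 for the numeric types). These values should be checked against Stata's own description of the dta format, and the names used here are hypothetical.

```python
# Assumed format 117 type codes; verify against Stata's dta documentation.
NUMERIC_117 = {
    65526: "double",
    65527: "float",
    65528: "long",
    65529: "int",
    65530: "byte",
}
STRL_117 = 32768


def describe_type_117(code):
    """Map a format 117 type code to a Stata type name."""
    if 1 <= code <= 2045:
        # Fixed-width string: the code is the width in bytes.
        return "str%d" % code
    if code == STRL_117:
        return "strL"
    if code in NUMERIC_117:
        return NUMERIC_117[code]
    raise ValueError("unknown type code: %r" % (code,))


print(describe_type_117(1))      # str1
print(describe_type_117(32768))  # strL
```

Under these assumptions, a str1 column is an ordinary fixed-width string with code 1, so a reader has to handle the whole fixed-width range and not only strL.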

bashtage commented 10 years ago

This should go through. It provides a fix so that all existing test files, in Stata 13 format, pass. The strL data type is explicitly not supported, and an error informing users of this is raised.
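
The idea of rejecting strL columns up front could look roughly like the following. This is only an illustrative sketch and not the code from the actual fix: the function name, the error message, and the strL type code 32768 are assumptions.

```python
STRL_117 = 32768  # assumed format 117 type code for strL


def check_no_strl(varnames, typlist):
    """Raise a descriptive error if any column is a strL."""
    bad = [name for name, code in zip(varnames, typlist) if code == STRL_117]
    if bad:
        raise ValueError("strL columns are not supported: " + ", ".join(bad))


# Two numeric columns and one str1 column pass the check silently.
check_no_strl(["x1", "x2", "x3"], [65530, 65530, 1])

try:
    check_no_strl(["x1", "notes"], [65530, STRL_117])
except ValueError as exc:
    print(exc)  # strL columns are not supported: notes
```

Failing early with the column name gives users something more actionable than a TypeError raised deep inside the header parsing.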

bashtage commented 10 years ago

@jreback any idea about this failure on Travis? It appears to be unrelated to any changes I made.

jreback commented 10 years ago

@bashtage just fixed that, rebase on master and you should be good