Closed by refp16 10 years ago
cc @bashtage
Will try and look into this weekend. I don't have Stata 13 so I can't easily generate any data files, which makes debugging hard. Ideally there would be data files for the same dta files that are in the test data.
@refp16 where do you see this in the whatsnew?
i think he's referring to https://github.com/pydata/pandas/issues/4291 and #4662
@jreback Just search for "GH4291" in http://pandas.pydata.org/pandas-docs/stable/whatsnew.html.
This may be clear from the error message, but it looks like it happens with a Stata 13 data file when there is a string column:
___ ____ ____ ____ ____ (R)
/__ / ____/ / ____/
___/ / /___/ / /___/ 13.0 Copyright 1985-2013 StataCorp LP
Statistics/Data Analysis StataCorp
4905 Lakeway Drive
MP - Parallel Edition College Station, Texas 77845 USA
800-STATA-PC http://www.stata.com
979-696-4600 stata@stata.com
979-696-4601 (fax)
5-user 6-core Stata network perpetual license:
. insheet using foo.csv
(3 vars, 5 obs)
. list
+---------------+
| x1 x2 x3 |
|---------------|
1. | 1 5 'a' |
2. | 2 4 'b' |
3. | 3 3 'c' |
4. | 4 2 'd' |
5. | 5 1 'e' |
+---------------+
. save all.dta, replace
(note: file all.dta not found)
file all.dta saved
. drop x3
. save just_numeric.dta, replace
(note: file just_numeric.dta not found)
file just_numeric.dta saved
Now reading them into pandas 0.14:
In [6]: df = pd.read_stata('just_numeric.dta')
In [7]: df
Out[7]:
x1 x2
0 1 5
1 2 4
2 3 3
3 4 2
4 5 1
In [8]: df = pd.read_stata('all.dta')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-8-8c472969f1f6> in <module>()
----> 1 df = pd.read_stata('all.dta')
/usr/lib64/python2.7/site-packages/pandas/io/stata.pyc in read_stata(filepath_or_buffer, convert_dates, convert_categoricals, encoding, index)
45 identifier of column that should be used as index of the DataFrame
46 """
---> 47 reader = StataReader(filepath_or_buffer, encoding)
48
49 return reader.data(convert_dates, convert_categoricals, index)
/usr/lib64/python2.7/site-packages/pandas/io/stata.pyc in __init__(self, path_or_buf, encoding)
455 self.path_or_buf = path_or_buf
456
--> 457 self._read_header()
458
459 def _read_header(self):
/usr/lib64/python2.7/site-packages/pandas/io/stata.pyc in _read_header(self)
657
658 """Calculate size of a data record."""
--> 659 self.col_sizes = lmap(lambda x: self._calcsize(x), self.typlist)
660
661 def _calcsize(self, fmt):
/usr/lib64/python2.7/site-packages/pandas/io/stata.pyc in <lambda>(x)
657
658 """Calculate size of a data record."""
--> 659 self.col_sizes = lmap(lambda x: self._calcsize(x), self.typlist)
660
661 def _calcsize(self, fmt):
/usr/lib64/python2.7/site-packages/pandas/io/stata.pyc in _calcsize(self, fmt)
661 def _calcsize(self, fmt):
662 return (type(fmt) is int and fmt
--> 663 or struct.calcsize(self.byteorder + fmt))
664
665 def _col_size(self, k=None):
TypeError: cannot concatenate 'str' and 'NoneType' objects
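The last frame of the traceback suggests that one entry of typlist was None by the time it reached struct.calcsize. Below is a minimal sketch of that failure mode in plain Python. It assumes a hypothetical lookup table (OLD_TYPE_MAP) that does not recognise the type code stored for the string column. The names are illustrative and are not the actual pandas internals, and the exact message text differs under Python 3.

```python
import struct

# Hypothetical lookup from on-disk type codes to struct format characters.
# It only knows a few numeric codes, so any other code maps to None.
OLD_TYPE_MAP = {251: "b", 252: "h", 253: "l", 254: "f", 255: "d"}


def calcsize(byteorder, fmt):
    """Mirror the shape of _calcsize in the traceback: ints pass through,
    anything else is treated as a struct format string."""
    if isinstance(fmt, int):
        return fmt
    return struct.calcsize(byteorder + fmt)


# Two numeric columns resolve; the third code (1, illustrative) does not.
typlist = [OLD_TYPE_MAP.get(code) for code in (251, 251, 1)]

sizes = []
for fmt in typlist:
    try:
        sizes.append(calcsize("<", fmt))
    except TypeError as exc:
        # "<" + None fails before struct is ever consulted.
        sizes.append(type(exc).__name__)

print(sizes)
```

If that reading is right, the numeric-only file works because every type code resolves to a format character, while a single unresolved code is enough to abort the header read.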
Can you post the problematic dta somewhere? Dropbox link, or email me?
@bashtage Try here:
Those are two files. One problematic (with string variable), the other one fine.
Thanks, got it.
Just a note:
Long strings (or strL) were introduced in Stata 13. But notice that the string variable does not have to be an strL type (holds up to 2000000000 bytes) to generate the problem. In my previous comment the problematic file holds a simple str1 variable (holds 1 byte).
This should go through. It provides a fix so that all existing test files, saved in Stata 13 format, pass. The strL data type is explicitly not supported, and an error informing users is raised.
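For readers following along, here is a rough sketch of what handling Stata 13 type codes with an explicit strL rejection could look like. It is not the code from the fix. The type-code values are my understanding of the dta format 117 layout and should be checked against the format description, and column_width_117 is a hypothetical helper name.

```python
import struct

# Type codes as I understand them from the dta format 117 layout
# (assumption; verify against the format description before relying on them).
NUMERIC_TYPES_117 = {
    65526: "d",  # double
    65527: "f",  # float
    65528: "l",  # long
    65529: "h",  # int
    65530: "b",  # byte
}
STRL_117 = 32768        # long string (strL)
MAX_STR_LEN_117 = 2045  # str1 .. str2045 use their length as the code


def column_width_117(type_code, byteorder="<"):
    """Return the on-disk width in bytes of one column, or raise for strL."""
    if type_code in NUMERIC_TYPES_117:
        return struct.calcsize(byteorder + NUMERIC_TYPES_117[type_code])
    if 1 <= type_code <= MAX_STR_LEN_117:
        # Fixed-width strings: the code is the width itself.
        return type_code
    if type_code == STRL_117:
        raise ValueError("strL columns are not supported")
    raise ValueError("unknown type code: %d" % type_code)


# Two byte columns and one str1 column, as in the all.dta example above.
widths = [column_width_117(code) for code in (65530, 65530, 1)]
print(widths)
```

The point of raising on strL, rather than returning None, is that the user sees a message naming the unsupported type instead of a TypeError from string concatenation.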
@jreback any idea about this failure on Travis? This issue appears to be unrelated to any changes I made.
@bashtage just fixed that, rebase on master and you should be good
pandas v0.14.0 (May 31, 2014) seems incapable of importing Stata 13 datasets, although according to http://pandas.pydata.org/pandas-docs/stable/whatsnew.html it should be able to. Stata 12 files can be imported without problems.
The output of running this
follows:
The dataset myauto.dta is just the auto dataset made available by running sysuse auto within Stata.
The problem was originally documented here: http://stackoverflow.com/questions/24053652/pandas-and-stata-13-files.
My Python is set up with Enthought Canopy 1.4.0 (64 bit).