Closed sterlinm closed 1 year ago
A Stata .dta file with zero rows still has type information, but when you try to read an empty .dta file using pd.read_stata, all of the columns have object type.
I think this occurs for other file types as well. I tried it for .csv and .xlsx file types and the same thing occurred.
The second part of this bug did not occur in these file types. So I am guessing it's an issue with the method
take
Could you check your expected output? I think you meant that the dtype should be something other than object?
CSV files do not have dtype information, so this is expected. Not sure what we would want to do with Excel files though.
I'm not sure if Excel files have any implicit type but I don't think so. I could check and see what happens with parquet and SAS files. I think both of those still have type information even if there are no observations.
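As a rough illustration of the point above that CSV files carry no dtype information, the following sketch round-trips an empty, typed DataFrame through in-memory CSV text. The column names and values are made up for the illustration and are not taken from the original report.

```python
import io

import pandas as pd

# Build a typed frame, then keep only its header (zero rows).
df = pd.DataFrame({"a": pd.Series([1, 2, 3], dtype="int32"), "b": [1.0, 2.0, 3.0]})
empty = df.head(0)

# Round-trip the empty frame through CSV text held in memory.
buffer = io.StringIO()
empty.to_csv(buffer, index=False)
buffer.seek(0)
roundtrip = pd.read_csv(buffer)

print(empty.dtypes)      # the in-memory empty frame still knows its dtypes
print(roundtrip.dtypes)  # the CSV header only carries column names
```

With no rows to infer from, read_csv has nothing to recover int32 or float64 from, which is why the same loss is a bug only for formats such as .dta that store types in the file itself.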
@sterlinm you wrote in the OP:
Expected Behavior
In the above example df2.dtypes should return:
In [2]: df2.dtypes
Out[2]:
a object
b object
dtype: object
You meant:
a int32
dtype: object
@simonjayhawkins You're right, I mixed it up because I was highlighting two separate issues with reading empty files:
1. pd.read_stata ignores the columns parameter when reading an empty file.
2. pd.read_stata loses dtype information when reading an empty file.
Here's an updated example:
import numpy as np
import pandas as pd
from pandas.io.stata import StataReader
# create a DataFrame with int32 and float64 dtypes
df = pd.DataFrame(data={"a": range(3), "b": [1.0, 2.0, 3.0]})
# assign the column directly so the int32 dtype is kept (setting via .loc may keep int64 in newer pandas)
df['a'] = df['a'].astype('int32')
df_empty = df.head(0)
# write the empty and non-empty DataFrames to .dta files
df.to_stata('nonempty.dta', write_index=False, version=117)
df_empty.to_stata('empty.dta', write_index=False, version=117)
# column variables
expected_cols = pd.Index(['a'])
all_cols = df.columns
# reading one column of non-empty .dta file works
assert pd.read_stata('nonempty.dta', columns=["a"]).columns.equals(expected_cols)
# reading one column of empty .dta file does not work
assert pd.read_stata('empty.dta', columns=["a"]).columns.equals(all_cols)
assert pd.read_stata('empty.dta', columns=["xyz"]).columns.equals(all_cols) # should raise error
# reading non-empty .dta file retains correct dtypes
assert pd.read_stata('nonempty.dta').dtypes.equals(df.dtypes)
# reading empty .dta file makes all the columns object columns
assert (pd.read_stata('empty.dta').dtypes == 'object').all()
# we can confirm that the empty .dta file does retain the type information
expected_dtyplist = [np.dtype('int32'), np.dtype('float64')]
assert StataReader('nonempty.dta').dtyplist == expected_dtyplist
assert StataReader('empty.dta').dtyplist == expected_dtyplist
In the above example, pd.read_stata('empty.dta').dtypes should return:
In [2]: df2.dtypes
Out[2]:
a int32
b float64
dtype: object
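Until the reader handles empty files, one possible stopgap is to re-apply the column selection and dtypes after reading. The sketch below is only an illustration: restore_empty_frame is a hypothetical helper, not part of pandas, and it assumes the caller already knows the intended dtypes (for example from a non-empty file with the same layout). It uses a hand-built empty object-dtype frame as a stand-in for what the thread reports pd.read_stata('empty.dta') returning.

```python
import pandas as pd


def restore_empty_frame(df, dtypes, columns=None):
    """Hypothetical helper: re-apply column selection and dtypes to a frame."""
    if columns is not None:
        # Mirror the expectation in the thread that unknown columns raise.
        missing = [name for name in columns if name not in df.columns]
        if missing:
            raise ValueError(f"columns not found: {missing}")
        df = df[list(columns)]
    if len(df) == 0:
        # Only touch zero-row frames; non-empty reads already keep their dtypes.
        wanted = {name: dtype for name, dtype in dtypes.items() if name in df.columns}
        df = df.astype(wanted)
    return df


# Stand-in for the reported result of reading the empty file:
# zero rows, every column with object dtype.
raw = pd.DataFrame(
    {"a": pd.Series([], dtype="object"), "b": pd.Series([], dtype="object")}
)

# Dtypes supplied by the caller (an assumption), matching the example above.
known = {"a": "int32", "b": "float64"}

fixed = restore_empty_frame(raw, known)
only_a = restore_empty_frame(raw, known, columns=["a"])
print(fixed.dtypes)
print(only_a.columns)
```

This works around both symptoms on the caller's side; it does not change what pd.read_stata itself returns.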
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[x] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
A Stata .dta file with zero rows still has type information, but when you try to read an empty .dta file using pd.read_stata, all of the columns have object dtype. It will also ignore the columns parameter and read all of the columns.
Expected Behavior
In the above example df2.dtypes should return:
Installed Versions
Apologies, pd.show_versions() fails for some reason. I've included it, but the pandas version is 1.4.1.