pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
42.64k stars 17.58k forks source link

BUG: read_sas(..) interprets some zeros as 5.397605e-79 #50670

Open briandamaged opened 1 year ago

briandamaged commented 1 year ago

Pandas version checks

Reproducible Example

# Unfortunately, `urllib.request.urlretrieve(..)` is producing a 403 response
# when I attempt to automate the file download. So, you will need to manually
# download and unzip the data file from the following URL:
#
#   https://www.federalreserve.gov/econres/files/scf2019x.zip
#

import pandas as pd

# "./scf2019x" is the file that was extracted from "scf2019x.zip"
everything = pd.read_sas("./scf2019x", format = 'xport')
df = everything[['X3509']]

# Confirm that `0.0` never occurs
print(sum(df['X3509'] == 0.0))

# Demonstrate that `5.397605e-79` has several occurrences
df[df['X3509'] < 0.0001]

Issue Description

(Note: issue was originally reported in #30051 )

Occasionally, pandas.read_sas(..) interprets some "zero values" as 5.397605e-79.

This issue can also be observed in the 2019 Survey of Consumer Finances when using pandas==1.5.2:

https://www.federalreserve.gov/econres/files/scf2019x.zip

After you unzip the file, you can recreate the issue as follows:

import pandas as pd

# "./scf2019x" is the file that was extracted from "scf2019x.zip"
everything = pd.read_sas("./scf2019x", format = 'xport')
df = everything[['X3509']]

# Confirm that `0.0` never occurs
print(sum(df['X3509'] == 0.0))

# Demonstrate that `5.397605e-79` has several occurrences
df[df['X3509'] < 0.0001]

This produces the following output:

>>> df[df['X3509'] < 0.0001]
              X3509
0      5.397605e-79
1      5.397605e-79
2      5.397605e-79
3      5.397605e-79
4      5.397605e-79
...             ...
28865  5.397605e-79
28866  5.397605e-79
28867  5.397605e-79
28868  5.397605e-79
28869  5.397605e-79

[14252 rows x 1 columns]

According to the survey's documentation, these values were all intended to be equal to 0 .

Expected Behavior

These values should all be equal to 0.0 .

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 66e3805b8cabe977f40c05259cc3fcf7ead5687d
python           : 3.8.10.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.4.0-136-generic
Version          : #153-Ubuntu SMP Thu Nov 24 15:56:58 UTC 2022
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.3.5
numpy            : 1.24.1
pytz             : 2022.7
dateutil         : 2.8.2
pip              : 20.0.2
setuptools       : 44.0.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None
briandamaged commented 1 year ago

FYI: Just confirmed that the issue is present in main as well.

briandamaged commented 1 year ago

I have a feeling that the issue is introduced at one of the following locations:

I'll hopefully have some time to poke around at it later today. In the meantime, here's a quick comparison of the differing "zero values":

>>> fake_zero
5.397605346934028e-79
>>> fake_zero.tobytes()
b'\x00\x00\x00\x00\x00\x00\xb0/'

>>> real_zero
0.0
>>> real_zero.tobytes()
b'\x00\x00\x00\x00\x00\x00\x00\x00'