selik / xport

Python reader and writer for SAS XPORT data transport files.
MIT License

RecursionError while reading an NHANES file #39

Closed sammosummo closed 4 years ago

sammosummo commented 4 years ago

Installed in a fresh conda environment:

# Name                    Version                   Build  Channel
ca-certificates           2020.1.1                      0
certifi                   2020.4.5.1               py38_0
click                     7.1.1                    pypi_0    pypi
libcxx                    4.0.1                hcfea43d_1
libcxxabi                 4.0.1                hcfea43d_1
libedit                   3.1.20181209         hb402a30_0
libffi                    3.2.1                h0a44026_6
ncurses                   6.2                  h0a44026_0
numpy                     1.18.3                   pypi_0    pypi
openssl                   1.1.1g               h1de35cc_0
pandas                    1.0.3                    pypi_0    pypi
pip                       20.0.2                   py38_1
python                    3.8.2                hc70fcce_0
python-dateutil           2.8.1                    pypi_0    pypi
pytz                      2019.3                   pypi_0    pypi
pyyaml                    5.3.1                    pypi_0    pypi
readline                  8.0                  h1de35cc_0
setuptools                46.1.3                   py38_0
six                       1.14.0                   pypi_0    pypi
sqlite                    3.31.1               h5c1f38d_1
tk                        8.6.8                ha441bb4_0
wheel                     0.34.2                   py38_0
xport                     3.1.2                    pypi_0    pypi
xz                        5.2.5                h1de35cc_0
zlib                      1.2.11               h1de35cc_3

Tried to convert one file to another via `xport file1.xpt > file2.csv`.

Got an enormous error traceback, ending with:

  File "/Users/samuelrobertmathias/miniconda3/envs/xport/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 141, in __init__
    self._consolidate_check()
  File "/Users/samuelrobertmathias/miniconda3/envs/xport/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 656, in _consolidate_check
    ftypes = [blk.ftype for blk in self.blocks]
  File "/Users/samuelrobertmathias/miniconda3/envs/xport/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 656, in <listcomp>
    ftypes = [blk.ftype for blk in self.blocks]
  File "/Users/samuelrobertmathias/miniconda3/envs/xport/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 349, in ftype
    return f"{dtype}:{self._ftype}"
  File "/Users/samuelrobertmathias/miniconda3/envs/xport/lib/python3.8/site-packages/numpy/core/_dtype.py", line 54, in __str__
    return dtype.name
  File "/Users/samuelrobertmathias/miniconda3/envs/xport/lib/python3.8/site-packages/numpy/core/_dtype.py", line 347, in _name_get
    if _name_includes_bit_suffix(dtype):
  File "/Users/samuelrobertmathias/miniconda3/envs/xport/lib/python3.8/site-packages/numpy/core/_dtype.py", line 326, in _name_includes_bit_suffix
    elif np.issubdtype(dtype, np.flexible) and _isunsized(dtype):
  File "/Users/samuelrobertmathias/miniconda3/envs/xport/lib/python3.8/site-packages/numpy/core/numerictypes.py", line 392, in issubdtype
    if not issubclass_(arg1, generic):
RecursionError: maximum recursion depth exceeded

selik commented 4 years ago

Rats. I'll take a look this evening.

selik commented 4 years ago

@sammosummo looks like a problem with that particular file, rather than the code dependencies. Are you able to share it?

sammosummo commented 4 years ago

That makes sense. Not at liberty to distribute them, but there are several here.

selik commented 4 years ago

I tried the top one alphabetically, Acculturation / ACQ_J, and couldn't reproduce the issue. Could you point me to one that failed?

While testing, I fixed a different issue (#40), so reporting this was helpful already.

bunk1978 commented 4 years ago

I've been playing with some of the files here: https://github.com/phuse-org/phuse-scripts/tree/master/data/send (./PointCross/lb.xpt shows the problem described below).

I was trying to generate some larger files for testing by multiplying the records from an existing file and writing them to a new XPT. I'm also hitting the recursion-depth error when dumping. Here is my code:

    with open(inputFile,'rb') as inFile:
        for dataset in library.items():
            library=xport.Library({dataset[0]:dataset[1]})
            with open(outputFile,'wb') as outFile:
                xport.v56.dump(library,outFile)

(just a test, taking the datasets and outputting them back into another file)

Results in the error:

Traceback (most recent call last):
  File "gen-big-xpt.py", line 89, in <module>
    xport.v56.dump(library,outFile)
  File "/usr/local/lib/python3.7/site-packages/xport/v56.py", line 907, in dump
    fp.write(dumps(library))
  File "/usr/local/lib/python3.7/site-packages/xport/v56.py", line 926, in dumps
    return bytes(Library(library))
  File "/usr/local/lib/python3.7/site-packages/xport/v56.py", line 706, in __bytes__
    b'members': b''.join(bytes(Member(member)) for member in self.values()),
  File "/usr/local/lib/python3.7/site-packages/xport/v56.py", line 706, in <genexpr>
    b'members': b''.join(bytes(Member(member)) for member in self.values()),
  File "/usr/local/lib/python3.7/site-packages/xport/__init__.py", line 470, in __init__
    self.copy_metadata(data)

.... the stack is huge....

    data = self._format_data()
  File "/usr/local/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 938, in _format_data
    self, self._formatter_func, is_justify=is_justify, name=name
  File "/usr/local/lib/python3.7/site-packages/pandas/io/formats/printing.py", line 318, in format_object_summary
    display_width, _ = get_console_size()
  File "/usr/local/lib/python3.7/site-packages/pandas/io/formats/console.py", line 16, in get_console_size
    display_width = get_option("display.width")
  File "/usr/local/lib/python3.7/site-packages/pandas/_config/config.py", line 231, in __call__
    return self.__func__(*args, **kwds)
  File "/usr/local/lib/python3.7/site-packages/pandas/_config/config.py", line 102, in _get_option
    key = _get_single_key(pat, silent)
RecursionError: maximum recursion depth exceeded

bunk1978 commented 4 years ago

Hitting the same (or a similar) recursion-depth error with the following:

    import pandas as pd
    import xport
    import xport.v56

    datasets={}
    for i in range(1,10):
        values1=[]
        values2=[]
        for j in range(1,10000000):
            values1.append(j)
            values2.append('values'+str(i))
        df = pd.DataFrame({
            'alpha'+str(i): values1,
            'beta'+str(i): values2,
        })

        ds = xport.Dataset(df, name='DATA'+str(i), label='Wonderful data '+str(i))
        for k, v in ds.items():
            v.label = k               # Use the column name as SAS label
            v.name = k.upper()[:8]    # SAS names are limited to 8 chars
            if v.dtype == 'object':
                v.format = '$CHAR20.' # Variables will parse SAS formats
            else:
                v.format = '10.2'
        datasets['DATA'+str(i)] = ds

    library = xport.Library(datasets)

    # Libraries can have multiple datasets.
    with open('example.xpt', 'wb') as f:
        xport.v56.dump(library, f)

But this works fine:

    import pandas as pd
    import xport
    import xport.v56

    datasets={}
    for i in range(1,10):
        values1=[]
        values2=[]
        for j in range(1,10):
            values1.append(j)
            values2.append('values'+str(i))
        df = pd.DataFrame({
            'alpha'+str(i): values1,
            'beta'+str(i): values2,
        })

        ds = xport.Dataset(df, name='DATA'+str(i), label='Wonderful data '+str(i))
        for k, v in ds.items():
            v.label = k               # Use the column name as SAS label
            v.name = k.upper()[:8]    # SAS names are limited to 8 chars
            if v.dtype == 'object':
                v.format = '$CHAR20.' # Variables will parse SAS formats
            else:
                v.format = '10.2'
        datasets['DATA'+str(i)] = ds

    library = xport.Library(datasets)

    # Libraries can have multiple datasets.
    with open('example.xpt', 'wb') as f:
        xport.v56.dump(library, f)

After some quick testing, the cutoff seems to lie between `for j in range(1,61):` (works) and `for j in range(1,62):` (fails).
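For anyone trying to locate a size cutoff like this one, a binary search saves a lot of manual reruns. The following is a minimal sketch under stated assumptions: `smallest_failing_size` and `fake_dump_fails` are hypothetical names, and the stand-in predicate only mimics the behavior reported above (60 rows fine, 61 rows failing). In practice the predicate would build a dataset with that many rows, attempt the dump, and return True if a RecursionError is raised.

```python
def smallest_failing_size(fails, lo, hi):
    """Return the smallest n in (lo, hi] for which fails(n) is True.

    Assumes fails(lo) is False, fails(hi) is True, and that once a size
    fails, every larger size fails too.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid   # failure at mid: the cutoff is at or below mid
        else:
            lo = mid   # success at mid: the cutoff is above mid
    return hi


def fake_dump_fails(n_rows):
    # Hypothetical stand-in for "build n_rows rows, dump, catch RecursionError".
    # It encodes only the observation from this thread: more than 60 rows fails.
    return n_rows > 60


cutoff = smallest_failing_size(fake_dump_fails, 1, 10_000_000)
print(cutoff)
```

With the stand-in predicate, the search needs about two dozen probes instead of a manual scan over millions of candidate sizes.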

bunk1978 commented 4 years ago

Ignore my last two comments. I just found this:

    def copy_metadata(self, other):
        """
        Copy metadata from another Variable.
        """
        LOG.debug(f'Copying metadata from {other}') # BUG: Causes infinite recursion!

Commenting out that line or pulling your latest fixes my problem.
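To illustrate why a single logging line can recurse forever, here is a minimal sketch using only the standard library. It is not xport's actual code: the `Variable` class below is a hypothetical stand-in, and it assumes (consistent with the traceback above, where `__init__` calls `copy_metadata` while a repr is being formatted) that producing the string form of an object constructs a copy, which in turn copies metadata. Because an f-string is formatted eagerly, the object is converted to a string even when DEBUG logging is switched off.

```python
import logging

LOG = logging.getLogger("recursion_sketch")


class Variable:
    """Hypothetical stand-in for a class whose repr builds a copy of itself."""

    def __init__(self, name, other=None, log_eagerly=True):
        self.name = name
        self.log_eagerly = log_eagerly
        if other is not None:
            self.copy_metadata(other)

    def copy_metadata(self, other):
        if self.log_eagerly:
            # The f-string calls str(other) right away, whatever the log level.
            LOG.debug(f"Copying metadata from {other}")
        self.name = other.name

    def __repr__(self):
        # Assumption for illustration: formatting creates a copy, and the
        # copy's constructor calls copy_metadata, which formats again.
        clone = Variable(self.name, other=self, log_eagerly=self.log_eagerly)
        return f"Variable({clone.name!r})"


def triggers_recursion(log_eagerly):
    """Return True if repr() of a Variable raises RecursionError."""
    v = Variable("x", log_eagerly=log_eagerly)
    try:
        repr(v)
    except RecursionError:
        return True
    return False
```

In this sketch, dropping the eager log call (the equivalent of commenting out the line) is enough to break the cycle.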

selik commented 4 years ago

@bunk1978 Did I mess up the PyPI upload? It looks like it's synchronized with GitHub master.

selik commented 4 years ago

Looks like I did, in fact, accidentally leave that recursion bug in there. Fixed.

bunk1978 commented 4 years ago

Thanks! Sorry I didn't respond faster.

meain commented 3 years ago

Just a heads up for anyone who ends up here: make sure that after you trim column names to 8 characters with something like the line below, the names are still unique. Otherwise you might hit a RecursionError: maximum recursion depth exceeded.

ds = ds.rename(columns={k: k.upper()[:8] for k in ds})
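A quick way to check for such collisions before renaming is to group the original names by their truncated form. This is a minimal sketch using only the standard library; `find_truncation_collisions` and the sample column names are hypothetical and not part of xport.

```python
from collections import defaultdict


def find_truncation_collisions(columns, width=8):
    """Map each truncated, upper-cased name to the originals that collide on it."""
    groups = defaultdict(list)
    for col in columns:
        groups[col.upper()[:width]].append(col)
    # Keep only the truncated names shared by more than one original column.
    return {short: names for short, names in groups.items() if len(names) > 1}


columns = ["systolic_bp_1", "systolic_bp_2", "age"]
collisions = find_truncation_collisions(columns)
if collisions:
    print("Column names are not unique after truncation:", collisions)
```

If the returned dictionary is non-empty, the affected columns need distinct short names before the rename is applied.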

selik commented 3 years ago

@meain Doh! That's fun. Do you mind opening a new issue and pasting a traceback in there?

meain commented 3 years ago

@selik opened https://github.com/selik/xport/issues/61