selik / xport

Python reader and writer for SAS XPORT data transport files.
MIT License

Unnecessary type conversions when executing xport.v56.dump #93

Open bolrDK opened 2 years ago

bolrDK commented 2 years ago

Python version: 3.9.12 xport version: 3.6.1

When I execute xport.v56.dump with a dataframe in which all character variables have dtype 'string', every one of these columns is converted from string to string. Each conversion is reported as a warning, one per variable: "Converting column column name from string to string".

I have wondered why these unnecessary conversions are made (I commented on it in issue #70). I have now had time to dig into the code and found the reason. In v56.py, lines 648-659, it is decided which columns need to be converted:

    dtype_kind_conversions = {
        'O': 'string',
        'b': 'float',
        'i': 'float',
    }
    dtypes = self.dtypes.to_dict()
    conversions = {}
    for column, dtype in dtypes.items():
        try:
            conversions[column] = dtype_kind_conversions[dtype.kind]
        except KeyError:
            continue

The problem is that the dtype.kind is 'O' even though the data is string. I tried to change the code in this way:

    for column, dtype in dtypes.items():
        try:
            if dtype_kind_conversions[dtype.kind] != dtype:
                conversions[column] = dtype_kind_conversions[dtype.kind]
        except KeyError:
            continue

I.e., I have added the line

    if dtype_kind_conversions[dtype.kind] != dtype:

And then it writes data to the XPT files with no conversions of the string variables.
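For anyone who wants to reproduce this outside of xport, here is a minimal standalone sketch. It is not the library's code: `needed_conversions` is a hypothetical helper that reuses the mapping quoted above, and it swaps the `!=` comparison for an explicit `isinstance` check against `pd.StringDtype`, which should have the same effect for string columns. The example dataframe and its column names are made up for illustration.

```python
import pandas as pd

# Same mapping as in the snippet quoted from v56.py.
DTYPE_KIND_CONVERSIONS = {'O': 'string', 'b': 'float', 'i': 'float'}


def needed_conversions(df):
    """Hypothetical helper: map column name -> target dtype, skipping no-ops."""
    conversions = {}
    for column, dtype in df.dtypes.to_dict().items():
        target = DTYPE_KIND_CONVERSIONS.get(dtype.kind)
        if target is None:
            continue  # kind not in the mapping (e.g. 'f'), nothing to convert
        if target == 'string' and isinstance(dtype, pd.StringDtype):
            continue  # already the pandas string dtype, conversion is a no-op
        conversions[column] = target
    return conversions


# Made-up example data: one column per case of interest.
df = pd.DataFrame({
    'name': pd.Series(['a', 'b'], dtype='string'),   # pandas string dtype
    'code': pd.Series(['x', 'y'], dtype=object),     # plain object dtype
    'count': pd.Series([1, 2], dtype='int64'),
    'value': pd.Series([1.5, 2.5], dtype='float64'),
})

# The string column reports the same kind as the object column,
# which is why a lookup on dtype.kind alone cannot tell them apart.
print(df['name'].dtype.kind, df['code'].dtype.kind)
print(needed_conversions(df))
```

With this check, only the object and integer columns should be listed for conversion, and the column that is already 'string' is left alone.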

It would be nice if the current unnecessary type conversions could be eliminated in the future.

kiranmohite6004 commented 1 year ago

I agree. I opened a request to clarify my questions here: https://github.com/selik/xport/issues/106. But I still wonder why it is assumed that the data will be of only these 3 types. I have come across scenarios where this module does not support automatic conversion for datetime64[ns]. Let me know if there is any way to get a better understanding of this. The documentation seems to be limited. bolrDK