selik / xport

Python reader and writer for SAS XPORT data transport files.
MIT License
49 stars 24 forks source link

having issues while setting metadata #49

Closed alfred-b-chan closed 2 years ago

alfred-b-chan commented 4 years ago

Code extract ds = xport.Dataset(table_df, name='Data', label='Test data') for column, variable in ds.items(): if (a_condition): v.format="10.2"

with open('my_file.xpt', 'wb') as f: xport.v56.dump(ds,f)

That would give me an error when opening my_file.xpt. And a_condition is true when column is numeric.

Please assist and can you please have more examples on the export.

selik commented 4 years ago

@alfred-b-chan I'll take a look this weekend.

selik commented 4 years ago

Hi @alfred-b-chan, I couldn't reproduce your error. Could you paste the error message you saw here?

alfred-b-chan commented 4 years ago

Format is exposed as property. But the following way of setting will break. For instance, once file is generated, I cannot open from SAS system Viewer 9.1.

for column, variable in xport_dataset.items():

variable.label = 'my label'  # this works 

#target_type = column_metadata['target_type'].item()
target_length = 10
#
# if target_type.strip().lower() == "char":
#     variable_format = f'${target_length}.'
# else:
#     variable_format = f'{target_length}.'
# variable.format = variable_format   # this does NOT work 

object.__setattr__(variable, 'width', 10)  # this works

Also can you please expose all metadata as properties such as vtype, width, _format.

Thanks

selik commented 4 years ago

vtype, width, _format.

This module currently supports (or attempts to support) everything from SAS's published file format description for XPORT v5. For variable metadata, you can see a brief explanation of what the file format can contain in the code comments: https://github.com/selik/xport/blob/04aca9360707a120c9ed410b4fe081b046c61375/src/xport/v56.py#L51

I'm guessing by "vtype" you mean the variable type. That can only be either Numeric or Character, so I infer that from the Python data type. Does that not work for you?

You can set "width" by the length property. Or do you mean something other than the field length?

By "_format" do you mean the format or informat? Both of those are supported.

I cannot open from SAS system Viewer 9.1

Can you tell me the error message?

Unfortunately, I use a Mac and can't run SAS System Viewer. I know JMP provides a free 30-day trial, and it looks like I could create XPT files with JMP, so I could do some debugging with it for a short time.

What would really help is an example of a valid XPT file that has the format you'd like.

alfred-b-chan commented 4 years ago

thanks.. i end up converting the dataframe datatype before i call the xport.Dataset constructor.... btw, how to handle null? for instance, my column in dataframe is numeric and some rows are null.. i convert them to None

df[my_numeric_col] = df[my_numeric_col].where(pd.notnull(df[my_numeric_col]), None) # this works

When I call the xport.Dataset constructor, it seems those rows with None ends up being a dash instead of a period in SAS System viewer.

It seems a period is supposed to be a null in SAS System Viewer.. so what should i set in order to show a period (which indicates a null) in SAS System viewer? thanks

image

selik commented 4 years ago

You should use NaN instead of None for numeric nulls. I'm surprised you were even able to get a None in there, because Pandas usually converts those to NaN for a float column.

np.nan

alfred-b-chan commented 4 years ago

when debugging it, i could see the column 'empty' and it showed NaN (I use np.NaN)... but once i open the file, it shows a dash!.. any clue? or i have to use Nan from other library? obviously once the xport.Dataset is called with my dataframe data, some manipulations took place and my np.NaN row wouuld show up as a dash from UI.

Debugging b4 xport.Dataset is invoked image

After dataset is invoked and this is image

Output as seen in SAS System Viewer image

selik commented 4 years ago

Not a clue. Pandas uses NumPy's NaN value for floats, as far as I know. You shouldn't need to use a value from a different library. Someday soon I'll clear my task list and get a JMP trial to investigate these things. I originally took on this library to read XPT files. I was surprised to find many people want to write XPT format as well.

alfred-b-chan commented 4 years ago

I got it! (v56.py) image

That's a bug on line 839..... .it is supposed to be a period but it's now a underscore..

as a matter of fact, once i update to a period, everything works fine (ie showing period for null value)

selik commented 4 years ago

Awesome. Thanks! Want to send a pull request? Please update the appropriate test, if so.

selik commented 4 years ago

XPT-format and IEEE support so many NaN possible values, but Python / NumPy only uses 1 value. Judging from the IBM-to-IEEE code, it looks like a leading underscore is a valid NaN value. You're saying that the leading period is preferable in the display?

https://github.com/selik/xport/blob/04aca9360707a120c9ed410b4fe081b046c61375/src/xport/v56.py#L790

We could look through the discussion in the TS140 document for details.

https://support.sas.com/techsup/technote/ts140.pdf

alfred-b-chan commented 4 years ago

this is in compliance with sas standard. if it's missing (or null), it would show a period.

https://communities.sas.com/t5/SAS-Communities-Library/How-to-select-MISSING-SPACE-and-NULL-values-in-SAS-Customer/ta-p/242234

or can u pls expose a property? so going forward user of this library has the flexibility to choose what to populate when null

selik commented 4 years ago

Sounds good to me. The section in the XPORT format specification document lists ._ first, but it makes sense that a plain . representation is the default.

Missing values are written out with the first byte (the exponent) indicating
the proper missing values. All subsequent bytes are 0x00. The first byte is:

type byte
 ._  0x5f
 .   0x2e
 .A  0x41
 .B  0x42
 ....
 .Z  0x5a 
selik commented 4 years ago

flexibility to choose what to populate when null

That's going to be trickier, as it'll complicate the API. We could add a parameter to dump/dumps, but that'd change all NaNs. Since XPT allows 28 possible NaN values, each could be different. How would we represent them in Python, before writing out XPT?

alfred-b-chan commented 4 years ago

hmm.. then pls let me know once the fix is up... it's fine as long as it's not an underscore but a period

LECARROU commented 2 years ago

Hi, did you were able to fix this issue?

selik commented 2 years ago

@LECARROU I'll be working on this project over the winter holidays

selik commented 2 years ago

@alfred-b-chan @LECARROU I believe this . vs _ issue is fixed in the latest version on PyPI. I'm also adding the SAS "Special Missing Values" feature to allow writing ._, .A, .B, ..., .Z.