selik / xport

Python reader and writer for SAS XPORT data transport files.
MIT License

Support different text encodings #75

Closed gaineleanor closed 2 years ago

gaineleanor commented 2 years ago

Hello selik, could you consider adding text encoding options? Normally it is mainly the variable names, variable labels, and character values that may be in other encodings. I have submitted code for the reading part of the v8 format. I would like to add encoding options, but I am stuck at the map call members=map(MemberV8.from_bytes, chunks),. Please let me know if there is a good way to do this. I will update the v8 code later when I have time. Please help review it. Thanks.
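
One common way to thread an extra option through a map call is functools.partial. The sketch below is only an illustration: MemberV8 here is a stand-in class, and the encoding keyword is hypothetical, not something the xport code is known to accept.

```python
from functools import partial


class MemberV8:
    """Stand-in for the real class; it only illustrates the call pattern."""

    def __init__(self, text):
        self.text = text

    @classmethod
    def from_bytes(cls, bytestring, encoding='ISO-8859-1'):
        # `encoding` is a hypothetical keyword, assumed for this sketch.
        return cls(bytestring.decode(encoding))


chunks = ['标签'.encode('utf-8'), b'label']

# partial() fixes the keyword so map() can still pass one chunk at a time.
members = list(map(partial(MemberV8.from_bytes, encoding='utf-8'), chunks))
```

A generator expression such as (MemberV8.from_bytes(c, encoding='utf-8') for c in chunks) would do the same job without partial.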

gaineleanor commented 2 years ago

I am a Chinese developer and my English is limited, so there may be errors in my wording. Please forgive me.

selik commented 2 years ago

Wow! Thanks for this effort. Coincidentally, I'm working on this right now. I'll have time to review later in November.

gaineleanor commented 2 years ago

Hi @selik, I have submitted code that adds a write function for the XPT v8 format (without the v9 format and informat section).

import pandas as pd
import xport
import xport.v89  # assumes the v8/v9 module from the submitted code is installed

# Read a v8 transport file and print its first dataset.
file = r'C:\Users\admin\Desktop\v8rock.v8xpt'
with open(file, 'rb') as f:
    library = xport.v89.load(f)
cc = next(iter(library.values()))
print(cc)

# Write a dataset that uses long variable names and labels.
df = pd.DataFrame({
    'alphaf7we8f46we1f': [10, 20, 30],
    'beta': ['x', 'y', 'z'],
    'beta323fdfs': ['x', 'y', 'z'],
})
ds = xport.Dataset(df, name='888', label='mydataset')
for k, v in ds.items():
    v.label = k + 'this is a label that'
library = xport.Library({'888': ds})
with open('v8rock.v8xpt', 'wb') as f:
    xport.v89.dump(library, f)

selik commented 2 years ago

I'll respond to this by the end of December.

selik commented 2 years ago

Let's talk about supporting v8/9 in #10.

For text encoding:

Transport files that were created by SAS releases before 9.2 are not stamped with encoding values. ... The encoding of the character data is stamped in transport files that are created using SAS versions 9.2 and later.

https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/proc/p0s4baszwvxumqn1p4cxbvnrpn8p.htm

It's not clear how the text encoding is "stamped" in the file. If we had an example of a SAS Transport file stamped with a text encoding, we could implement it. Without that, I don't know if there's anything better to do than assume ISO-8859-1. You can always encode/decode to recover the original characters.

In [1]: '\N{snowman}'.encode().decode('ISO-8859-1')
Out[1]: 'â\x98\x83'
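
Recovering the original characters is the reverse pair of calls. A minimal sketch, assuming the text was originally UTF-8 and was read back as ISO-8859-1:

```python
original = '\N{SNOWMAN}'

# What a reader sees when UTF-8 bytes are decoded as ISO-8859-1.
garbled = original.encode('utf-8').decode('ISO-8859-1')

# ISO-8859-1 maps every byte to exactly one code point, so re-encoding
# restores the original bytes, which can then be decoded as UTF-8.
recovered = garbled.encode('ISO-8859-1').decode('utf-8')
```

This only works because ISO-8859-1 never rejects a byte; a codec with undefined byte values would not round-trip this way.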

selik commented 2 years ago

Replying to: https://github.com/selik/xport/issues/10#issuecomment-1001299035

@gaineleanor I think a reasonable option would be to have an xport.encoding variable that defaults to 'ISO-8859-1' to stay consistent with current behavior. All the encode/decode calls could look up xport.encoding at runtime. Or, I could add encoding='ISO-8859-1' to every method. That might be "better" design, but a bigger change.
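
The two options can be compared in a small sketch. Everything here is hypothetical: xport_sketch stands in for the package namespace and text_decode is an illustrative helper, not the library's code.

```python
import types

# Stand-in for the package namespace; `encoding` mirrors the proposed variable.
xport_sketch = types.SimpleNamespace(encoding='ISO-8859-1')


def text_decode(bytestring, encoding=None):
    """Decode a space-padded field (hypothetical helper)."""
    # Look up the shared default at call time, not at definition time,
    # so a later assignment to xport_sketch.encoding takes effect.
    if encoding is None:
        encoding = xport_sketch.encoding
    return bytestring.rstrip(b' ').decode(encoding)


raw = '数据'.encode('utf-8').ljust(8)

before = text_decode(raw)                      # uses the ISO-8859-1 default
xport_sketch.encoding = 'utf-8'
after = text_decode(raw)                       # picks up the new default
explicit = text_decode(raw, encoding='utf-8')  # per-call override
```

The module-level default is a small change but is shared global state; the per-call keyword is explicit but has to be threaded through every function that touches text.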

selik commented 2 years ago

We want to allow CP-1252 specifically. #30
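
For context, the two encodings differ in the byte range 0x80 to 0x9F: CP-1252 assigns printable characters there, while ISO-8859-1 maps those bytes to control characters. A short illustration using only Python's built-in codecs:

```python
euro = '\N{EURO SIGN}'

# CP-1252 stores the euro sign as the single byte 0x80.
cp1252_bytes = euro.encode('cp1252')

# Decoding that byte as ISO-8859-1 yields a C1 control character instead.
as_latin1 = cp1252_bytes.decode('ISO-8859-1')

# ISO-8859-1 has no euro sign at all, so encoding fails.
try:
    euro.encode('ISO-8859-1')
    latin1_can_encode = True
except UnicodeEncodeError:
    latin1_can_encode = False
```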

dtboy1995 commented 2 years ago

I have the same issue (Chinese characters).

selik commented 2 years ago

@dtboy1995 Are you trying to write or read XPT format with Chinese characters?

dtboy1995 commented 2 years ago

@selik Thanks for your reply. I am trying to write Chinese characters to a v56 file, and it throws errors.

import pandas as pd
import xport
import xport.v56

df1 = pd.read_csv('input.csv')
df2 = pd.read_csv('input.csv')

ds1 = xport.Dataset(df1, name='SPEC1', sas_os='X64_DS12', sas_version='9.4')
ds2 = xport.Dataset(df2, name='SPEC2', sas_os='X64_DS12', sas_version='9.4')

library = xport.Library({'SPEC1': ds1, 'SPEC2': ds2})

with open('output.xpt', 'wb') as f:
    xport.v56.dump(library, f)

print("done")

gaineleanor commented 2 years ago

@dtboy1995 The encoding is not supported.

selik commented 2 years ago

89 implements a "beta" feature for this. Check it out and please give feedback.

with xport.v56._encoding(data='utf-8', metadata='Windows-1252'):
    bytestring = xport.v56.dumps(library)

with xport.v56._encoding(data='utf-8', metadata='Windows-1252'):
    library = xport.v56.loads(bytestring)
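
For readers curious how a context manager like this can scope an encoding change, here is a generic sketch. It is not the xport implementation: the _settings dict and encoding_override function are made up for illustration.

```python
import contextlib

# Hypothetical stand-in for module-level encoding settings.
_settings = {'data': 'ISO-8859-1', 'metadata': 'ISO-8859-1'}


@contextlib.contextmanager
def encoding_override(data=None, metadata=None):
    """Temporarily change the encodings, restoring them on exit."""
    saved = dict(_settings)
    if data is not None:
        _settings['data'] = data
    if metadata is not None:
        _settings['metadata'] = metadata
    try:
        yield
    finally:
        # Restore even if the body raised an exception.
        _settings.update(saved)


with encoding_override(data='utf-8', metadata='Windows-1252'):
    inside = dict(_settings)
outside = dict(_settings)
```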

leegang commented 12 months ago

89 implements a "beta" feature for this. Check it out and please give feedback.

with xport.v56._encoding(data='utf-8', metadata='Windows-1252'):
    bytestring = xport.v56.dumps(library)

with xport.v56._encoding(data='utf-8', metadata='Windows-1252'):
    library = xport.v56.loads(bytestring)

File "/usr/local/lib/python3.10/site-packages/xport/v56.py", line 1008, in dumps
    return bytes(Library(library))
File "/usr/local/lib/python3.10/site-packages/xport/v56.py", line 757, in __bytes__
    return self._bytes()
File "/usr/local/lib/python3.10/site-packages/xport/v56.py", line 765, in _bytes
    b'members': b''.join(bytes(Member(member)) for member in self.values()),
File "/usr/local/lib/python3.10/site-packages/xport/v56.py", line 765, in <genexpr>
    b'members': b''.join(bytes(Member(member)) for member in self.values()),
File "/usr/local/lib/python3.10/site-packages/xport/v56.py", line 644, in __bytes__
    return self._bytes()
File "/usr/local/lib/python3.10/site-packages/xport/v56.py", line 673, in _bytes
    header = bytes(MemberHeader.from_dataset(self))
File "/usr/local/lib/python3.10/site-packages/xport/v56.py", line 448, in __bytes__
    namestrs = b''.join(bytes(ns) for ns in self.values())
File "/usr/local/lib/python3.10/site-packages/xport/v56.py", line 448, in <genexpr>
    namestrs = b''.join(bytes(ns) for ns in self.values())
File "/usr/local/lib/python3.10/site-packages/xport/v56.py", line 217, in __bytes__
    text_encode(self, 'label', 40),
File "/usr/local/lib/python3.10/site-packages/xport/v56.py", line 777, in text_encode
    bytestring = value.encode(TEXT_METADATA_ENCODING).ljust(n)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
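
The last frame encodes the label and pads it to 40 bytes, and the error names the 'ascii' codec. The sketch below shows what plain Python does with a Chinese label under several codecs. It does not call xport, and the label text is made up.

```python
label = '受试者编号'  # made-up five-character label

results = {}
for codec in ('ascii', 'Windows-1252', 'utf-8', 'gbk'):
    try:
        # Same pattern as the last frame above: encode, then pad to 40 bytes.
        results[codec] = len(label.encode(codec).ljust(40))
    except UnicodeEncodeError:
        # This codec has no representation for these characters.
        results[codec] = None

# Unpadded sizes, to see how much of the 40-byte field each codec uses.
utf8_len = len(label.encode('utf-8'))
gbk_len = len(label.encode('gbk'))
```

Neither ASCII nor Windows-1252 can represent these characters, so a metadata encoding such as UTF-8 or GBK would be needed for Chinese labels. Each of these characters takes three bytes in UTF-8, so a 40-byte field holds at most 13 of them.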

CcLmL commented 8 months ago

89 implements a "beta" feature for this. Check it out and please give feedback.

with xport.v56._encoding(data='utf-8', metadata='Windows-1252'):
    bytestring = xport.v56.dumps(library)

with xport.v56._encoding(data='utf-8', metadata='Windows-1252'):
    library = xport.v56.loads(bytestring)

It worked for me, Thank you!!!