pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.43k stars 17.85k forks source link

ENH: read_sas, to_sas #4052

Closed benjello closed 8 years ago

benjello commented 11 years ago

It would be really convenient to be able to at least import SAS tables into pandas dataframe. Is this planned ? Are they insurmountable issues ?

Thanks

jreback commented 11 years ago

can you post a link to the format? and see if any converters have been writtenin python?

obviously the idea would be to read the native format file

cpcloud commented 11 years ago

http://support.sas.com/techsup/technote/ts140.html

jreback commented 11 years ago

so looks like simple binary read/write stuff....could be done....only other question is there any license issue with doing this?

cpcloud commented 11 years ago

not sure. @benjello want to contact SAS and ask them?

miketkelly commented 11 years ago

That technote is describing the XPORT format, which isn't the native binary format. I've never seen a published layout of the native format. Some people have partially reverse engineered it, but I've never seen a solution that could handle data sets with compression.

https://github.com/BioStatMatt/sas7bdat

cpcloud commented 11 years ago

@mtkni thanks had no idea.

jreback commented 11 years ago

aside from using SAS to actually export (e.g. csv or whatever), is there aformat that one could save that provides some interoperbility (and is openish)?

cpcloud commented 11 years ago

a 10 minute search suggests no but maybe someone else knows more.

benjello commented 11 years ago

I am not a specialist much more a potential user in heavy need of such a tool. For now I have used alternatively StatTransfer which is not a free software or when importing to R I used one of the method exposed here but you need to have sas installed. I am sorry for not being able to provide ypu for more information than above.

cpcloud commented 11 years ago

probably could do

  1. call to sas if exists
  2. if doesn't exist can only read xport so try to read that (someone would need to implement the xport reader)
  3. if not 1 or 2 then fail saying u need sas to read sas binary files.
jreback commented 11 years ago

I don't think 1) is a good idea

export data in stata to xport format or csv format

cpcloud commented 11 years ago

ok. just throwing it out there. i don't like calling out to other programs either, but this seems like it's going to be tough. i can implement the R code above...if u think that's a good idea...but it basically forces users to use that particular version of the format and if it ever changes we won't know until it breaks.

cpcloud commented 11 years ago

i wouldn't be able to test to_sas though since i don't have sas

benjello commented 11 years ago

I would be glad to test everything that would do the job. I have sas.

miketkelly commented 11 years ago

I've spent some time on this in the past. These are my thoughts:

1) Reverse engineering the binary data set format is a difficult task and not a good priority for this project. Any solution, like the R solution, that doesn't deal with compressed data sets won't do me any good. I can't speak for others on that. I also worry about the licensing issues. 2) Given that, the only option for read_sas may be to use the sas executable to first export SAS to a format that can be read by Python. 3) The SAS transport format, as described in the tech note, is an option but it has some quirks. In particular, the native XPORT engine doesn't handle long variable names. It really hasn't evolved since SAS 6 (we're on SAS 9 now). The SAS workaround for long variable names is a set of macros that not everyone has installed. 4) I'm not convinced the XPORT format is a better option than just CSV. If it's dramatically faster it might be. I'll study that next week. 5) If the mechanism for reading SAS data sets turns out to be export to CSV and then using read_csv, then there's not much code to write that isn't SAS environment dependent. It just doesn't make for a good shared module.

As much as I wish there was a good solution to this, and as much as I'd be willing to help build it, I just don't think there is. I've built read_sas using CSV as an intermediate format. Obviously, this requires a SAS license. It takes very few lines to implement, but most of those lines are specific to our SAS environment and are not well-portable.

I will study the performance of XPORT vs CSV next week. If it's dramatically faster, then it may be worth the effort to implement. Even then, I'm not sure it's worth taking that on as part of the Pandas project. I would be interested in comments from other SAS users on that.

Just my two cents.

cpcloud commented 11 years ago

nice to hear from someone who tried to do this. FWIW i think it might be tough to beat CSV for speed, most of it is written C/Cython.

miketkelly commented 11 years ago

Agreed. The new, fast CSV was a game changer.

jreback commented 11 years ago

can sas export to HDF5?

miketkelly commented 11 years ago

No, it can't.

jreback commented 11 years ago

export in STATA format?

miketkelly commented 11 years ago

No, and if it did it would be an expensive add-on module. SAS is pretty good at reading from databases (http://www.sas.com/resources/factsheet/sas-access-factsheet.pdf), although each database platform is a separate license. I haven't found it good at all at writing to databases (it can, but it's slow). Other than that, interoperability doesn't seem to be part of their business model.

miketkelly commented 11 years ago

Oh wait, I may have spoken to soon. Apparently I can export to a stata file: http://support.sas.com/documentation/cdl/en/acpcref/63184/HTML/default/viewer.htm#a003102702.htm

Is STATA supported in Pandas? It would still requires a SAS license, but I can benchmark that versus CSV.

benjello commented 11 years ago

There is a read_stata that will be available in te coming version but already available on github

benjello commented 11 years ago

BTW, @mtkni I would be happy to look at the read_sas you implemented if you would share it ...

miketkelly commented 11 years ago

Just to close the loop on this, exporting to STATA requires an add-on for which I'm not licensed, so I can't benchmark it.

dramage1 commented 10 years ago

FYI, the XPT or transport format is a non-proprietary format that has no licensing issues and is the only format currently accepted by the Food and Drug Administration (FDA) for clinical trial data. Most pharmaceutical companies submit XPT format to the FDA. It would be nice to have a way to read these files in just like a csv file.

jtratner commented 10 years ago

@dramage1 Is "XPT" the same as "XPORT" above?

jtratner commented 10 years ago

Heyo - there's at least one Python package for reading XPT files - https://pypi.python.org/pypi/xport/0.1.0

dramage1 commented 10 years ago

Just what I needed, much appreciated.

jtratner commented 10 years ago

@dramage1 if you use this enough to want to write up a pandas wrapper for it, that could be a useful addition to pandas (depending on the stability of xport)

selik commented 10 years ago

@jtratner I worked on the xport library before. I can refactor xport to give a better API for use in a pandas read_xpt or borrow some code to include directly in pandas.

selik commented 10 years ago

@dramage1 let me know if the xport library is confusing or broken. I'll try to improve the docs and/or code.

jtratner commented 10 years ago

As I said over email - I'm glad that you're interested in working on this. Feel free to ping me if you have any pandas-related issues.

jreback commented 10 years ago

@benjello @selik any action on this?

selik commented 10 years ago

Not yet... check back in a couple weeks :-0

benjello commented 10 years ago

I am sorry but I won't be qualified enough but I am willing to test any code

selik commented 10 years ago

@benjello Could you give me a few test cases? I don't have SAS available to me. I'd like to get just a few tiny test files to make some unit tests.

jreback commented 10 years ago

@selik you are going to use the xport soln?

jreback commented 10 years ago

@jseabold you have thoughts on this?

jseabold commented 10 years ago

Not really anything to add beyond what's here. I'm sure it will be useful if XPT format is used places (yikes that's a terrible data policy re: FDA). I've been lucky enough to avoid SAS beyond coursework which required it.

selik commented 10 years ago

@jreback It makes sense to refactor the xport library to make it friendly as a dependency for a pandas.read_xpt().

jreback commented 10 years ago

@selik yes....prob best to simply incorporate it directly (with the licensening references / included) - see what we did with msgpack. Then you can modify and not introduce a dep.

I am not a license expert...but I think that the MIT license is compat with pandas BDS 3 clause

(you basically just copy the LICENSE to the LICENSES dir) and are good 2 go

dramage1 commented 10 years ago

@jseabold Regarding the FDA data format. They are beginning to realize that it is time to move forwar dand have a XML format pilot project proposal http://goo.gl/1xNiv8. The SAS transport (XPT) format is not going away anytime soon - the FDA moves at a snails pace implementing changes, so I glad to see you are working on this. Unfortunately, I am a complete newbie at python and can't help much. I could provide some sample data in XPT format if someone can explain how to upload it to GIT.

selik commented 10 years ago

@dramage1 You can email me the files if you'd like. I'm looking for a Rosetta stone for XPT and CSV. Or XPT and some other plain-text format. I think my email address is in my profile.

dramage1 commented 10 years ago

@selik Mike, I tried mike@selik.org and got an undeliverable message.

selik commented 10 years ago

@dramage1 That's not good. I wonder who else is having trouble emailing me. Mind sharing your email in your profile?

dramage1 commented 10 years ago

I updated my profile

spearsem commented 10 years ago

Even though it would be slower, would it be worthwhile to add pyodbc-based support for SAS?

selik commented 10 years ago

@spearsem It wouldn't necessarily be slower if SAS has some secret awesome algorithm for reading XPT files. That's how R reads XPT. But if you already have SAS, the best thing to do is read the file in SAS and save as CSV, not to read it directly from Python.

selik commented 10 years ago

BTW, I'm slowly moving along with xport. I think I'll have code ready for inclusion in pandas by end of April.