Closed benjello closed 8 years ago
can you post a link to the format? and see if any converters have been writtenin python?
obviously the idea would be to read the native format file
so looks like simple binary read/write stuff....could be done....only other question is there any license issue with doing this?
not sure. @benjello want to contact SAS and ask them?
That technote is describing the XPORT format, which isn't the native binary format. I've never seen a published layout of the native format. Some people have partially reverse engineered it, but I've never seen a solution that could handle data sets with compression.
@mtkni thanks had no idea.
aside from using SAS to actually export (e.g. csv or whatever), is there aformat that one could save that provides some interoperbility (and is openish)?
a 10 minute search suggests no but maybe someone else knows more.
I am not a specialist much more a potential user in heavy need of such a tool. For now I have used alternatively StatTransfer which is not a free software or when importing to R I used one of the method exposed here but you need to have sas installed. I am sorry for not being able to provide ypu for more information than above.
probably could do
I don't think 1) is a good idea
export data in stata to xport format or csv format
ok. just throwing it out there. i don't like calling out to other programs either, but this seems like it's going to be tough. i can implement the R code above...if u think that's a good idea...but it basically forces users to use that particular version of the format and if it ever changes we won't know until it breaks.
i wouldn't be able to test to_sas though since i don't have sas
I would be glad to test everything that would do the job. I have sas.
I've spent some time on this in the past. These are my thoughts:
1) Reverse engineering the binary data set format is a difficult task and not a good priority for this project. Any solution, like the R solution, that doesn't deal with compressed data sets won't do me any good. I can't speak for others on that. I also worry about the licensing issues. 2) Given that, the only option for read_sas may be to use the sas executable to first export SAS to a format that can be read by Python. 3) The SAS transport format, as described in the tech note, is an option but it has some quirks. In particular, the native XPORT engine doesn't handle long variable names. It really hasn't evolved since SAS 6 (we're on SAS 9 now). The SAS workaround for long variable names is a set of macros that not everyone has installed. 4) I'm not convinced the XPORT format is a better option than just CSV. If it's dramatically faster it might be. I'll study that next week. 5) If the mechanism for reading SAS data sets turns out to be export to CSV and then using read_csv, then there's not much code to write that isn't SAS environment dependent. It just doesn't make for a good shared module.
As much as I wish there was a good solution to this, and as much as I'd be willing to help build it, I just don't think there is. I've built read_sas using CSV as an intermediate format. Obviously, this requires a SAS license. It takes very few lines to implement, but most of those lines are specific to our SAS environment and are not well-portable.
I will study the performance of XPORT vs CSV next week. If it's dramatically faster, then it may be worth the effort to implement. Even then, I'm not sure it's worth taking that on as part of the Pandas project. I would be interested in comments from other SAS users on that.
Just my two cents.
nice to hear from someone who tried to do this. FWIW i think it might be tough to beat CSV for speed, most of it is written C/Cython.
Agreed. The new, fast CSV was a game changer.
can sas export to HDF5?
No, it can't.
export in STATA format?
No, and if it did it would be an expensive add-on module. SAS is pretty good at reading from databases (http://www.sas.com/resources/factsheet/sas-access-factsheet.pdf), although each database platform is a separate license. I haven't found it good at all at writing to databases (it can, but it's slow). Other than that, interoperability doesn't seem to be part of their business model.
Oh wait, I may have spoken to soon. Apparently I can export to a stata file: http://support.sas.com/documentation/cdl/en/acpcref/63184/HTML/default/viewer.htm#a003102702.htm
Is STATA supported in Pandas? It would still requires a SAS license, but I can benchmark that versus CSV.
There is a read_stata that will be available in te coming version but already available on github
BTW, @mtkni I would be happy to look at the read_sas you implemented if you would share it ...
Just to close the loop on this, exporting to STATA requires an add-on for which I'm not licensed, so I can't benchmark it.
FYI, the XPT or transport format is a non-proprietary format that has no licensing issues and is the only format currently accepted by the Food and Drug Administration (FDA) for clinical trial data. Most pharmaceutical companies submit XPT format to the FDA. It would be nice to have a way to read these files in just like a csv file.
@dramage1 Is "XPT" the same as "XPORT" above?
Heyo - there's at least one Python package for reading XPT files - https://pypi.python.org/pypi/xport/0.1.0
Just what I needed, much appreciated.
@dramage1 if you use this enough to want to write up a pandas wrapper for it, that could be a useful addition to pandas (depending on the stability of xport)
@jtratner I worked on the xport
library before. I can refactor xport
to give a better API for use in a pandas
read_xpt
or borrow some code to include directly in pandas
.
@dramage1 let me know if the xport
library is confusing or broken. I'll try to improve the docs and/or code.
As I said over email - I'm glad that you're interested in working on this. Feel free to ping me if you have any pandas-related issues.
@benjello @selik any action on this?
Not yet... check back in a couple weeks :-0
I am sorry but I won't be qualified enough but I am willing to test any code
@benjello Could you give me a few test cases? I don't have SAS available to me. I'd like to get just a few tiny test files to make some unit tests.
@selik you are going to use the xport
soln?
@jseabold you have thoughts on this?
Not really anything to add beyond what's here. I'm sure it will be useful if XPT format is used places (yikes that's a terrible data policy re: FDA). I've been lucky enough to avoid SAS beyond coursework which required it.
@jreback It makes sense to refactor the xport
library to make it friendly as a dependency for a pandas.read_xpt()
.
@selik yes....prob best to simply incorporate it directly (with the licensening references / included) - see what we did with msgpack. Then you can modify and not introduce a dep.
I am not a license expert...but I think that the MIT license is compat with pandas BDS 3 clause
(you basically just copy the LICENSE to the LICENSES dir) and are good 2 go
@jseabold Regarding the FDA data format. They are beginning to realize that it is time to move forwar dand have a XML format pilot project proposal http://goo.gl/1xNiv8. The SAS transport (XPT) format is not going away anytime soon - the FDA moves at a snails pace implementing changes, so I glad to see you are working on this. Unfortunately, I am a complete newbie at python and can't help much. I could provide some sample data in XPT format if someone can explain how to upload it to GIT.
@dramage1 You can email me the files if you'd like. I'm looking for a Rosetta stone for XPT and CSV. Or XPT and some other plain-text format. I think my email address is in my profile.
@selik Mike, I tried mike@selik.org and got an undeliverable message.
@dramage1 That's not good. I wonder who else is having trouble emailing me. Mind sharing your email in your profile?
I updated my profile
Even though it would be slower, would it be worthwhile to add pyodbc-based support for SAS?
@spearsem It wouldn't necessarily be slower if SAS has some secret awesome algorithm for reading XPT files. That's how R reads XPT. But if you already have SAS, the best thing to do is read the file in SAS and save as CSV, not to read it directly from Python.
BTW, I'm slowly moving along with xport
. I think I'll have code ready for inclusion in pandas
by end of April.
It would be really convenient to be able to at least import SAS tables into pandas dataframe. Is this planned ? Are they insurmountable issues ?
Thanks