pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.43k stars, 17.85k forks

Read and write spss data format #5768

Closed benjello closed 5 years ago

benjello commented 10 years ago

It would be nice to be able to import SPSS data with read_spss and export it using to_spss.

benjello commented 10 years ago

The following package might be useful https://www.ibm.com/developerworks/community/files/app?lang=en#/person/270003ERMT/file/b77a0da0-2f47-454b-b505-5404b242d78c

ghost commented 10 years ago

Does SPSS offer no export formats compatible with pandas? Or is there relevant data in the proprietary file format that can't otherwise be accessed?

This may still make sense if, for example, users without access to SPSS frequently receive SPSS files (a.k.a. the Microsoft Word problem). Not sure if that's the case, though.

benjello commented 10 years ago

Actually, I do not use SPSS. But I have some SPSS data files that I want to explore. I could work with R, but the data is so huge that this is not possible. The only way I have found is tedious: I have to use PSPP to prepare a subset and then import it into R, with all the back and forth needed to add some variables and to master the SPSS syntax. Since pandas can deal with huge datasets, I do think it should provide import from SPSS. And I am willing to test it.

ghost commented 10 years ago

That was a good enough reason for stata support, yeah.

Note that this particular package weighs in at 90-180MB, includes a large chunk of proprietary (yet free to use) binary code, and the PyPI version doesn't work on 64-bit Linux (though the bug was fixed two months ago on Bitbucket), so its maturity is slightly suspect.

Basic usage doesn't require any explicit pandas cooperation:

import pandas as pd
import savReaderWriter as s
# Each record yielded by SavReader becomes one DataFrame row
df = pd.DataFrame(list(s.SavReader('foo.sav')))

is all it takes.

There's something to be said for pandas accepting data rather than data formats. If there's a package that reads format X and produces data, then pandas implicitly supports format X.
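A minimal sketch of that idea, using only pandas: any reader that yields rows, plus a list of column names, is enough to build a DataFrame. The `fake_reader` generator and the column names below are hypothetical stand-ins for a format-specific reader and are not part of any real package.

```python
import pandas as pd


def fake_reader():
    """Hypothetical stand-in for a format-specific reader that yields rows."""
    yield (1, "a", 3.5)
    yield (2, "b", 4.0)


# Column names would normally come from the file's header or metadata.
columns = ["id", "group", "score"]

# pandas only needs the records and the names, not the file format itself.
df = pd.DataFrame.from_records(list(fake_reader()), columns=columns)
print(df)
```

Materializing the reader with `list(...)` keeps the sketch simple; for very large files a chunked approach would probably be needed.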

Definitely worth a FAQ entry though, I'm sure other users have this need. Care to write some prose and make a PR?

benjello commented 10 years ago

Thank you very much for looking into this problem. I tried to import some .sav data into a pandas DataFrame as you did, but I ended up with the following error on win64: WindowsError: [Error 193] %1 is not a valid Win32 application

benjello commented 10 years ago

Sorry @y-p, this seems to be an already reported error. I am using 64-bit Python on a 64-bit machine, which is the problematic case according to this discussion: https://bitbucket.org/fomcl/savreaderwriter/issue/12/win64-error

kmfolgar commented 7 years ago

Hi all, any progress on this topic?
I searched a lot for this and didn't find an answer, so I use this small snippet to import .sav files into pandas, but it only works in Python 3.5:

import pandas as pd
import numpy as np
import savReaderWriter as spss

with spss.SavReaderNp("some_sav_file.sav") as reader:
    records = reader.all()
df = pd.DataFrame(records)
df.head()

Hope this works for someone.

Have a great day!

ozak commented 6 years ago

Lots of data is made available in SPSS so this tool would be very useful, especially for social scientists and economists. If the solution of @ThinkOnData works, it seems it should be an easy improvement. I will try it out and may submit a PR.

jukkahuhtamaki commented 6 years ago

I second the usefulness of read_spss and to_spss. I am currently entering a collaboration with a team of Information Systems scholars that use SPSS and UI-based Structural Equation Modeling tools.

Using SPSS to manage the master survey data seems to be a common approach (cf., Gaskin, 2016).

As the first step in introducing a computational approach to the collaboration, I am writing a script that preprocesses the survey data we have collected. Being able to easily read and write SPSS would certainly be helpful here.

jorisvandenbossche commented 6 years ago

For those interested in this issue, I think contributing better pandas support (or suggesting it) to the https://bitbucket.org/fomcl/savreaderwriter would be a good first step. If the package would have direct support to read a file into a DataFrame, and advertise this, I think it would already help a lot of people without needing to directly add it to pandas itself.

alexwbakker commented 6 years ago

SPSS is the only file format that is exportable from common survey tools like Qualtrics, SurveyGizmo, and SurveyMonkey that allows you to preserve both the values and the labels for many variables.

Survey data seems to always be represented as one row per response and one column per question or question choice. For single-select or open-ended questions, the values are usually coded as a single number per choice, so a column may hold a 1, 2, or 3 for male, female, or 'prefer not to state'. For multi-select questions, most survey tools export columns like Q2_1, Q2_2, Q2_3... for each of the possible choices, and each cell has a 1 for Selected and a null/SYSMIS/NaN value if it was not selected. Sometimes that missing data is instead coded as -99 or another value.
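To make that layout concrete, here is a small sketch in pandas. The column names, the data, and the choice of -99 as the only sentinel are all assumptions made for illustration, not the output of any particular survey tool.

```python
import numpy as np
import pandas as pd

# Hypothetical export: Q1 is single-select (1/2/3), Q2_* are multi-select flags.
survey = pd.DataFrame({
    "Q1":   [1, 2, 3, -99],
    "Q2_1": [1, np.nan, 1, np.nan],
    "Q2_2": [np.nan, 1, 1, -99],
    "Q2_3": [np.nan, np.nan, 1, np.nan],
})

# Treat the sentinel -99 as missing (the sentinel value is an assumption).
clean = survey.replace(-99, np.nan)

# Multi-select: a 1 means "selected"; anything missing counts as "not selected".
q2_cols = [c for c in clean.columns if c.startswith("Q2_")]
selected = clean[q2_cols].eq(1)

# Number of respondents who picked each choice.
choice_counts = selected.sum()
print(choice_counts)
```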

Finally, SPSS has 2 properties, VARIABLE LABELS and VALUE LABELS that contain the Strings that tend to correspond to question text/choice text.

If you export a survey file as CSV from any of those tools, you have to choose between the strings for each question (e.g. "Strongly Agree" will be in each cell where that was selected) and the numeric codes (e.g. a '5'). The trouble is that for many types of analysis you want both. In pandas, I think it would be easy to treat all of these as category labels with pd.Categorical.
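A rough sketch of keeping both representations side by side with `pd.Categorical`. The value labels and the `Q3` column are made up for the example; in a real SPSS file they would come from the VALUE LABELS metadata.

```python
import pandas as pd

# Hypothetical value labels for a 5-point agreement scale.
value_labels = {
    1: "Strongly Disagree",
    2: "Disagree",
    3: "Neutral",
    4: "Agree",
    5: "Strongly Agree",
}

codes = pd.Series([5, 4, 4, 2, 5], name="Q3")

# Ordered categorical: labels for display, scale order kept for comparisons.
labels = pd.Categorical(
    codes.map(value_labels),
    categories=list(value_labels.values()),
    ordered=True,
)

# Keep the numeric code and the label next to each other.
q3 = pd.DataFrame({"Q3_code": codes, "Q3_label": labels})
print(q3)
```

Keeping two columns is only one option; an ordered categorical alone also retains integer codes internally, although those start at 0 rather than at the survey's own coding.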

A common use case is to compute the average rating for a scale question, e.g. the mean response and standard deviation on a Disagree <---> Agree scale. But you may also want to produce a cross tab that shows the count/percent for each category column. SPSS can do this pretty well, but it is expensive, slow, and has a really lame internal syntax language.
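Both analyses can be sketched in pandas once the codes and labels are available. The data and column names below are invented for illustration.

```python
import pandas as pd

# Hypothetical responses with both the numeric code and its label.
responses = pd.DataFrame({
    "gender": ["male", "female", "female", "male", "female"],
    "Q3_code": [5, 4, 4, 2, 5],
    "Q3_label": ["Strongly Agree", "Agree", "Agree", "Disagree", "Strongly Agree"],
})

# The numeric codes give the scale summary.
summary = responses["Q3_code"].agg(["mean", "std"])

# The labels give a readable cross tab of counts per category.
counts = pd.crosstab(responses["gender"], responses["Q3_label"])

# Row percentages: share of each gender choosing each answer.
percents = pd.crosstab(
    responses["gender"], responses["Q3_label"], normalize="index"
)

print(summary, counts, percents, sep="\n\n")
```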

If pandas had full support for SPSS files, and could write them out, it would be very helpful for initial data exploration, question aggregation, data restructuring, and text analysis, and for then writing out the new file to leverage downstream tools like Wincross for reporting/analysis, which require SPSS files and are easy for business/non-technical people to use.

As someone that deals with a lot of survey data, I'm happy to talk/chat/answer any questions I can about this, and to test anything out in SPSS if it would help anyone. Note that I'm a novice when it comes to python/pandas, but I've been using SPSS for a long time and am looking to move away from it completely.

ofajardo commented 6 years ago

I have written a wrapper for the C library ReadStat, named pyreadstat, which reads SPSS sav, zsav and por files: github.com/Roche/pyreadstat

cbrnr commented 5 years ago

It would be great if this functionality was available directly from Pandas, e.g. via read_spss. @TomAugspurger @jreback @jorisvandenbossche (sorry for the explicit mentions, I don't know how to @ the whole dev team) would this be an option (given that this requires a C lib)? @ofajardo would you be willing to merge your code?

TomAugspurger commented 5 years ago

It’s more likely that we would have an optional dependency on that package that a read_spss would use. Similar to what we do with pyarrow and parquet.

On May 22, 2019, at 04:20, Clemens Brunner notifications@github.com wrote:

It would be great if this functionality was available directly from Pandas, e.g. via read_spss. @TomAugspurger @jreback @jorisvandenbossche (sorry for the explicit mentions, I don't know how to at the whole dev team) would this be an option (given that this requires a C lib)? @ofajardo would you be willing to merge your code?


ofajardo commented 5 years ago

@cbrnr @TomAugspurger I am willing to help and contribute

cbrnr commented 5 years ago

Great! So this could be as simple as adding pandas/io/spss.py and wrapping the relevant functions of pyreadstat (making sure to import it only within functions and not globally). Since only a data frame should be returned, some effort will probably be necessary to use the meta information for creating suitable column names, data types, and so on.
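A rough sketch of what such a wrapper might look like. The function name `read_spss_sketch`, its parameter names, and the use of `DataFrame.attrs` to carry variable labels are assumptions of this example, not the implementation that was eventually merged. The only pyreadstat pieces used are `read_sav` (with its `usecols` and `apply_value_formats` arguments) and the `column_names_to_labels` metadata attribute.

```python
import pandas as pd


def read_spss_sketch(path, usecols=None, convert_categoricals=True):
    """Read an SPSS file into a DataFrame via pyreadstat (illustrative only)."""
    # Import inside the function so pyreadstat stays an optional dependency.
    try:
        import pyreadstat
    except ImportError as err:
        raise ImportError(
            "pyreadstat is required to read SPSS files; "
            "install it with `pip install pyreadstat`."
        ) from err

    # read_sav returns (DataFrame, metadata); value labels replace the
    # numeric codes when apply_value_formats is True.
    df, meta = pyreadstat.read_sav(
        str(path),
        usecols=usecols,
        apply_value_formats=convert_categoricals,
    )

    # Keep the variable labels (question text) reachable while still
    # returning only a DataFrame. Using attrs here is this sketch's choice.
    df.attrs["variable_labels"] = dict(meta.column_names_to_labels)
    return df
```

Calling `read_spss_sketch("survey.sav", usecols=["Q1", "Q2_1"])` (with a hypothetical file name) would then return a plain DataFrame, and importing the module itself never requires pyreadstat to be installed.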

Also, pyreadstat can read SAS and Stata files, which Pandas already supports natively (but pyreadstat is much faster). I don't know how you would like to handle these file formats (ignore for now, create separate modules/functions, or integrate with existing readers), but I think for now just adding support for SPSS files would be a good plan.

TomAugspurger commented 5 years ago

Also, pyreadstat can read SAS and Stata files, which Pandas already supports natively (but pyreadstat is much faster). I don't know how you would like to handle these file formats (ignore for now,

I think ignore for now, but we can certainly revisit once we have spss taken care of. Once there's interest we can add an engine keyword to read_sas and read_stata.

allefeld commented 2 years ago

This issue shouldn't have been closed, since https://github.com/pandas-dev/pandas/pull/26537 covers only the reading part.

jreback commented 2 years ago

There is no write support anywhere, AFAIK.