jcushman / xport

[project has MOVED to https://github.com/selik/xport/]
MIT License

========
Xport
========


Python reader for SAS XPORT data transport files.

What's it for?
==============

XPORT is the binary file format used by a bunch of `United States government agencies`_ for publishing data sets. It made a lot of sense if you were trying to read data files on your IBM mainframe back in 1988.

.. _United States government agencies: https://www.google.com/search?q=site:.gov+xpt+file

How do I use it?
================

Let's make this short and sweet::

    import xport

    # xport_file is the XPORT (.xpt) file you want to read
    with xport.XportReader(xport_file) as reader:
        for row in reader:
            print(row)

Each row will be a dict with a key for each field in the dataset. Values will be either a unicode string, a float or an int, depending on the type specified in the file for that field.
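
Because each row is a plain dict, it plugs straight into the standard library. Here is a minimal sketch (Python 3) that dumps rows to CSV. ``rows_to_csv`` is a hypothetical helper, not part of this library, and the sample rows with their field names are made up so the snippet runs without an XPORT file. With a real file you would pass the reader itself as ``rows``.

```python
import csv
import io


def rows_to_csv(rows, out):
    """Write an iterable of dict rows to ``out`` as CSV; return the row count."""
    writer = None
    count = 0
    for row in rows:
        if writer is None:
            # Take the column order from the first row's keys.
            writer = csv.DictWriter(out, fieldnames=list(row.keys()))
            writer.writeheader()
        writer.writerow(row)
        count += 1
    return count


# Stand-in rows shaped like the dicts described above (made-up field names).
sample_rows = [
    {"NAME": "alpha", "AGE": 31, "SCORE": 4.5},
    {"NAME": "beta", "AGE": 27, "SCORE": 3.0},
]
buffer = io.StringIO()
written = rows_to_csv(sample_rows, buffer)
print(buffer.getvalue())
```

The helper streams one row at a time, so it should not need to hold the whole data set in memory.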

Getting file info
-----------------

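Once you have an ``XportReader`` object, a few properties and methods give you details about the file. The examples in the next section use ``reader.file``, ``reader.record_start``, ``reader.record_length`` and ``reader.record_count()``.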

Random access to records
------------------------

If you want to access specific records, instead of iterating, you can use Python's standard file access functions and a little math.

Get the 1000th record (offsets are zero-based, so multiply by 999)::

    reader.file.seek(reader.record_start + reader.record_length * 999, 0)
    next(reader)

Get the record before the most recent one fetched::

    reader.file.seek(-reader.record_length * 2, 1)
    next(reader)

Get the last record::

    reader.file.seek(reader.record_start + reader.record_length * (reader.record_count() - 1), 0)
    next(reader)

(In this last example, note that we can't seek from the end of the file, because there may be padding bytes. Good old fixed-width binary file formats.)
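
The offset math above can be wrapped in a small helper. This is only a sketch: ``record_offset`` and ``seek_to_record`` are hypothetical names, not part of this library, and ``fake_reader`` is a stand-in that fakes just the ``file``, ``record_start``, ``record_length`` and ``record_count()`` attributes used in the examples, so the snippet runs without an XPORT file. The header size and record contents are invented for illustration.

```python
import io
from types import SimpleNamespace


def record_offset(record_start, record_length, index):
    """Byte offset of the record at zero-based ``index``."""
    if index < 0:
        raise ValueError("index must be non-negative")
    return record_start + record_length * index


def seek_to_record(reader, index):
    """Position reader.file at the start of record ``index``; return the offset."""
    if index >= reader.record_count():
        raise IndexError("record index out of range")
    offset = record_offset(reader.record_start, reader.record_length, index)
    reader.file.seek(offset, 0)
    return offset


# Stand-in for a reader: 160 made-up header bytes, three 8-byte records,
# then blank padding at the end of the file.
payload = b"H" * 160 + b"record00" + b"record01" + b"record02" + b" " * 56
fake_reader = SimpleNamespace(
    file=io.BytesIO(payload),
    record_start=160,
    record_length=8,
    record_count=lambda: 3,
)

# Seek from the start of the data, never from the end, because of the padding.
last = fake_reader.record_count() - 1
seek_to_record(fake_reader, last)
print(fake_reader.file.read(fake_reader.record_length))
```

With a real reader you would presumably call ``next(reader)`` after ``seek_to_record(reader, index)`` instead of reading raw bytes.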

Please fix/steal this code!
===========================

I wrote this up because it seemed ridiculous that there was no easy way to read a standard government data format in most programming languages. I may have gotten things wrong. If you find a file that doesn't decode properly, send a pull request. `The official spec is here`_. It's surprisingly straightforward for a binary file format from the 80s.

.. _The official spec is here: http://support.sas.com/techsup/technote/ts140.html

Please also feel free to use this code as a base to write your own library for your favorite programming language. Government data should be accessible, man.