mcaceresb / stata-parquet


stata-parquet

Read and write parquet files from Stata (Linux/Unix only).

This package uses the Apache Arrow C++ library to read and write parquet files from Stata using plugins. Currently this package is only available in Stata for Unix (Linux).

version 0.6.5 22Oct2023 | Installation | Usage | Examples

Installation

You need to first install:

Installation with Conda

First, install Google's logging library: libgoogle-glog-dev on Ubuntu, google-glog on Arch (you may have to link libglog.so to libglog.so.0), and so on. The only tested way to install this software is then via conda (see here for installation instructions; the most recent plugin installation and tests were conducted using Miniconda3 for Python 3.8, version 23.3.1):

git clone https://github.com/mcaceresb/stata-parquet
cd stata-parquet
conda env create -f environment.yml
conda activate stata-parquet

make SPI=3.0 GCC=${CONDA_PREFIX}/bin/x86_64-conda_cos6-linux-gnu-g++ UFLAGS=-std=c++11 INCLUDE=${CONDA_PREFIX}/include LIBS=${CONDA_PREFIX}/lib all
stata -b "net install parquet, from(${PWD}/build) replace"
rm -f stata.log
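
If the plugin looks for libglog.so.0 but only libglog.so is installed, the link mentioned above can be created with a small helper. This is a minimal sketch and not part of the package: the function name link_glog is hypothetical, and the library directory you pass to it (for example /usr/lib) is an assumption that depends on your distribution.

```shell
# Hypothetical helper (not part of stata-parquet): create libglog.so.0 as a
# symlink to libglog.so inside the given library directory, if it is missing.
link_glog() {
    libdir="$1"
    if [ ! -e "$libdir/libglog.so" ]; then
        echo "libglog.so not found in $libdir" >&2
        return 1
    fi
    # Leave an existing libglog.so.0 untouched.
    if [ ! -e "$libdir/libglog.so.0" ]; then
        ln -s libglog.so "$libdir/libglog.so.0"
    fi
}

# Example (assumed path; writing to a system directory usually requires root):
#   link_glog /usr/lib
```

The link target is relative, so it stays valid if the directory is later moved or mounted elsewhere.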

Note: If you have Stata 14.0 or earlier you will want to use SPI=2.0 instead.
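
The rule in the note can be written as a one-line check so that build scripts pick the SPI value automatically. This is only a sketch: spi_for_stata is a hypothetical name, and you have to supply the Stata version number yourself (the helper does not query Stata).

```shell
# Hypothetical helper: print the SPI value to pass to make for a given
# Stata version number (e.g. 13, 14.0, 14.2, 17).
spi_for_stata() {
    # awk does the floating-point comparison that plain sh cannot.
    awk -v v="$1" 'BEGIN { if (v + 0 <= 14.0) print "2.0"; else print "3.0" }'
}

# Example: make SPI="$(spi_for_stata 14.0)" ... all
```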

Warning: The plugin uses a possibly dated version of parquet (specifically parquet-cpp version 1.5.1 and arrow-cpp version 0.14.1).

Usage

Usage with Conda

Activate the Conda environment with

conda activate stata-parquet

Then be sure to start Stata via

LD_LIBRARY_PATH=${CONDA_PREFIX}/lib:$LD_LIBRARY_PATH xstata

Alternatively, add the following line to your ~/.bashrc so that you do not have to set LD_LIBRARY_PATH every time (make sure to replace ${CONDA_PREFIX} with the absolute path it represents):

export LD_LIBRARY_PATH=${CONDA_PREFIX}/lib:$LD_LIBRARY_PATH

Then just start Stata with

xstata
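
To avoid substituting the absolute path by hand, the ~/.bashrc line can be generated while the environment is active. A minimal sketch, assuming a POSIX shell; bashrc_line is a hypothetical helper name, not something the package provides.

```shell
# Hypothetical helper: print the export line with the Conda prefix already
# expanded, leaving $LD_LIBRARY_PATH itself to be expanded at login.
bashrc_line() {
    printf 'export LD_LIBRARY_PATH=%s/lib:$LD_LIBRARY_PATH\n' "$1"
}

# Example, run with the environment active:
#   bashrc_line "${CONDA_PREFIX}" >> ~/.bashrc
```

The single quotes keep $LD_LIBRARY_PATH literal in the output, so only the Conda prefix is fixed at the time the line is written.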

Examples

parquet save and parquet use will save and load datasets in Parquet format, respectively. parquet desc will describe the contents of a parquet dataset. For example:

sysuse auto, clear
parquet save auto.parquet, replace
parquet desc auto.parquet
parquet use auto.parquet, clear
desc

parquet use price make gear_ratio using auto.parquet, clear in(10/20)
parquet save gear_ratio make using auto.parquet in 5/6 if price > 5000, replace

Note that parquet use does not support the if clause. To test that the plugin works as expected, run do build/parquet_tests.do from Stata. To also test that the plugin correctly reads hive-format datasets, run

conda install -n stata-parquet pandas numpy fastparquet
conda activate stata-parquet

Then, from Stata, do build/parquet_tests.do python
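
The tests can also be run without opening the Stata interface. The sketch below assumes a console stata binary on your PATH (as used in the installation step) and relies on Stata's batch mode writing parquet_tests.log to the current directory. The helper log_is_clean is hypothetical, and it only looks for Stata's usual r(###); error lines, so treat a pass as a quick sanity check rather than proof that every test succeeded.

```shell
# Hypothetical helper: return 0 if a Stata log contains no error return code,
# 1 if it does, and 2 if the log cannot be read.
# Stata prints a line such as "r(198);" when a command fails.
log_is_clean() {
    [ -r "$1" ] || return 2
    if grep -Eq '^r\([0-9]+\);' "$1"; then
        return 1
    fi
    return 0
}

# Example (assumed invocation, run from the repository root):
#   stata -b do build/parquet_tests.do
#   log_is_clean parquet_tests.log && echo "no Stata errors in log"
```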

Limitations

TODO

Some features that ought to be implemented:

Some features that might not be implementable, but the user should be warned about them:

Improve:

License

stata-parquet is MIT-licensed.