Read and write parquet files from Stata (Linux/Unix only).
This package uses the Apache Arrow C++ library to read and write parquet files from Stata using plugins. Currently this package is only available in Stata for Unix (Linux).
version 0.6.5 22Oct2023
First, install Google's logging library: libgoogle-glog-dev in Ubuntu, google-glog in Arch (you may have to link libglog.so to libglog.so.0), and so on. The only tested way to install this software is via conda (see the conda installation instructions; the most recent plugin installation and tests were conducted using Miniconda3 for Python 3.8, version 23.3.1):
git clone https://github.com/mcaceresb/stata-parquet
cd stata-parquet
conda env create -f environment.yml
conda activate stata-parquet
make SPI=3.0 GCC=${CONDA_PREFIX}/bin/x86_64-conda_cos6-linux-gnu-g++ UFLAGS=-std=c++11 INCLUDE=${CONDA_PREFIX}/include LIBS=${CONDA_PREFIX}/lib all
stata -b "net install parquet, from(${PWD}/build) replace"
rm -f stata.log
Note: If you have Stata 14.0 or earlier, you will want to use SPI=2.0 instead.
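As a rough illustration of the SPI note above, a build script could derive the value from the Stata version. This is only a sketch: spi_for_stata is a hypothetical helper that is not part of this package, and it assumes the version is passed as a string such as 14.0 or 17 (a bare 14 is treated as 14.0).

```shell
# Hypothetical helper (not part of stata-parquet): print the SPI value to
# pass to make, following the note "Stata 14.0 or earlier -> SPI=2.0".
spi_for_stata() {
    version=$1
    major=${version%%.*}              # text before the first dot
    case $version in
        *.*) minor=${version#*.} ;;   # text after the first dot
        *)   minor=0 ;;               # assumption: bare "14" means 14.0
    esac
    if [ "$major" -lt 14 ] || { [ "$major" -eq 14 ] && [ "$minor" -eq 0 ]; }; then
        echo 2.0
    else
        echo 3.0
    fi
}

# Usage sketch (remaining make arguments as in the install commands):
# make SPI=$(spi_for_stata 14.0) ... all
```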
Warning: The plugin uses a possibly dated version of parquet (specifically parquet-cpp version 1.5.1 and arrow-cpp version 0.14.1).
Activate the Conda environment with
conda activate stata-parquet
Then be sure to start Stata via
LD_LIBRARY_PATH=${CONDA_PREFIX}/lib:$LD_LIBRARY_PATH xstata
Alternatively, you could add the following line to your ~/.bashrc so that you do not have to set LD_LIBRARY_PATH every time (make sure to replace ${CONDA_PREFIX} with the absolute path it represents):
export LD_LIBRARY_PATH=${CONDA_PREFIX}/lib:$LD_LIBRARY_PATH
Then just start Stata with
xstata
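One detail of the LD_LIBRARY_PATH lines above: when the variable is initially unset, the result ends in a colon, and the dynamic loader treats an empty entry as the current directory. The sketch below avoids that. prepend_lib_path is a hypothetical helper, not something shipped with this package, and the usage line assumes the conda environment is active so that CONDA_PREFIX is set.

```shell
# Hypothetical helper: print an LD_LIBRARY_PATH value with DIR placed first,
# without leaving a trailing colon when LD_LIBRARY_PATH is unset or empty.
prepend_lib_path() {
    dir=$1
    if [ -n "${LD_LIBRARY_PATH:-}" ]; then
        printf '%s:%s\n' "$dir" "$LD_LIBRARY_PATH"
    else
        printf '%s\n' "$dir"
    fi
}

# Usage sketch (assumes `conda activate stata-parquet` was run first):
# LD_LIBRARY_PATH=$(prepend_lib_path "${CONDA_PREFIX}/lib") xstata
```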
parquet save and parquet use will save and load datasets in Parquet format, respectively. parquet desc will describe the contents of a parquet dataset. For example:
sysuse auto, clear
parquet save auto.parquet, replace
parquet desc auto.parquet
parquet use auto.parquet, clear
desc
parquet use price make gear_ratio using auto.parquet, clear in(10/20)
parquet save gear_ratio make using auto.parquet in 5/6 if price > 5000, replace
Note that the if clause is not supported by parquet use.
To test that the plugin works as expected, run do build/parquet_tests.do from Stata. To also test that the plugin correctly reads hive format datasets, run
conda install -n stata-parquet pandas numpy fastparquet
conda activate stata-parquet
Then, from Stata, run do build/parquet_tests.do python
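If you prefer to run the tests unattended, one possible approach is Stata's batch mode followed by a scan of the log for error codes. The sketch below makes assumptions: stata_log_has_error is a hypothetical helper, the log name parquet_tests.log is what batch mode is expected to write in the current directory, and errors are detected by lines of the form r(198); that Stata prints when a command fails.

```shell
# Hypothetical helper: succeed (exit 0) if the given Stata log contains an
# error return code line such as "r(198);", fail otherwise.
stata_log_has_error() {
    grep -Eq '^r\([0-9]+\);' "$1"
}

# Usage sketch (assumes the log is written to ./parquet_tests.log):
# stata -b do build/parquet_tests.do
# if stata_log_has_error parquet_tests.log; then
#     echo "parquet tests reported an error" >&2
# fi
```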
strL variables are not yet supported.
Int96 variables are not supported, as Int96 has no direct Stata counterpart.
Maximum string lengths are not stored in .parquet files (as far as I can tell). The default behavior is to scan string columns to get the largest string, but it can be time-intensive. Adjust this behavior via strscan() and strbuffer().
Some features that ought to be implemented:
skip for columns that are in non-readable formats?
Some features that might not be implementable, but the user should be warned about them:
strL variables
Improve:
stata-parquet is MIT-licensed.