kshedden / datareader

Read binary SAS (SAS7BDAT) and Stata (dta) files in the Go (Golang) programming language. Also provides command line tools for working with these file formats.
BSD 3-Clause "New" or "Revised" License

Benchmarks? #6

Open kylebarron opened 5 years ago

kylebarron commented 5 years ago

Hello!

I just came across this package. I'm aware that Go is generally very fast. Do you happen to know how the read speeds for Stata files with this package compare to Stata or Python? Is the reading multithreaded?

Also, "simple column-oriented data container" caught my eye. I'm especially curious whether this data structure is similar to one that can be written by the parquet-go package. Since Parquet is a column-oriented file format, I'm guessing that reading Stata files with your package and writing them with parquet-go could be much faster than my current Python code for that conversion.

kshedden commented 5 years ago

@kylebarron I haven't done any benchmarks, as I was focused more on correctness. However, I believe it is reasonably fast, and I routinely use it to process hundreds of GB of data.

I have been involved with maintaining the pandas readers for SAS and Stata (although I was not the primary author for either). I have used the SAS reader much more than the Stata reader, and it is clearly much faster here (in Go) than it is in Python/Pandas. The Stata dta file format is more amenable to vectorized processing, which makes a big difference in Python, so the advantage of using Go might be less for Stata files compared to SAS files (SAS7BDAT is not at all friendly to vectorization).

The column-oriented data structure used here is modeled on Bcolz (https://github.com/Blosc/bcolz), though not interchangeable with it. I believe this is much simpler than Parquet. In any case, I evolved this into a different columnar container called Dstream (https://github.com/kshedden/dstream) which I actively develop and maintain. The container here is mainly for the internal use of the Stata and SAS readers in this package.

Regarding concurrency, the readers work through the SAS/Stata files in chunks, and each chunk has its own backing memory, so you can process one chunk while reading the next. The reading itself is not concurrent (I don't think making it concurrent would help, as it is I/O-bound).

kylebarron commented 5 years ago

Thanks for all that information!