tidy-finance / website

This repository hosts the source code for the website https://tidy-finance.org.

blog/using-duckdb-with-wrds #47

Closed. christophscheuch closed this 9 months ago.

christophscheuch commented 1 year ago

Created folder for post and changed document header for blog format.

Render failed because of the following error:

Error: rapi_execute: Failed to run query
Error: HTTP Error: Failed to download extension "postgres_scanner" at URL "http://extensions.duckdb.org/v0.8.0/windows_amd64_rtools/postgres_scanner.duckdb_extension.gz"
Extension "postgres_scanner" is an existing extension.

Are you using a development build? In this case, extensions might not (yet) be uploaded.

I could not get the postgres scanner working for isolated test databases. Any hints? Otherwise, I'll create an issue with the duckdb developers.

christophscheuch commented 1 year ago

@iangow welcome as one of the first Tidy Finance contributors 👋

The idea is that @voigtstefan, @patrick-weiss and I will review your blog post and maybe ask you to clarify a few things. However, I guess there won't be many comments as your post is quite mature at this stage :)

I'll try to resolve the DuckDB issues with R that I described above as soon as possible. The odd thing is that the scan pushdown to DuckDB works for Python. Do you have any experience with the error I posted above?

iangow commented 1 year ago

> Created folder for post and changed document header for blog format.
>
> Render failed because of the following error:
>
> Error: rapi_execute: Failed to run query
> Error: HTTP Error: Failed to download extension "postgres_scanner" at URL "http://extensions.duckdb.org/v0.8.0/windows_amd64_rtools/postgres_scanner.duckdb_extension.gz"
> Extension "postgres_scanner" is an existing extension.
>
> Are you using a development build? In this case, extensions might not (yet) be uploaded.
>
> I could not get the postgres scanner working for isolated test databases. Any hints? Otherwise, I'll create an issue with the duckdb developers.

A little beyond my ken, I'm afraid. There is a once-per-instance install postgres_scanner step that is needed, but your message is consistent with this step not having been taken.
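For reference, that once-per-instance step could be sketched like this in R (a hedged sketch, assuming the DBI and duckdb packages are installed):

```r
library(DBI)

# Connect to an in-memory DuckDB instance.
con <- dbConnect(duckdb::duckdb())

# INSTALL downloads the extension once per DuckDB installation;
# LOAD activates it for the current session.
dbExecute(con, "INSTALL postgres_scanner")
dbExecute(con, "LOAD postgres_scanner")

dbDisconnect(con, shutdown = TRUE)
```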

iangow commented 1 year ago

> @iangow welcome as one of the first Tidy Finance contributors 👋
>
> The idea is that @voigtstefan, @patrick-weiss and I will review your blog post and maybe ask you to clarify a few things. However, I guess there won't be many comments as your post is quite mature at this stage :)
>
> I'll try to resolve the DuckDB issues with R that I described above as soon as possible. The odd thing is that the scan pushdown to DuckDB works for Python. Do you have any experience with the error I posted above?

I suspect a little more detail on what DuckDB is about might be useful in the blog post. I view this as "advanced" material in that it's probably more relevant for someone who has gone through (relevant parts of) Tidy Finance and is looking for performant ways to build those ideas into their research workflow.

TBH, I only discovered DuckDB recently, but it is easy to swap in for the SQLite backend used in Tidy Finance. (I have used PostgreSQL with R since before dbplyr even existed.)
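For what it's worth, the swap is essentially a one-line change (a sketch; the file paths are illustrative):

```r
library(DBI)

# SQLite backend as used in Tidy Finance:
# tidy_finance <- dbConnect(RSQLite::SQLite(), "data/tidy_finance.sqlite")

# DuckDB as a drop-in replacement; downstream dplyr/dbplyr code is unchanged:
tidy_finance <- dbConnect(duckdb::duckdb(), "data/tidy_finance.duckdb")
```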

I believe (but this is a guess TBH) that DuckDB is installed separately for each "client API". So Python has a separate installation from R.

christophscheuch commented 1 year ago

FYI I created an issue with DuckDB devs: https://github.com/duckdb/duckdb/issues/7892

I want this issue to be resolved before we publish the article because I fear that people might not be able to reproduce the promised benefits of DuckDB, and I don't want them to turn away disappointed.

iangow commented 1 year ago

@christophscheuch I did a little editing (in my own repository) of the blog post today (along lines discussed above). But I guess we are still waiting on resolution of https://github.com/duckdb/duckdb/issues/7892.

christophscheuch commented 1 year ago

> @christophscheuch I did a little editing (in my own repository) of the blog post today (along lines discussed above). But I guess we are still waiting on resolution of duckdb/duckdb#7892.

Could you also push those changes here on the branch? You should have the corresponding permissions.

Indeed, we are still waiting for a resolution of the bug because people would not be able to replicate the blog post. Alternatively, we could try an old duckdb version and add a note that points out my issue in the post - which version do you have installed @iangow?

christophscheuch commented 1 year ago

@iangow which duckdb version do you use? I could try with your version and add a corresponding note to the blog post. I still can't make it work via R on Windows and from the last response that I received on my issue it seems there are no immediate plans to change that (because the compiler is "esoteric").

iangow commented 1 year ago

> @iangow which duckdb version do you use? I could try with your version and add a corresponding note to the blog post. I still can't make it work via R on Windows and from the last response that I received on my issue it seems there are no immediate plans to change that (because the compiler is "esoteric").

I think it's at 0.8.1 now. I only have easy access to MacOS and Linux. No issues there. I'm happy to sit on this until the Windows issue is resolved. (I guess the "good news" is that this issue affects all the DuckDB extensions—including the spatial and sqlite ones—so some chance someone will be motivated to address it.)

It looks like DuckDB increments minor versions every three months or so, so perhaps 0.9.0 will come in September.

iangow commented 1 year ago

> @christophscheuch I did a little editing (in my own repository) of the blog post today (along lines discussed above). But I guess we are still waiting on resolution of duckdb/duckdb#7892.
>
> Could you also push those changes here on the branch? You should have the corresponding permissions.
>
> Indeed, we are still waiting for a resolution of the bug because people would not be able to replicate the blog post. Alternatively, we could try an old duckdb version and add a note that points out my issue in the post - which version do you have installed @iangow?

I will push the changes here when I next look at this. I had to make a few changes to reflect changes in the underlying DuckDB, so it might make sense to tackle this when the Windows issue is addressed. Hopefully the underlying changes in DuckDB slow down as it converges to 1.0.0 (so that the blog post has a longer half-life).

(Note that the latest version of the note [on which the blog post is based] I have is a single Quarto file here.)

christophscheuch commented 1 year ago

Out of curiosity, I just installed duckdb versions 0.7.0 and 0.6.0 to check whether postgres_scanner works there. I can at least install it, but when I run dbExecute(tidy_finance, "LOAD postgres_scanner"), my R session always crashes :( So I also support waiting for future versions.

iangow commented 1 year ago

> Out of curiosity, I just installed duckdb versions 0.7.0 and 0.6.0 to check whether postgres_scanner works there. I can at least install it, but when I run dbExecute(tidy_finance, "LOAD postgres_scanner"), my R session always crashes :( So I also support waiting for future versions.

0.9.0 just landed (not yet on CRAN). So it might be worth trying that out. (I couldn't see anything explicitly addressing the Windows issues in the news for that release, but who knows?)

christophscheuch commented 11 months ago

@iangow FYI, we recently changed the download of crsp_daily to a batched job, so now it also takes just 5 minutes using SQLite. If you want, we can proceed with the blog post as you revised it, comparing DuckDB with other approaches for aggregation rather than using its power for downloading data.

iangow commented 11 months ago

OK. I tweaked my version of the blog post to reflect this change. See PDF here (source code). The download time fluctuates a bit for me: sometimes under 3 minutes, sometimes over. The main benefit might be that the code is much simpler.

Apart from removing postgres_scanner, I also played around with parquet files and also Python. There appears to be merit in storing data in parquet format even if using DuckDB. With Python, one can see big performance jumps so long as one uses pyarrow.parquet rather than Pandas. Though there are annoying language differences (e.g., group_by versus groupby and aggregate versus agg) and I am not sure how elaborate one can get with that library in terms of computations. (In this regard, I think dbplyr has no close parallel in Python.)

My reason for using parquet files was that the DuckDB database format is too unstable at this point. For the approach you use in the book (where SQLite serves as a pretty "dumb" data store), I think parquet has a lot of merit.
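As a sketch of the parquet-plus-DuckDB pattern (the file name and columns are illustrative; assumes the arrow, duckdb, and dplyr packages):

```r
library(arrow)
library(duckdb)
library(dplyr)

# Store a data frame as a parquet file with arrow.
write_parquet(mtcars, "mtcars.parquet")

# Query the parquet file lazily through DuckDB; only the aggregated
# result is pulled into R.
con <- dbConnect(duckdb::duckdb())
tbl(con, sql("SELECT * FROM read_parquet('mtcars.parquet')")) |>
  group_by(cyl) |>
  summarize(mean_mpg = mean(mpg, na.rm = TRUE)) |>
  collect()
```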

iangow commented 11 months ago

As an aside, I will eventually have some material on parquet files and DuckDB here. In the rest of the book, I use data from either WRDS or small data sets in my farr package (available on CRAN). But in this chapter I dig a bit into EDGAR data and also some data from Loughran and MacDonald.

Another illustration of parquet/DuckDB is found here. Basically I dumped the data from the PostgreSQL database to parquet files to make the paper "fully reproducible" from GitHub. (Alas some database tables were updated after publication, so the numbers don't tie exactly in some tables in the paper.)

christophscheuch commented 11 months ago

Did you ever check out polars? It seems to be in tight competition with other fast data libraries, and it can also handle parquet files: https://pola-rs.github.io/polars/.

We will nonetheless stick to SQLite because it is very popular and easily available for academics and industry readers (some companies have restrictions on allowed packages, in particular with Python). Moreover, we deliberately emphasize simplicity over performance because saving a few minutes here and there is typically not the biggest issue in research or education.

iangow commented 11 months ago

> Did you ever check out polars? It seems to be in tight competition with other fast data libraries, and it can also handle parquet files: https://pola-rs.github.io/polars/.

Looks interesting. I feel the difference with dbplyr is the ability to work with all sorts of data (SQLite, PostgreSQL, parquet, CSV) with a single interface. There doesn't seem to be quite the equivalent of that in Python.
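To illustrate the "single interface" point, here is a hedged sketch in which the same pipeline runs unchanged against different backends (the table name, columns, paths, and credentials are hypothetical):

```r
library(DBI)
library(dplyr)

# The pipeline is backend-agnostic; only the connection differs.
summarize_returns <- function(con) {
  tbl(con, "crsp_daily") |>
    group_by(permno) |>
    summarize(mean_ret = mean(ret, na.rm = TRUE)) |>
    collect()
}

# summarize_returns(dbConnect(RSQLite::SQLite(), "tidy_finance.sqlite"))
# summarize_returns(dbConnect(duckdb::duckdb(), "tidy_finance.duckdb"))
# summarize_returns(dbConnect(RPostgres::Postgres(), dbname = "wrds"))
```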

> We will nonetheless stick to SQLite because it is very popular and easily available for academics and industry readers (some companies have restrictions on allowed packages, in particular with Python). Moreover, we deliberately emphasize simplicity over performance because saving a few minutes here and there is typically not the biggest issue in research or education.

Maybe in a few years, parquet will be even more standard than CSV. (See R4DS on this.) Fortunately, I think it would be easy to switch over Tidy Finance then (much as I did with moving our paper from PostgreSQL to parquet; see here).

I think the benefit of "remote" tables comes more in quotidian "quality of life" issues when exploring data (fractions of a second versus seconds). But I fully recognise the trade-offs faced in writing a book. We use WRDS PostgreSQL much more in our book than I would recommend in "real life". I'm not even sure if I'm pushing things too far bringing in parquet/DuckDB for one chapter.

(I did read somewhere that "Stata[?!] is better for big data than R" … so at least the parquet/DuckDB material could serve to dispel that myth … I think it would be a real pain to do that work in Stata.)

iangow commented 11 months ago

Now that we seem to have a path forward, I will look to finalise the blog post in a couple of weeks.

iangow commented 10 months ago

OK. I just pushed the revised blog post (this has no requirement for postgres_scanner). I am not sure how to get all the machinery for this site working (seems you use renv and so on and I don't want to be updating packages that other parts of the site depend on). I think that duckdb and arrow would be the only incremental packages I use. For me, compiling this from scratch is about 5 minutes (most of the time in the steps with system_time() attached to them).

Note that I include some Python code at the end, but the output is hard-coded there. This is really to illustrate that the same performance differences carry over to Python.

The blog post is copy-pasted from here. Only the YAML data at the top differs. The compiled PDF is available here.

Some of the benchmark numbers will move around when compiled on the system you used for your blog. But hopefully, the broad strokes remain correct. The one benchmark that seems particularly unstable is the "download CRSP daily" one … this can be 150 seconds or 210 seconds even with 1Gbps wired internet held constant. But I don't really discuss that benchmark.

iangow commented 10 months ago

Related to this blog post, you may find a couple of recent LinkedIn posts of interest.

iangow@Ians-MacBook-Pro notes % time quarto render era_python.qmd --to pdf --quiet
quarto render era_python.qmd --to pdf --quiet  11.27s user 3.30s system 68% cpu 21.295 total

iangow commented 10 months ago

This might work as a kind of follow-up note illustrating features of DuckDB.

christophscheuch commented 9 months ago

@iangow I reviewed your post again and implemented a few small changes. In particular, I removed the last paragraph that introduces Python as I think it digresses too much from the main point of using DuckDB with R. In addition, I added a thumbnail image. If you approve my changes, I'm happy to merge this PR and finally publish your post :)

iangow commented 9 months ago

Sure. That makes sense. I could make a separate entry for Python with DuckDB (there are some qualitative differences there).

I did see the following (not sure if this is intentional):

<div class="listing-author">
Ian Gow Scheuch

christophscheuch commented 9 months ago

> Sure. That makes sense. I could make a separate entry for Python with DuckDB (there are some qualitative differences there).
>
> I did see the following (not sure if this is intentional):
>
> <div class="listing-author">
> Ian Gow Scheuch

Good catch, this came out of a copy-paste error, but I already corrected it in a follow-up commit :)