rOpenSpain / spanishoddata

Access national high-quality and open-access datasets on movement patterns derived from mobile telephone datasets
https://ropenspain.github.io/spanishoddata/

simplify duckdb helpers that create tables from CSV files #91

Open e-kotov opened 2 weeks ago

e-kotov commented 2 weeks ago

Just thinking out loud here.

Currently we have an awesome package structure: much of the modularity, flexibility, and conciseness of the functions comes from a set of .sql files that do most of the heavy lifting. These same files also let us make the package multilingual, since we can keep a separate set of .sql files per language and get tables translated on the fly into any language without touching the R code.

However, there are still quite a lot of .sql files. That is because we currently have a single .sql file per significant action, such as mapping a folder of CSV files into a DuckDB table, creating ENUMs, or creating a clean table, and a separate set of such files for each spatial granularity. Internally, because the datasets differ a bit, we also have at least 3 R functions tailored to the "origin-destination", "number of trips", and "overnight stays" datasets. Each of these R functions handles a slightly different workflow, but they also have commonalities.

So maybe it is a good idea to refactor the package code so that the logic of what is done with the raw CSV data is handled to an even greater extent in the .sql files, as these can contain any number of step-by-step operations. We will still need to take the values of some R variables and inject them into the SQL statements we load from the .sql files, but this may lead to significantly more concise and therefore more maintainable code.
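To make the idea concrete, here is a rough sketch of what "one multi-step .sql file plus injected values" could look like. It is written in Python rather than R purely for illustration, and nothing in it is the package's actual code: the template text, the placeholder names (`prefix`, `csv_glob`, `zones_csv`), the column names, and the helpers `sql_literal`, `sql_identifier`, and `render_statements` are all hypothetical. The SQL is only rendered here, not executed.

```python
import re
from string import Template

# Hypothetical contents of ONE multi-step .sql file (not the package's real SQL).
# ${...} placeholders stand in for values supplied by the calling code.
SQL_TEMPLATE = Template("""
CREATE VIEW ${prefix}_csv AS
SELECT * FROM read_csv(${csv_glob}, hive_partitioning = true);

CREATE TYPE ${prefix}_zone_enum AS ENUM (
    SELECT DISTINCT id FROM read_csv(${zones_csv})
);

CREATE VIEW ${prefix}_clean AS
SELECT CAST(origin AS ${prefix}_zone_enum) AS id_origin, n_trips
FROM ${prefix}_csv;
""")


def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def sql_identifier(name: str) -> str:
    """Allow only simple identifiers, since they are pasted in unquoted."""
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return name


def render_statements(template: Template, prefix: str, **paths: str) -> list[str]:
    """Fill the template and split it into individual statements.

    Splitting on ';' is naive: it assumes no semicolons inside string
    literals or comments, which holds for this sketch only.
    """
    values = {"prefix": sql_identifier(prefix)}
    values.update({key: sql_literal(path) for key, path in paths.items()})
    sql = template.substitute(values)  # raises KeyError on a missing placeholder
    return [part.strip() for part in sql.split(";") if part.strip()]


# Assumed example values; the paths are made up.
statements = render_statements(
    SQL_TEMPLATE,
    prefix="od_districts",
    csv_glob="raw/od/districts/**/*.csv.gz",
    zones_csv="raw/zones/districts.csv",
)
for statement in statements:
    print(statement.splitlines()[0])
```

With something like this, the per-granularity differences would live in the values passed in (here just the prefix and two paths) rather than in separate .sql files, and each rendered statement could be sent to the database in order. Whether that actually reduces the number of files depends on how different the workflows for the three dataset types really are.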

Perhaps, since currently everything seems to be working fine (at least in the https://github.com/rOpenSpain/spanishoddata/tree/v2-codebook branch), this is not a priority for the first stable release.

Robinlovelace commented 2 weeks ago

Moving more of the code to .sql could make the package easier to port to other languages and easier to maintain. I like the idea, but I wonder whether implementing it would take more developer time than it would save through easier maintenance.