WHY VIEWS?: Use Case for SQL VIEWs in Social Science data
SQL VIEWS are like "syntactic sugar" for databases. SQL VIEWs cover up
ugly, awkward stuff and make life easier.
SQL VIEWs cover up a lot of data wrangling, such as SQL JOINS, date parsing,
string manipulation and calculations. The kind of code you only want to
write once and never look at it again.
Ever try write your own conversion from another date format to YYYY-MM-DD?
Ever try to calculate an age from a YYYY-MM-DD string?
Ever try to do both of the above in SQL?
Its not too hard if you know string functions, but date calculations, like JOINS
are not something one would want to do every time one sits down to analyze data.
Interactively, work out the convoluted SQL SELECT statement you need
and then CREATE VIEW viewname AS SELECT ... . After that you simply
use the viewname where you would have used a tablename and you never have
to look at the convoluted SELECT statement again until something breaks
or you are writing documentation.
Do you like writing JOINS?
"I’m not terribly interested in executing cross-table queries"
SQL JOINS are a pain to write (MS Access lets you draw a line between
fields in different boxes representing tables), but JOINS have their
uses:
Lookup tables (code or acronym to full phrase)
Ragged data (which would lead to sparse arrays)
Matching data (example, a geographic region such as a country or US state)
Do "real data scientists" use JOINs?
Take a look at Robert Gentleman and Thomas Lumley's use of SQLite and R.
In the book "R Programming for Bioinformatics" (2008) Robert Gentleman writes:
"Recently the Bioconductor Project has begun moving [gene]
annotation packages into a relational database format,
relying primarily on SQLite [page 238] ...the database contains
a number of tables that map between identifiers on the
Affymetrix HG-U95Av2 GeneChip and the different quantities
of interest such as the GO [Gene Ontology Consortium]
categories or PubMed IDs (that map published papers that
discuss the corresponding genes)." page 241
While Thomas Lumley writes in his book, "Complex Surveys" (2010):
"BRFSS The Behavioral Risk factor Surveillance system is a telephone survey
of behavioral risk factors for disease. ...The number of states has increased
from 15 in 1984 to all 50 in 2007 (plus the District of Columbia, Guam,
Puerto Rico, and the US Virgin Islands) and the sample size
from 12000 to 430000. It is now the worlds largest telephone survey [page 8]
... write it to SQLite with dbWriteTable(). With 430,000 records, this data set
is close to the limit for interactive analysis [using R] on a small computer:
analyses will run, but even simple estimates take a minute or so. Since BRFSS
is stratified by state at the first stage it is legitimate to work with
a subset of the data ...For example to analyze California's BRFSS data we
can create a view containing only the data for California (FIPS code 6)"
pages 246-247
So, real data scientists working in areas such as bioinformatics and public health
use SQL VIEWS in SQLite3.
WHY VIEWS?: Use Case for SQL VIEWs in Social Science data SQL VIEWS are like "syntactic sugar" for databases. SQL VIEWs cover up ugly, awkward stuff and make life easier.
SQL VIEWs cover up a lot of data wrangling, such as SQL JOINS, date parsing, string manipulation and calculations. The kind of code you only want to write once and never look at it again.
SQLite3 recommends storing dates as YYYY-MM-DD time strings. http://www.sqlite.org/lang_datefunc.html
Ever try write your own conversion from another date format to YYYY-MM-DD? Ever try to calculate an age from a YYYY-MM-DD string? Ever try to do both of the above in SQL? Its not too hard if you know string functions, but date calculations, like JOINS are not something one would want to do every time one sits down to analyze data.
Interactively, work out the convoluted SQL SELECT statement you need and then CREATE VIEW viewname AS SELECT ... . After that you simply use the viewname where you would have used a tablename and you never have to look at the convoluted SELECT statement again until something breaks or you are writing documentation.
Do you like writing JOINS? "I’m not terribly interested in executing cross-table queries" SQL JOINS are a pain to write (MS Access lets you draw a line between fields in different boxes representing tables), but JOINS have their uses:
Do "real data scientists" use JOINs? Take a look at Robert Gentleman and Thomas Lumley's use of SQLite and R.
In the book "R Programming for Bioinformatics" (2008) Robert Gentleman writes:
While Thomas Lumley writes in his book, "Complex Surveys" (2010):
So, real data scientists working in areas such as bioinformatics and public health use SQL VIEWS in SQLite3.