jhamrick / dbtools

Tools for interfacing with SQLite databases
Other
34 stars 4 forks source link

Use Case for SQL VIEWs #15

Open JimCallahanOrlando opened 10 years ago

JimCallahanOrlando commented 10 years ago

WHY VIEWS?: Use Case for SQL VIEWs in Social Science data SQL VIEWS are like "syntactic sugar" for databases. SQL VIEWs cover up ugly, awkward stuff and make life easier.

SQL VIEWs cover up a lot of data wrangling, such as SQL JOINS, date parsing, string manipulation and calculations. The kind of code you only want to write once and never look at it again.

SQLite3 recommends storing dates as YYYY-MM-DD time strings. http://www.sqlite.org/lang_datefunc.html

Ever try write your own conversion from another date format to YYYY-MM-DD? Ever try to calculate an age from a YYYY-MM-DD string? Ever try to do both of the above in SQL? Its not too hard if you know string functions, but date calculations, like JOINS are not something one would want to do every time one sits down to analyze data.

Interactively, work out the convoluted SQL SELECT statement you need and then CREATE VIEW viewname AS SELECT ... . After that you simply use the viewname where you would have used a tablename and you never have to look at the convoluted SELECT statement again until something breaks or you are writing documentation.

Do you like writing JOINS? "I’m not terribly interested in executing cross-table queries" SQL JOINS are a pain to write (MS Access lets you draw a line between fields in different boxes representing tables), but JOINS have their uses:

  1. Lookup tables (code or acronym to full phrase)
  2. Ragged data (which would lead to sparse arrays)
  3. Matching data (example, a geographic region such as a country or US state)

Do "real data scientists" use JOINs? Take a look at Robert Gentleman and Thomas Lumley's use of SQLite and R.

In the book "R Programming for Bioinformatics" (2008) Robert Gentleman writes:

"Recently the Bioconductor Project has begun moving [gene]
annotation packages into a relational database format,
relying primarily on SQLite [page 238] ...the database contains 
a number of tables that map between identifiers on the 
Affymetrix HG-U95Av2 GeneChip and   the different quantities 
of interest such as the GO [Gene Ontology Consortium] 
categories  or PubMed IDs (that map published papers that 
discuss the corresponding genes)." page 241

While Thomas Lumley writes in his book, "Complex Surveys" (2010):

"BRFSS The Behavioral Risk factor Surveillance system is a telephone survey 
of behavioral risk factors for disease.  ...The number of states has increased
from 15 in 1984 to all 50 in 2007 (plus the District of Columbia, Guam,
Puerto Rico, and the US Virgin Islands) and the sample size 
from 12000 to 430000. It is now the worlds largest telephone survey [page 8]
... write it to SQLite with dbWriteTable(). With 430,000 records, this data set 
is close to the limit for interactive analysis [using R] on a small computer: 
analyses will run, but even simple estimates take a minute or so. Since BRFSS 
is  stratified by state at the first stage it is legitimate to work with 
a subset of the data ...For example to analyze California's BRFSS data we 
can create  a view containing only the data for California (FIPS code 6)" 
pages 246-247

So, real data scientists working in areas such as bioinformatics and public health use SQL VIEWS in SQLite3.