mountetna / magma

Data server with friendly data loaders
GNU General Public License v2.0
5 stars 2 forks source link

/query excludes rows with empty data #100

Closed graft closed 5 years ago

graft commented 5 years ago

Many queries take the form of returning a "Vector" predicate (which should probably be renamed to "Table" predicate or something) - the intent is to return an array of results for each row in the query. E.g.

[ model, ...filters, '::all', [ result1, result2, result3 ] ]

where result1, etc., map the returned model items to output values (e.g. [ model_column ] being the simplest possible mapping).

Often these results will map columns from same table several times; for example, if I have a 'sample' model, I might want to map several 'sample.gene_exp.count' values, one for each value of 'gene_exp.name', i.e., for each sample I might want to know the counts for GENE1, GENE2, and GENE3. The resulting query might look like this:

[ 'sample', ['sample_name', '::matches', 'BOB' ], '::all',
  [
    [ 'gene_exp', [ 'name', '::equals', 'GENE1' ], '::first', 'count' ],
    [ 'gene_exp', [ 'name', '::equals', 'GENE2' ], '::first', 'count' ],
    [ 'gene_exp', [ 'name', '::equals', 'GENE3' ], '::first', 'count' ]
  ]
]

This query produces a SQL statement which uses LEFT OUTER JOIN to join the sample table to the gene_exp table three times (with three different aliases for gene_exp). Currently the statement will join on the foreign key (i.e. ON sample.id = gene_exp1.sample_id); subsequently the joined table will be filtered using WHERE clauses to reduce it to the relevant rows, i.e. WHERE gene_exp1.name = 'GENE1' AND gene_exp2.name = 'GENE2' AND gene_exp3.name = 'GENE3'

In the situation where the gene_exp table lacks a row containing name='GENE1' but has a row containing name='GENE2' and 'GENE3', the WHERE condition will fail (there is no row in the join table satisfying all three AND clauses), and there will be no data returned. This is not what we want and defeats the whole purpose of using a LEFT OUTER JOIN, which is that we want there to be NULL rows when data is missing.

A solution is to move the WHERE conditions into the ON clause; this will mean the table will return rows even with missing data.

graft commented 5 years ago

My initial attempt produced some problems; while moving the WHERE conditions into the ON clause solved the initial problem it produced a bunch of other failing tests. The problem is that now we are unable to eliminate rows on filtering, because all of the filters are now being LEFT OUTER JOINed - clauses that are meant to reduce rows in the table instead retain the row with a NULL value.

It seems then that we only want to move conditions within the Vector predicate into ON clauses - because we want null values to appear in columns of our final table. But outside the Vector predicate (filters on the 'sample' table, for example) we want to actually eliminate rows, thus we do NOT want the condition to fall within the OUTER JOIN and produce a NULL entry.

The solution is to only move WHERE conditions to ON for the conditions within the Vector predicate.