uwdata / mosaic

An extensible framework for linking databases and interactive views.
https://idl.uw.edu/mosaic
Other
851 stars 56 forks source link

Query that filters across two tables #344

Open kolmakova opened 7 months ago

kolmakova commented 7 months ago

Hi! I am building a custom MosaicClient for my own visualization components. I have two Selection instances using crossfilter - one for table1 and another for table2. Each Selection can have few histograms with own conditions. I'm wondering if it's possible to create a filter that queries across both Selections on the different tables?

The case is that table2 has ids and corresponding values, while table1 has links between the ids in table2 and values for each link - so table1.source and table1.target columns are the same type as table2.id.

I need to do something like this query below but Query has no join method and I have no idea how to tell Query how it should compare table2.id with table1.source and table1.source using existing methods.

SELECT 
  table2.value, 
  table2.id,
FROM table2 JOIN table1 ON table2.id = table1.source OR table2.id = table1.target 
WHERE 
  table2.value >= 46 AND table2.value <= 50
  ...another crossfiltered conditions from table2 Selection
AND
  table1.value >= 0 AND table1.value <= 10
  ...another crossfiltered conditions from table1 Selection

I've also tried to construct my own query class that extends Query with a .join() method. However, I could not construct this query completely using only Query methods, there are still parts that require string construction:

My extended Query class is:

export class SuperQuery extends Query {
  constructor () {
    super()
    // Initialize the join array in the new class
    this.query.join = []
  }

  join (table, condition) {
    this.query.join.push({ table, condition })
    return this
  }

  toString () {
  // First, build the SELECT and FROM clauses using the parent toString method
    const selectFromSql = super.toString().split(' WHERE ')[0]

    // Then, build the JOIN clauses if any
    let joinSql = ''
    if (this.query.join.length) {
      joinSql = this.query.join.map(({ table, condition }) => `JOIN ${table} ON ${condition}`).join(' ')
    }

    // Next, build the WHERE clause if any
    const whereClause = super.toString().split(' WHERE ')[1]
    let whereSql = ''
    if (whereClause) {
      whereSql = `WHERE ${whereClause}`
    }

    // Combine the SELECT, FROM, JOIN, and WHERE clauses in the correct order
    return `${selectFromSql} ${joinSql} ${whereSql}`.trim()
  }
}

And its usage in MosaicClient:

// filter here is a Selection instance of table2
query (filter: string[] = []): Query {
  const table2Filter = filter.join(' AND ')
  const table1Filter = table1Selection.clauses.map(c => c.predicate).join(' AND ')

  return SuperQuery
    .select(fields)
    .from(table2)
    .join(table1, `table2.id = table1.source OR table2.id = table1.target`)
    .where(`${table1Filter} ${filter.length ? `${!table1Selection.clauses.length ? '' : ' AND'} ${table2Filter}` : ''}`)
  }

It generates a correct query and works on small tables (~30,000 rows), but on tables with 100,000+ rows, it does not work at all, no reaction on selection and no errors in the console. This is kind of strange, I suspect there may be an indexing issue caused by my string constructions because they can't be optimized by Mosaic. And I don't use .or() and .and() of Query here because they add quotes like "table2.id" for exact column name but this produces a broken query as there is no exact "table2.id" column in the context of comparing between few tables.

How can I solve this problem?

jheer commented 7 months ago

Mosaic does not currently have dedicated support for filtering across multiple tables. While optimizations such as query caching and query consolidation (grouping queries over the same table with the same groupby values) apply generally, our data cube indexes apply only to selections with clients that all target the same base table.

When feasible, I would recommend performing any joins as a pre-processing step: create a new table (or view) that joins the data together, and then use that as your base for visualization and interaction. The table or view could be created as part of your initial data loading; we do this in the unemployment map example.

Otherwise, it would be great to learn more about your use case needs. We'd like to provide richer support for cross-table linking, but we're still in a "requirements gathering" phase of design. (And I believe @domoritz and collaborators may also be looking into similar issues?)


As for this part: "It generates a correct query and works on small tables (~30,000 rows), but on tables with 100,000+ rows, it does not work at all, no reaction on selection and no errors in the console.", I don't know what is going on. Indeed it seems strange that the scale difference alone would fundamentally change things beyond performance (unless the joined results are so large they overrun browser memory limits). I'd start by looking at DuckDB logs to see if the queries are executing successfully, and of course you can also log your custom client's queryResult and queryError methods to see if a response is arriving.

domoritz commented 7 months ago

CC @willeppy

kolmakova commented 7 months ago

@jheer Thank you for the answer!

I'm working on a graph visualization. It has one raw data table and two additional tables: one for unique nodes and one for links between them. The graph relies on these generated nodes and links tables. The nodes and links have aggregated values calculated from the raw data, visualized in histograms for each value. If the aggregated value is by link, the histogram will select both nodes involved in the link within the selected value range. So selecting nodes in the graph by histogram ranges, crossfiltered by nodes and links, is a clear use case in the graph visualization context.

Crossfiltering between only nodes or only link values works super fast, I'll try to manage crossfiltering between nodes and links in the code, not SQL.

P.S. I've debugged my join queries on large tables and found that they execute successfully, but run very slowly - up to 2-10 minutes, even though the result set may only contain a few rows.