malloydata / malloy

Malloy is an experimental language for describing data relationships and transformations.
http://www.malloydata.dev
MIT License
1.99k stars 76 forks source link

Implement table sampling mechanisms to help limit query volume and testing #978

Open nryberg opened 1 year ago

nryberg commented 1 year ago

Using Malloy's shorthand, I'd like to be able to randomly select n records from a source table for the following reasons.

BigQuery Documentation

DuckDB Documentation

lloydtabb commented 1 year ago

So I've looked at this a bit. Currently BigQuery limits a sampled table to appearing only once in a query which creates all kinds of problems. The other problem is joining two sampled tables (because you will end up with missing records).

Our stance, generally, until we have the right underlying architecture to support this kind of thing is to give you access through SQL.

Here is how you might do it.

sql: airports_sql is {
  select: """
    select * from 'airports.parquet' USING SAMPLE 5
  """
}

source: foo is from_sql(airports_sql) {
  measure: c is count()
}

query: foo-> {aggregate: c}