stanford-futuredata / macrobase

MacroBase: A Search Engine for Fast Data
http://macrobase.stanford.edu/
Apache License 2.0

Design discussion: how to handle JSON #79

Closed. pbailis closed this issue 8 years ago.

pbailis commented 8 years ago

We're going to need to run over some JSON files in the near future. Say we have a bunch of files of the form:

userLikes1.json
userLikes2.json
userLikes3.json
userComments1.json
userComments2.json
userComments3.json
clicks1.json
clicks2.json
clicks3.json

Each type of file has a different schema; imagine we have thousands of these files, totaling hundreds of GB.

Topic for discussion: how do we want to parse these and formulate queries?

mamikonyana commented 8 years ago

  1. Are the indices related from one stream to another?
  2. Do we still assume we are doing SELECT *-style queries?
  3. Are these going to be loaded at the same time? (related to 2)

raininthesun commented 8 years ago

Why not transform the JSON into relational tables and load them into Postgres? Just wondering how complex the transformation is.

pbailis commented 8 years ago

The goal here is to do this in a somewhat generic way.

One thought is to have a JSONLoader class that allows you to declare "virtual tables" over each file (e.g., comments: userComments*.json) and then select metrics and attributes of the form table.feature.

There are a few related projects:

Jackson gives us cross-language support: https://github.com/FasterXML/jackson
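
To make this concrete, here is a minimal sketch of what such a JSONLoader-style "virtual table" could look like, assuming Jackson for parsing and one JSON array of flat records per file. The class and method names (JsonVirtualTable, selectMetric) are illustrative, not existing MacroBase code:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: maps a named "virtual table" onto a glob of JSON files
// and pulls a single field (a "table.feature" reference) out of every record.
public class JsonVirtualTable {
    private final ObjectMapper mapper = new ObjectMapper();
    private final String name;  // e.g., "comments"
    private final Path dir;     // directory holding the JSON files
    private final String glob;  // e.g., "userComments*.json"

    public JsonVirtualTable(String name, Path dir, String glob) {
        this.name = name;
        this.dir = dir;
        this.glob = glob;
    }

    public String name() {
        return name;
    }

    // Scan every matching file and collect one numeric field per record.
    // Assumes each file holds a single JSON array of flat records.
    public List<Double> selectMetric(String feature) throws IOException {
        List<Double> values = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, glob)) {
            for (Path file : files) {
                JsonNode records = mapper.readTree(file.toFile());
                for (JsonNode record : records) {
                    values.add(record.path(feature).asDouble());
                }
            }
        }
        return values;
    }
}

Usage would be something like new JsonVirtualTable("comments", Paths.get("/data"), "userComments*.json").selectMetric("numLikes"); the directory and the numLikes field are placeholders.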

pbailis commented 8 years ago

@mamikonyana

Are the indices related from one stream to another?

Likely, yes, you'll want to do joins.

Do we still assume we are doing SELECT *-style queries?

Ideally. Otherwise, how else do we want to do this?

Are these going to be loaded at the same time? (related to 2)

Depends. What's most expedient?

pbailis commented 8 years ago

@raininthesun

Why not transform the JSON into relational tables and load them into Postgres? Just wondering how complex the transformation is.

May be doable, but Postgres is very slow compared to reading from disk. I am curious: is there an easy way to take JSON and put it into an in-memory, JDBC-accessible DB instead of Postgres?
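
One option along those lines (just a sketch, not something that has been tried here): flatten each JSON record with Jackson and insert it into an in-memory H2 database over plain JDBC, then query it with ordinary SQL. The clicks schema and the userId/timestamp/url field names below are invented for illustration.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class JsonToH2 {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // "jdbc:h2:mem:..." gives an in-memory database that lives for the
        // duration of the connection.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:scratch")) {
            conn.createStatement().execute(
                    "CREATE TABLE clicks (user_id BIGINT, ts BIGINT, url VARCHAR)");
            PreparedStatement insert =
                    conn.prepareStatement("INSERT INTO clicks VALUES (?, ?, ?)");
            // Assumes the file is a single JSON array of flat records with
            // hypothetical userId/timestamp/url fields.
            for (JsonNode record : mapper.readTree(Paths.get("clicks1.json").toFile())) {
                insert.setLong(1, record.path("userId").asLong());
                insert.setLong(2, record.path("timestamp").asLong());
                insert.setString(3, record.path("url").asText());
                insert.executeUpdate();
            }
            // From here on we can run ordinary SQL (joins included) over conn.
        }
    }
}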

deepakn94 commented 8 years ago

We could load one file of each type at a time and consider that to be a "streaming" query, maybe?

pbailis commented 8 years ago

@deepakn94 The question is how to do this if we want to join, say, clicks with comments.

pbailis commented 8 years ago

Per @raininthesun's suggestion, it should be possible to use Postgres's built-in JSON support.

We could also use Postgres's foreign data wrapper support; someone already has one for JSON: http://pgxn.org/dist/json_fdw/

I believe DeepDive uses Greenplum, and we could too. (However, Greenplum doesn't have JSON support yet.)
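
For a sense of what the built-in JSON support mentioned above could look like from the Java side, here is a rough sketch that assumes a recent enough Postgres with jsonb, records loaded into a jsonb column named doc, and fields extracted with the ->> operator; the table, column, and connection details are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class PostgresJsonSketch {
    public static void main(String[] args) throws Exception {
        // Assumes a table created as: CREATE TABLE user_comments (doc jsonb);
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/macrobase", "postgres", "postgres")) {
            // ->> extracts a JSON field as text; cast to int for numeric use.
            ResultSet rs = conn.createStatement().executeQuery(
                    "SELECT doc->>'userId' AS user_id, "
                    + "(doc->>'numLikes')::int AS num_likes "
                    + "FROM user_comments");
            while (rs.next()) {
                System.out.println(rs.getString("user_id") + " " + rs.getInt("num_likes"));
            }
        }
    }
}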

pbailis commented 8 years ago

Postgres JSON loader: https://github.com/lukasmartinelli/pgfutter

deepakn94 commented 8 years ago

@pbailis I think this leads to a broader question of how we want to handle multiple incoming streams in general (especially if we might want to join them), right?

viveksjain commented 8 years ago

This is perhaps an aside, but have you tried looking into why Postgres is so slow? I would expect SELECT * queries, especially without joins, to be similar in speed to using our own disk cache. What's the approximate slowdown with Postgres? Is it because SELECT * is trying to load everything into memory and swapping, and we should be batching reads instead?

deepakn94 commented 8 years ago

The machine we're on has 250 GB of RAM, so I would be surprised if it's memory pressure plus swapping. I don't think Postgres is optimized for full range queries (it performs much better when index lookups are feasible); I suspect the layers of abstraction that keep things well organized really hurt performance here, but I could be wrong.

viveksjain commented 8 years ago

I wonder if our use of nested queries also kills performance.

pbailis commented 8 years ago

@viveksjain: I believe Postgres is smart enough to push the column selections into the subquery. My hypothesis for why our disk cache is fast is that we only scan over the columns we want. In contrast, when we select a particular set of columns from Postgres, it has to scan over all of the data in each row due to its row-oriented layout. With wide rows (i.e., many columns, as in CMT), this is expensive.

A columnar-oriented storage engine should help here.

postgres=# \timing
Timing is on.
postgres=# SELECT COUNT(dataset_id) FROM (SELECT * from mapmatch_history) AS bq;
  count   
----------
 ZZZ
(1 row)

Time: 16401.287 ms
postgres=# SELECT COUNT(dataset_id) FROM mapmatch_history;
  count   
----------
 ZZZ
(1 row)

Time: 16096.268 ms
postgres=# EXPLAIN SELECT dataset_id FROM (SELECT * from mapmatch_history) AS bq;
                                 QUERY PLAN                                  
-----------------------------------------------------------------------------
 Seq Scan on mapmatch_history  (cost=0.00..ZZZ rows=ZZZ width=4)
(1 row)

Time: 25.282 ms
postgres=#

pbailis commented 8 years ago

Going to experiment with Postgres this week.