Closed: pbailis closed this issue 8 years ago.
The goal here is to run queries over these JSON files in a somewhat generic way.
One thought is to have a `JSONLoader` class that allows you to declare "virtual tables" over each file (e.g., `comments: userComments*.json`) and then select metrics and attributes of the form `table.feature`.
There are a few related projects:
Jackson gives us cross-language support: https://github.com/FasterXML/jackson
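To make the virtual-table idea concrete, here is a rough sketch of what a `JSONLoader` could look like using Jackson. The `registerTable`/`select` methods, the single-directory layout, and the assumption that each file holds a JSON array of flat objects are all illustrative, not an existing API.

```java
// Illustrative sketch of the JSONLoader idea above: map "virtual table" names to
// file globs and pull out table.feature values with Jackson. Method names and the
// flat-array file format are assumptions, not an existing API.
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class JSONLoader {
    private final ObjectMapper mapper = new ObjectMapper();
    // virtual table name -> file glob, e.g. "comments" -> "userComments*.json"
    private final Map<String, String> tables = new HashMap<>();

    public void registerTable(String tableName, String fileGlob) {
        tables.put(tableName, fileGlob);
    }

    // Returns every value of "table.feature" across all matching files,
    // assuming each file contains a JSON array of flat objects.
    public List<JsonNode> select(String qualifiedFeature, Path dir) throws IOException {
        String[] parts = qualifiedFeature.split("\\.", 2);
        String table = parts[0], feature = parts[1];
        List<JsonNode> values = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, tables.get(table))) {
            for (Path file : files) {
                for (JsonNode rec : mapper.readTree(file.toFile())) {
                    values.add(rec.get(feature));
                }
            }
        }
        return values;
    }

    public static void main(String[] args) throws IOException {
        JSONLoader loader = new JSONLoader();
        loader.registerTable("comments", "userComments*.json");
        // e.g., pull the "score" attribute from every comment record
        System.out.println(loader.select("comments.score", Paths.get(".")));
    }
}
```

Joins across virtual tables (e.g., comments with clicks) would still need to be layered on top of something like this.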
@mamikonyana:

> 1. are the indices related from one stream to another?

Likely, yes; you'll want to do joins.

> 2. do we still assume we are doing `SELECT *`-like queries?

Ideally. Otherwise, how else do we want to do this?

> 3. are these going to be loaded at the same time? (related to 2)

Depends. What's most expedient?
@raininthesun:

> Why not transform JSON into relational tables, and load them into Postgres? Just wondering how complex the transformation is.

It may be doable. Postgres is very slow compared to reading from disk. I am curious: is there an easy way to take JSON and put it into an in-memory JDBC-like DB instead of Postgres?
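One possibility along those lines, sketched below under some assumptions: use an embedded JDBC database such as H2 in in-memory mode, parse the JSON with Jackson, and bulk-insert the records. The `comments` table, its columns, and the file name are invented for illustration.

```java
// Sketch: parse JSON records with Jackson and load them into an in-memory H2
// database over plain JDBC. The "comments" schema and field names are made up;
// the point is only that no external Postgres server is needed.
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.File;
import java.sql.*;

public class JsonToH2 {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        JsonNode records = mapper.readTree(new File("userComments0.json"));

        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:scratch")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE comments (user_id INT, score DOUBLE, body VARCHAR)");
            }
            try (PreparedStatement ps =
                         conn.prepareStatement("INSERT INTO comments VALUES (?, ?, ?)")) {
                for (JsonNode rec : records) {
                    ps.setInt(1, rec.get("userId").asInt());
                    ps.setDouble(2, rec.get("score").asDouble());
                    ps.setString(3, rec.get("body").asText());
                    ps.addBatch();
                }
                ps.executeBatch();
            }
            // Ordinary SQL against the in-memory table.
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM comments")) {
                rs.next();
                System.out.println("loaded " + rs.getLong(1) + " comments");
            }
        }
    }
}
```

HSQLDB or SQLite (via a JDBC driver) would work much the same way; the appeal is that everything stays in-process and in memory.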
We could load one of each type of file at a time, and consider that to be a "streaming" query, maybe?
@deepakn94 The question is how to do this if we want to join, say, clicks with comments.
Per @raininthesun's suggestion, it should be possible to use Postgres's built-in JSON support.
We could also use Postgres's foreign data wrapper support; someone already has one for JSON: http://pgxn.org/dist/json_fdw/
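For the built-in route, a rough sketch of what this could look like from Java over JDBC: store each document in a `jsonb` column and pull fields out with Postgres's `->>` operator. The connection string, table, and field names are placeholders.

```java
// Sketch of Postgres's built-in jsonb support over JDBC: load raw JSON documents
// into a single jsonb column and query fields with the ->> operator.
// Connection details, table, and field names are placeholders.
import java.sql.*;

public class PgJsonbExample {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://localhost:5432/scratch";
        try (Connection conn = DriverManager.getConnection(url, "postgres", "");
             Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE IF NOT EXISTS raw_comments (doc jsonb)");
            st.execute("INSERT INTO raw_comments VALUES "
                    + "('{\"userId\": 1, \"score\": 4.5}'::jsonb)");
            // ->> extracts a field as text; the cast gives it a usable numeric type.
            try (ResultSet rs = st.executeQuery(
                    "SELECT AVG((doc->>'score')::float) FROM raw_comments")) {
                rs.next();
                System.out.println("avg score = " + rs.getDouble(1));
            }
        }
    }
}
```

As I understand it, json_fdw instead exposes a JSON file as a foreign table, so the query side would look similar without an explicit load step.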
I believe DeepDive uses Greenplum, and we could too. (However, Greenplum doesn't have JSON support yet.)
Postgres JSON loader: https://github.com/lukasmartinelli/pgfutter
@pbailis I think this leads to a broader question of how we want to handle multiple incoming streams in general (especially if you could possibly join them), right?
This is perhaps an aside, but have you tried to look into why Postgres is so slow? I would expect `SELECT *` queries, especially without joins, to be similar in speed to using our own disk cache. What's the approximate slowdown with Postgres? Is it because `SELECT *` is trying to load everything into memory and swapping, and we should be batching reads instead?
The machine we're on has 250 GB of RAM, so I would be surprised if it's because of memory + swapping. I don't think Postgres is optimized for full range queries (it performs much better when index lookups are feasible): I think the layers of abstraction that help to keep things well organized completely screw the performance, but I could be wrong here.
I wonder if our use of nested queries also kills performance.
@viveksjain: I believe Postgres is smart enough to push the column selections into the subquery. My hypothesis for why disk caching is fast is that we're only scanning over the columns that we want. In contrast, when we select a particular set of columns from Postgres, it has to scan over all of the data in each row due to its row-oriented layout. With wide rows (i.e., many columns, as in CMT), this is expensive.
A column-oriented storage engine should help here.
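A toy Java sketch of that layout argument, with made-up row and column counts: summing a single field touches scattered, mostly-unwanted data in a row-wise layout, but only one contiguous array in a column-wise one.

```java
// Toy illustration of row-oriented vs. column-oriented layout (sizes are arbitrary):
// reading one column out of a wide table touches every record in the row layout,
// but only one contiguous array in the columnar layout.
public class LayoutSketch {
    static final int ROWS = 100_000;
    static final int COLS = 100; // "wide" rows, as in CMT

    public static void main(String[] args) {
        // Row store: each record keeps all 100 fields together.
        double[][] rowStore = new double[ROWS][COLS];
        // Column store: each field is its own contiguous array.
        double[][] colStore = new double[COLS][ROWS];

        double sum = 0;
        // Row layout: we hop from record to record, pulling in a cache line of
        // mostly-unwanted fields just to read field 0 of each row.
        for (double[] rec : rowStore) {
            sum += rec[0];
        }

        // Column layout: the same aggregate is one sequential scan of one array.
        for (double v : colStore[0]) {
            sum += v;
        }
        System.out.println(sum);
    }
}
```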
```
postgres=# \timing
Timing is on.
postgres=# SELECT COUNT(dataset_id) FROM (SELECT * from mapmatch_history) AS bq;
  count
----------
      ZZZ
(1 row)

Time: 16401.287 ms
postgres=# SELECT COUNT(dataset_id) FROM mapmatch_history;
  count
----------
      ZZZ
(1 row)

Time: 16096.268 ms
postgres=# EXPLAIN SELECT dataset_id FROM (SELECT * from mapmatch_history) AS bq;
                                  QUERY PLAN
-----------------------------------------------------------------------------
 Seq Scan on mapmatch_history  (cost=0.00..ZZZ rows=ZZZ width=4)
(1 row)

Time: 25.282 ms
postgres=#
```
Going to experiment with Postgres this week.
We're going to need to run over some JSON files in the near future. Say we have a bunch of files of the form `userComments*.json` (and similar files for other record types).
Each type of file has a different schema, and imagine we have 1000s of these files, totaling 100s of GB.
Topic for discussion: how do we want to parse these and formulate queries?