uwescience / raco

Compilation and rule-based optimization framework for relational algebra. Raco is the language, optimization, and query translation layer for the Myria project.

SciDB Proof of Concept #216

Open billhowe opened 10 years ago

billhowe commented 10 years ago

Preliminary:

V0.0: Simple pass-through test

V0.1: Minimal support for AQL in MyriaL.

The MyriaL program will look something like this (is this IPython? Can we use IPython? Should we demo this in IPython?):

Myria is MyriaConnector(masternode, "MyriaL")
%%Default=Myria

AQL is SciDB(hostname, "aql")

X = SELECT * FROM Table

Y(average:int) = %AQL
SELECT avg ( S.v1 )
  FROM Simple_Array S 
 WHERE S.v3 = 'Odd'
 GROUP BY S.I;
%
  • [ ] Get this deployed so that it's visible from the web as an example.

V0.2: Python-level communication

  • [ ] Implement MoveMyriaToSciDB: Use Jake's library + shim to load data into SciDB if possible.
  • [ ] Implement MoveSciDBToMyria: Pull data from SciDB and load it into Myria using the REST interface. The data will be passed through Python, so it will be assumed to be small. That's ok.
  • [ ] Extend MyriaL to include syntax for returning results.
  • [ ] Test this with a MyriaL program that may look something like this:
Z = SELECT count(*) as foo FROM SomeTable 
MoveMyriaToSciDB(Z, "Z")
-- maybe this is eventually automatic, if and when we parse AQL
X = %AQL
SELECT avg ( S.v1 ) as avg
  FROM Simple_Array S, Z       -- does this cross-product work?
 WHERE S.v3 < Z.foo 
%
MoveSciDBToMyria(X, "X")
Y = SELECT count(*) FROM Table WHERE val < X.avg

The physical plan will look something like this:

RunMyria(plan, RunAQL("SELECT..."))
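
A rough Python sketch of how a nested plan like this could be evaluated: the inner `RunAQL` runs first and its result is handed to the enclosing `RunMyria`. The class layout, the `execute` method, and the `backends` dictionary are assumptions made for illustration, not raco's actual operators; the two backend functions are stand-ins that only record what they were asked to run.

```python
class RunAQL:
    """Leaf operator: wraps an opaque AQL string (no AQL parsing is attempted)."""
    def __init__(self, query):
        self.query = query

    def execute(self, backends):
        return backends["scidb"](self.query)


class RunMyria:
    """Runs a Myria plan after first executing the operators it depends on."""
    def __init__(self, plan, *inputs):
        self.plan = plan
        self.inputs = inputs

    def execute(self, backends):
        upstream = [op.execute(backends) for op in self.inputs]
        return backends["myria"](self.plan, upstream)


# Hypothetical stand-in backends: they log the request and return a dataset name.
log = []

def fake_scidb(query):
    log.append(("scidb", query))
    return "X"

def fake_myria(plan, upstream):
    log.append(("myria", plan, tuple(upstream)))
    return "Y"

physical = RunMyria("count-plan", RunAQL("SELECT avg(S.v1) FROM Simple_Array S"))
result = physical.execute({"scidb": fake_scidb, "myria": fake_myria})
```

Keeping the AQL as an uninterpreted string matches the idea that AQL is not parsed yet; only the ordering between the two systems is modeled.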

V0.3: Move large data around by allowing MyriaX to communicate directly with SciDB

  • [ ] Implement a Myria operator PullFromSciDB, maybe as a subclass of Scan.
  • [ ] Add a rule to generate it in some hardcoded special case
  • [ ] Implement a Myria operator PushToSciDB, maybe as a subclass of Store. It may need to shell copy files around, then call a load interface. Four possible paths here: 1) push over HTTP (small data, but from Python); 2) use the bulk loader (gin up binary files, shell them over to SciDB, and call the loader); 3) have every Myria worker build one small part of the final array, then use method 2; 4) maybe JDBC?
  • [ ] Add a rule to generate it in some hardcoded special case

V0.4: Compile a pure MyriaL program to Federated

  • [ ] Add a rule to Federated to take a branch of an RA plan and generate AQL. (AQL is basically SQL, so this is easy). The rule can hardcode the decision on which branch to pick. Ultimately this will be an optimization decision. Translate a branch of the RA plan into a RunAQL operator.
  • [ ] Test this.
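
A minimal sketch of what the V0.4 translation step might look like, assuming a toy plan made of `Project`, `Select`, and `Scan` nodes. These dataclasses and the `to_aql` function are hypothetical and are not raco's operator classes or rule interface; whether the emitted text is valid AQL rests on the "AQL is basically SQL" assumption above.

```python
from dataclasses import dataclass


@dataclass
class Scan:
    array: str


@dataclass
class Select:
    condition: str
    child: object


@dataclass
class Project:
    columns: list
    child: object


def to_aql(op):
    """Translate an optional Project over an optional Select over a Scan.

    Only this fixed shape is handled; any other plan raises ValueError.
    """
    columns, condition = ["*"], None
    if isinstance(op, Project):
        columns, op = op.columns, op.child
    if isinstance(op, Select):
        condition, op = op.condition, op.child
    if not isinstance(op, Scan):
        raise ValueError("unsupported plan shape")
    text = "SELECT " + ", ".join(columns) + " FROM " + op.array
    if condition is not None:
        text += " WHERE " + condition
    return text + ";"


branch = Project(["v1"], Select("v3 = 'Odd'", Scan("Simple_Array")))
aql = to_aql(branch)
```

The resulting string could then be wrapped in a RunAQL operator; choosing which branch to translate is left hardcoded, as the checklist suggests.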
7andrew7 commented 10 years ago

@billhowe @mbalazin @jakevdp @ryanmaas @dhalperi I need some help to further refine these ideas into a more concrete spec. In particular, integrating this federated data model into the web UI is more complicated than I realized.

The current web UI (https://demo.myria.cs.washington.edu/editor), has panes for an editor, queries, and datasets. All of these are hard-wired to a single Myria instance. For the purposes of this demo, much of the UI will need to be revised and expanded:

  1. The editor pane stays mostly the same; the query language itself will be expanded to support a mix of AQL and SQL in the manner described above.
  2. The query pane must change to incorporate the federated data model. What happens when a "query" consists of a mix of different languages targeting different backends? Do we generate one query per language/backend or a single result? What is the "status" of such a query? Do we need a new catalogue to keep track of query information that isn't maintained by one of the backend systems?
  3. Likewise, the dataset pane must change to reflect the federated data model. Do we generate one result per language+query or a single result per query? How does the system even learn about AQL results? Do we allow for exporting AQL datasets as CSV, JSON, etc.? Do we need a new catalogue to track data sets across the various backends?

Another option is to forgo this complexity by designing a simpler web UI -- ipython notebook was mentioned as an example here. But there are still basic problems here that we haven't hashed out. What happens when a user clicks "go" on a combined SQL+AQL query? Where do the results go, and how does the user see them?

billhowe commented 10 years ago

Here's my thought.

I think we only need or want V0 for now.

V0:

  • Assume that there will be a custom UI for the demo, one that we may not build. So we're free to do whatever is easiest to show that things work.
  • When the query is finished, show a link that, when clicked, will call SciDB to fetch the dataset via http/shim. Jake can show how to do this.

V1:

  • Figure out how to ask SciDB for its catalog. The catalog is in postgres, I believe. Jake may know, or Emad will almost certainly know.
  • Add another tab to the UI called "SciDB Datasets" and rename the existing tab to "Myria Datasets".
  • Populate the SciDB datasets tab with the catalog on demand, and make each dataset a live link to download it.

V2:

  • Some kind of federated catalog


7andrew7 commented 10 years ago

Some notes from today's meeting with @7andrew7, @mbalazin, @ryanmaas, @jakevdp, and Emad. Overall, this doesn't radically diverge from Bill's vision described above.

We proposed a very simple version 0. The goal here is to run a single query that combines SQL with AQL in a very simplistic way. In particular, the SQL and AQL don't interact at all. As an example:

X = %sql select * from foo where foo.x > 30;
Y = %aql select * from bar where bar.y < 25;

store(X, andrew:myprogram:SqlOut);
store(Y, andrew:myprogram:AqlOut);

Deliverable: run this query from the web UI. Validate by hand that the results appear in the appropriate backend databases. This is even dumber than the v0 proposed by @billhowe.

v1: Show AQL datasets and query results from the web UI. Create a python API for querying scidb's catalogue and for fetching results. Expose these results via the web UI.

Deliverable: Expose scidb results via a new tab on the web UI. Strip away all tabs that are not relevant for our demo.

v2: Create an integrated example, where the myrial/sql query interacts with an aql query. Andrew proposed a syntax like this:

X = %sql select * from foo where foo.x > 30;
Y = %aql select * from bar where bar.y < 25;

YR = AsRelation(Y, x:int, y:int, z:int);
J = %sql select X.x, YR.y from X, YR where X.x=YR.y;
Store(J, andrew:myprogram:Output);

There are various implementation options here, and we didn't decide on a single right answer. There is a research task to explore the set of export options that scidb supports.
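
One of those options, sketched in Python purely as an illustration: if the AQL result is small enough to pass through Python as rows of text, an `AsRelation`-style step could attach column names and types from a declared schema. The function name `as_relation`, the `"name:type"` schema strings, and the supported type names are all assumptions; nothing here reflects a decided design or a real SciDB export format.

```python
def as_relation(rows, schema):
    """Attach names and types to raw rows.

    `schema` is a list of 'name:type' strings, e.g. ['x:int', 'y:int'].
    """
    casts = {"int": int, "float": float, "string": str}
    columns = []
    for entry in schema:
        name, type_name = entry.split(":")
        columns.append((name, casts[type_name]))

    relation = []
    for row in rows:
        if len(row) != len(columns):
            raise ValueError("row width does not match schema")
        relation.append(
            {name: cast(value) for (name, cast), value in zip(columns, row)}
        )
    return relation


# Hypothetical rows as they might arrive from a text export.
raw = [("1", "2", "3"), ("4", "5", "6")]
yr = as_relation(raw, ["x:int", "y:int", "z:int"])
```

A version that moves large arrays would need a different path than this one, since everything here is materialized in a single Python process.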

billhowe commented 10 years ago

Looks good.

Is %sql the same as MyriaL? Or, can there be a %myrial? We should demonstrate the fact that we already have a language. I don't want it to look like we hacked up everything on the fly.

Let's update the original checklist with this material, though -- the checklist should reflect the current plan.

Specifically:

  • The new v0 is analogous to the original V0.1. But we still need to start with V0.0, which does not require any interaction on the web at all. In particular, we don't want to spend time changing the parser or the web UI until we prove that we can run queries against SciDB from plain ol' Python (via raco).
  • We still want the "Federated" language inside raco, unless there is a better suggestion on how specifically to implement this. The MyriaLanguage should not be polluted with AQL support. (But I can imagine other ways -- just need to be concrete, make a plan, and see it through)
  • The new v1 (returning results to the web) is indeed new, and important -- let's add it as V0.15 maybe?
  • The new v2 depends on having something like the existing V0.3, so we can't skip that.
  • The new v2 is harder than the existing versions, as it involves a join across systems instead of an explicit move. Maybe replace the existing V0.4 with this. But I'd recommend doing the simplest possible thing before doing an implicit cross-system join.

If this all makes sense, I can edit the checklist.


7andrew7 commented 10 years ago

Yes, sql is myrial. We can support both tags/labels.

  • The new v0 is analogous to the original V0.1. But we still need to start with V0.0, which does not require any interaction on the web at all. In particular, we don't want to spend time changing the parser or the web UI until we prove that we can run queries against SciDB from plain ol' Python (via raco).

    Currently, raco doesn't connect to anything -- it's a language parser/compiler. For example, we don't connect to myria directly from raco. raco is a standalone entity, and I'm not wild about incorporating back-end specific connection logic.

  • We still want the "Federated" language inside raco, unless there is a better suggestion on how specifically to implement this. The MyriaLanguage should not be polluted with AQL support. (But I can imagine other ways -- just need to be concrete, make a plan, and see it through)

I'm vague on how this will actually work. I was assuming that we wanted some common algebra where relational operators and AQL operators / queries co-exist. Otherwise, how will we ever express a program that combines both back ends?

To be more precise, what is the output of compiling a program with a combination of sql/myrial and aql? I was envisioning a single top-level sequence or parallel operator. The top-level operator would have a combination of relational operators and scidb operators. The latter is just a wrapper around an AQL query since we don't currently know how to parse this language.

Input program (combination of Myrial + AQL) ==> Common algebra

I suspect you have some other mental model in mind.

  • The new v2 is harder than the existing versions, as it involves a join across systems instead of an explicit move. Maybe replace the existing V0.4 with this. But I'd recommend doing the simplest possible thing before doing an implicit cross-system join.

Fair enough. Honestly, I'm not sure how to implement even the simplest possible query that involves coordination across backends. Where does the coordination happen? It can't happen on appengine given its timeout restrictions. So, there may need to be some sort of meta-coordinator that takes a single query and decomposes it across the federation. I'm worried about lots of hidden complexity here.

-- andrew

billhowe commented 10 years ago

Here's what I had in mind: https://github.com/uwescience/raco/tree/bigdog

More clarifications below.

On Tue, Jun 17, 2014 at 10:23 PM, Andrew Whitaker notifications@github.com wrote:

Yes, sql is myrial. We can support both tags/labels.

  • The new v0 is analogous to the original V0.1. But we still need to start with V0.0, which does not require any interaction on the web at all. In particular, we don't want to spend time changing the parser or the web UI until we prove that we can run queries against SciDB from plain ol' Python (via raco).

    Currently, raco doesn't connect to anything -- it's a language parser/compiler. For example, we don't connect to myria directly from raco. raco is a standalone entity, and I'm not wild about incorporating back-end specific connection logic.

I'm not sure where you're proposing this logic reside. MyriaWeb only?

I think the Federated algebra may need to be aware of the systems it is federating.

So why an algebra? We want to reason about federated plans the same way we reason about other plans. There are operators to move data back and forth, run queries, etc.

  • We still want the "Federated" language inside raco, unless there is a better suggestion on how specifically to implement this. The MyriaLanguage should not be polluted with AQL support. (But I can imagine other ways -- just need to be concrete, make a plan, and see it through)

I'm vague on how this will actually work. I was assuming that we wanted some common algebra where relational operators and AQL operators / queries co-exist. Otherwise, how will we ever express a program that combines both back ends?

https://github.com/uwescience/raco/tree/bigdog

To be more precise, what is the output of compiling a program with a combination of sql/myrial and aql?

Doesn't necessarily need to be compiled to an ascii string -- just need some object that we can execute.

I was envisioning a single top-level sequence or parallel operator. The top-level operator would have a combination of relational operators and scidb operators. The latter is just a wrapper around an AQL query since we don't currently know how to parse this language.

Exactly. But we need to map the logical plan down to a physical plan before running it. That's what the Federated algebra is for -- a target for this mapping.

Input program (combination of Myrial + AQL) ==> Common algebra

Eventually, but this is not on the immediate critical path.

I suspect you have some other mental model in mind.

I think we mostly agree.

  • The new v2 is harder than the existing versions, as it involves a join across systems instead of an explicit move. Maybe replace the existing V0.4 with this. But I'd recommend doing the simplest possible thing before doing an implicit cross-system join.

Fair enough. Honestly, I'm not sure how to implement even the simplest possible query that involves coordination across backends. Where does the coordination happen? It can't happen on appengine given its timeout restrictions. So, there may need to be some sort of meta-coordinator that takes a single query and decomposes it across the federation. I'm worried about lots of hidden complexity here.

Don't worry about appengine restrictions. Let's get it working from Python first; that's why V0.1 exists.

You can poll from the client in RESTful situations.

(However, it could be a problem if SciDB only allows synchronous communication through its http interface. @jakevdp or @esoroush?)
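
For illustration, client-side polling could be as simple as the loop below. This is a sketch only: `get_status` stands in for whatever REST call reports query state, and the status strings (`"ACCEPTED"`, `"RUNNING"`, `"SUCCESS"`, `"ERROR"`) are assumed names, not taken from the Myria or SciDB APIs.

```python
import time


def wait_for(get_status, interval=0.0, max_polls=100, sleep=time.sleep):
    """Poll get_status() until it reports a terminal state or polls run out."""
    for _ in range(max_polls):
        status = get_status()
        if status in ("SUCCESS", "ERROR"):
            return status
        sleep(interval)
    raise TimeoutError("query did not finish within the polling budget")


# Hypothetical stand-in for a REST status endpoint.
responses = iter(["ACCEPTED", "RUNNING", "RUNNING", "SUCCESS"])
final = wait_for(lambda: next(responses))
```

Because the client drives the loop, no single server request has to outlive the query, which is how polling sidesteps a request timeout on the web tier.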


7andrew7 commented 10 years ago

Yes, I was proposing to embed the query dispatch logic in myria-web. This is how we currently execute queries, and it seemed natural to keep the same basic structure. Another alternative is to create a bigdog repository that comprises raco, scidb, myria-python, etc. I would prefer this to forcing raco to take on a dependency for every backend.

Even if the logic resides within raco, I don't like the idea of overloading the optimization framework to execute queries. Also, this approach will conflict with the outstanding changes of @dhalperi to the optimization framework in #248. Dan changes the interface to optimize to accept a single top-level operator (sequence or parallel).

We basically agree on the raco changes, and I can incorporate those without too much effort.

billhowe commented 10 years ago

On Wed, Jun 18, 2014 at 7:53 PM, Andrew Whitaker notifications@github.com wrote:

Yes, I was proposing to embed the query dispatch logic in myria-web. This is how we currently execute queries, and it seemed natural to keep the same basic structure. Another alternative is to create a bigdog repository that comprises raco, scidb, myria-python, etc. I would prefer this to forcing raco to take on a dependency for every backend.

We generate code that only Myria/Grappa understands. I figure we've already eaten a dependency.

Even if the logic resides within raco, I don't like the idea of overloading the optimization framework to execute queries. Also, this approach will conflict with the outstanding changes of @dhalperi to the optimization framework in #248 (https://github.com/uwescience/raco/pull/248). Dan changes the interface to optimize to accept a single top-level operator (sequence or parallel).

I don't see a conflict; #248 seems great.

Somewhere we have to implement the cross-system Sequence operator that says "Do X in Myria, then do Y in SciDB."

I think this logic should be in a new "Federated" algebra.
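
A rough sketch of such a cross-system Sequence, written in Python with made-up names: `Step`, `Sequence`, and the `dispatch` mapping are hypothetical and not part of raco or the bigdog branch. Each step names the system it targets, and the sequence stops at the first step that reports failure.

```python
class Step:
    """One unit of work bound to a named backend system."""
    def __init__(self, system, work):
        self.system = system
        self.work = work


class Sequence:
    """Runs steps strictly in order, stopping at the first failure."""
    def __init__(self, *steps):
        self.steps = steps

    def run(self, dispatch):
        done = []
        for step in self.steps:
            ok = dispatch[step.system](step.work)
            if not ok:
                return done, step  # completed steps, plus the step that failed
            done.append(step)
        return done, None


# Hypothetical runners that record calls instead of contacting a real system.
calls = []

def run_in(system):
    def runner(work):
        calls.append((system, work))
        return True
    return runner

plan = Sequence(Step("myria", "X"), Step("scidb", "Y"))
done, failed = plan.run({"myria": run_in("myria"), "scidb": run_in("scidb")})
```

Since the operator only needs a mapping from system name to a callable, the same plan object could be driven from a script or from the web tier.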

I don't think it makes sense to do this in MyriaWeb, because we want to run queries sourced from somewhere other than the browser.

But in any case, let's get something working and then refactor.

We basically agree on the raco changes, and I can incorporate those without too much effort.


7andrew7 commented 10 years ago

As a compromise, I created a new repository to house the mimic demo logic. Below is a script that parses a Myrial/AQL program and extracts the AQL fragments. I'm working with @jakevdp and @esoroush to make this actually connect to scidb.

https://github.com/uwescience/mimic/blob/master/run.py
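
For readers without access to that script, the extraction step can be pictured roughly as follows. This is an independent sketch, not the contents of run.py: it assumes the v0 syntax from earlier in the thread (`NAME = %lang query;`), and the regular expression and the function name `split_fragments` are illustrative choices.

```python
import re

# Assumed statement shape:  NAME = %lang query text ;
FRAGMENT = re.compile(r"(\w+)\s*=\s*%(\w+)\s+(.*?);", re.DOTALL)


def split_fragments(program):
    """Return {language: [(target, query), ...]} for each tagged statement."""
    by_lang = {}
    for target, lang, query in FRAGMENT.findall(program):
        normalized = " ".join(query.split())  # collapse newlines and extra spaces
        by_lang.setdefault(lang.lower(), []).append((target, normalized))
    return by_lang


program = """
X = %sql select * from foo where foo.x > 30;
Y = %aql select * from bar where bar.y < 25;

store(X, andrew:myprogram:SqlOut);
store(Y, andrew:myprogram:AqlOut);
"""

fragments = split_fragments(program)
```

A query that itself contains a semicolon would break this pattern, so a real parser would need a proper delimiter rule.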

7andrew7 commented 10 years ago

@billhowe @jakevdp I hacked Bill's bigdog branch to actually connect to scidb; I think this completes the v0.0 test. Caveat: the "AQL" language is actually "AFL", as per our previous discussion.

https://github.com/uwescience/raco/compare/bigdog

billhowe commented 10 years ago

GREAT!!!!

I think you also have most of 0.1 done, as well?

(And one of the undone items in 0.1 probably shouldn't even be there -- the "MoveToX" operators don't need to exist quite yet.)

I checked off the items that I think are complete on v0.1.

I think the next step, after SIGMOD demos, is to get an example of AFL and MyriaL, together, working from the web. We can use Dan's separate deployment once we turn it back on (again, after SIGMOD).


dhalperi commented 10 years ago

Cool stuff!