SciDB Proof of Concept - Githubissues

uwescience / raco

Compilation and rule-based optimization framework for relational algebra. Raco is the language, optimization, and query translation layer for the Myria project.

Other

72 stars 19 forks source link

SciDB Proof of Concept #216

Open billhowe opened 10 years ago

billhowe commented 10 years ago

Preliminary:

[x] Get MIMIC deceased data on our cluster
[x] Configure access to cluster for MIT collaborators
[x] Stand up a separate instance of Myria for Big Dog experiment
[x] Stand up a SciDB instance on the same cluster
[ ] Password protect Big Dog Myria (BDM) instance and make it available
[ ] Load Mimic deceased data into BDM

V0.0: Simple pass-through test

[x] Stand up a SciDB instance
[x] Add Jake's Python library to the raco codebase
[x] Add a new logical operator to raco Exec
[x] Add a new algebra/language "Federated" with a single physical operator RunAQL(program)
[x] Implement the Federated language; implement RunAQL using Jake's library and the SciDB instance.
[x] Write a test: A python program that constructs an RA plan by hand with a single exec operator, then translate it to Federated, then run it.

V0.1 A minimal support for AQL in MyriaL.

[x] Add a minimal escape syntax in MyriaL for AQL "%AQL %"


[x] Test a MyriaL program that includes only an escaped AQL query
[ ] Add operators to Federated: RunMyria, MoveSciDBToMyria, MoveMyriaToSciDB.  Stub out the move operators.
[x] Add rules to convert RA fragments to a RunMyria operator, and Exec to RunAQL.
[x] Implement the operators in the Federated language: RunMyria should compile to json and issue the query, waiting for the result.  RunAQL just works as in V0.0.
[x] Test this with a simple MyriaL program -- no communication yet, just the fact that one MyriaL program can run stuff independently on both Myria and SciDB


The MyriaL program will look something like this (is this IPython? Can we use IPython?  Should we demo this in IPython):
Myria is MyriaConnector(masternode, "MyriaL")
%%Default=Myria

AQL is SciDB(hostname, "aql")

X = SELECT * FROM Table

Y(average:int) = %AQL
SELECT avg ( S.v1 )
  FROM Simple_Array S 
 WHERE S.v3 = ‘Odd’ 
 GROUP BY S.I;
%

[ ] Get this deployed so that it's visible from the web as an example.

V0.2: Python-level communication

[ ] Implement MoveMyriaToSciDB: Use Jake's library + shim to load data into SciDB if possible. 
[ ] Implement MoveSciDBToMyria: Pull data from SciDB and load it into Myria using the REST interface. The data will be passed through Python, so it will be assumed to be small.  That's ok. 
[ ] Extend MyriaL to include syntax for returning results.
[ ] Test this with a MyriaL program that may look something like this:

Z = SELECT count(*) as foo FROM SomeTable 
MoveMyriaToSciDB(Z, "Z")
-- maybe this is eventually automatic, if and when we parse AQL
X = %AQL
SELECT avg ( S.v1 ) as avg
  FROM Simple_Array S, Z       -- does this cross-product work?
 WHERE S.v3 < Z.foo 
%
MoveSciDBtoMyria(X, "X")
Y = SELECT count(*) FROM Table WHERE val < X.avg
The physical plan will look something like this:
RunMyria(plan, RunAQL("SELECT..."))
V0.3: Move large data around by allowing MyriaX to communicate directly with SciDB

[ ] Implement a Myria operator PullFromSciDB, maybe as a subclass of Scan.
[ ] Add a rule to generate it in some hardcoded special case
[ ] Implement a Myria operator PushToSciDB, maybe as a subclass of Store.  It may need to shell copy files around, then call a load interface.
(Three paths here: 1) Push over http (small data, but from python), 2) use the bulk loader (gin up binary files, shell them over to the SciDB, and call the loader), 3) Every myria worker builds one small part of the final array, then use method 2), 4) Maybe JDBC?
[ ] Add a rule to generate it in some hardcoded special case

V0.4: Compile a pure MyriaL program to Federated

[ ] Add a rule to Federated to take a branch of an RA plan and generate AQL.  (AQL is basically SQL, so this is easy). The rule can hardcode the decision on which branch to pick.  Ultimately  this will be an optimization decision.
Translate a branch of the RA plan into a RunAQL operator.  
[ ] Test this.


    
            
            
                7andrew7
                commented
                 10 years ago            
            
                @billhowe @mbalazin @jakevdp @ryanmaas @dhalperi
I need some help to further refine these ideas into a more concrete spec.  In particular, integrating this federated data model into the web UI is more complicated than I realized.
The current web UI (https://demo.myria.cs.washington.edu/editor), has panes for an editor, queries, and datasets.  All of these are hard-wired to a single Myria instance.  For the purposes of this demo, much of the UI will need to be revised and expanded:

The editor pane stays mostly the same; the query language itself will be expanded to support a mix of AQL and SQL in the manner described above.
The query pane must change to incorporate the federated data model.  What happens when a "query" consists of a mix of different languages targeting different backends?  Do we generate one query per language/backend or a single result? What is the "status" of such a query?  Do we need a new catalogue to keep track of query information that isn't maintained by one of the backend systems?
Likewise, the dataset pane must change to reflect the federated data model.  Do we generate one result per language+query or a single result per query? How does the system even learn about AQL results?  Do we allow for exporting AQL datasets as CSV, JSON, etc. Do we need a new catalogue to track data sets across the various backends?

Another option is to forego this complexity by designing a simpler web UI -- ipython notebook was mentioned as an example here.  But, there are still basic problems here that we haven't hashed out.  What happens when a user clicks go on a combined SQL+AQL query?  Where do the results go, and how does the user see them?
            
        
            
            
                billhowe
                commented
                 10 years ago            
            
                Here's my thought.
I think we only need or want V0 for now.
V0:

Assume that there will be a custom UI for the demo, one that we may not
build. So we're free to do whatever is easiest to show that things work.
When the query is finished, show a link that, when clicked, will call
SciDB to fetch the dataset via http/shim. Jake can show how to do this.

V1:

Figure out how to ask SciDB for its catalog. The catalog is in postgres,
I believe.  Jake may know, or Emad will almost certainly know.
Add another tab to the  UI called "SciDB Datasets" and rename to existing
tab "Myria Datasets"
Populate the SciDB datasets tab with the catalog on demand, and make each
dataset a live link to download it.

V2:

Some kind of federated catalog

On Mon, Jun 16, 2014 at 3:33 PM, Andrew Whitaker notifications@github.com
wrote:

@billhowe https://github.com/billhowe @mbalazin
https://github.com/mbalazin @jakevdp https://github.com/jakevdp
@ryanmaas https://github.com/ryanmaas @dhalperi
https://github.com/dhalperi
I need some help to further refine these ideas into a more concrete spec.
In particular, integrating this federated data model into the web UI is
more complicated than I realized.
The current web UI (https://demo.myria.cs.washington.edu/editor), has
panes for an editor, queries, and datasets. All of these are hard-wired to
a single Myria instance. For the purposes of this demo, much of the UI will
need to be revised and expanded:

The editor pane stays mostly the same; the query language itself
will be expanded to support a mix of AQL and SQL in the manner described
above.
The query pane must change to incorporate the federated data model.
What happens when a "query" consists of a mix of different languages
targeting different backends? Do we generate one query per language/backend
or a single result? What is the "status" of such a query? Do we need a new
catalogue to keep track of query information that isn't maintained by one
of the backend systems?
Likewise, the dataset pane must change to reflect the federated
data model. Do we generate one result per language+query or a single result
per query? How does the system even learn about AQL results? Do we allow
for exporting AQL datasets as CSV, JSON, etc. Do we need a new catalogue to
track data sets across the various backends?

Another option is to forego this complexity by designing a simpler web UI
-- ipython notebook was mentioned as an example here. But, there are still
basic problems here that we haven't hashed out. What happens when a user
clicks go on a combined SQL+AQL query? Where do the results go, and how
does the user see them?
—
Reply to this email directly or view it on GitHub
https://github.com/uwescience/raco/issues/216#issuecomment-46246622.
            
        
            
            
                7andrew7
                commented
                 10 years ago            
            
                Some notes from today's meeting with @7andrew7, @mbalazin, @ryanmaas, @jakevdp, and Emad.  Overall, this doesn't radically diverge from Bill's vision described above.
We proposed a very simple version 0.  The goal here is to run a single query that combines SQL with AQL in a very simplistic way.  In particular, the SQL and AQL don't interact at all.  As an example:
X = %sql select * from foo where foo.x > 30;
Y = %aql select * from bar where bar.y < 25;

store(X, andrew:myprogram:SqlOut);
store(Y, andrew:myprogram:AqlOut);
Deliverable: run this query from the web UI. Validate by hand that the results appear in the appropriate backend databases..  This is even dumber than the v0 proposed by @billhowe .
v1: Show AQL datasets and query results from the web UI.  Create a python API for querying scidb's catalogue and for fetching results.  Expose these results via the web UI.
Deliverable: Expose scidb results via a new tab on the web UI.  Strip away all tabs that are not relevant for our demo.
v2: Create an integrated example, where the myrial/sql query interacts with an aql query.  Andrew proposed a syntax like this:
X = %sql select * from foo where foo.x > 30;
Y = %aql select * from bar where bar.y < 25;

YR = AsRelation(Y, x:int, y:int, z:int);
J = %sql select X.x, YR.y where X.x=YR.y;
Store(J, andrew:myprogram:Output);
There are various implementation options here, and we didn't decide on a single right answer.  There is a research task to explore the set of export options that scidb supports.
            
        
            
            
                billhowe
                commented
                 10 years ago            
            
                Looks good.
Is %sql the same as MyriaL? Or, can there be a %myrial?   We should
demonstrate the fact that we already have a language.  I don't want it to
look like we hacked up everything on the fly.
Let's update the original checklist with this material, though -- the
checklist should reflect the current plan.
Specifically:

The new v0 is analogous to the original V0.1.  But we still need to start
with V0.0, which does not require any interaction on the web at all.  In
particular, we don't want to spend time changing the parser or the web UI
until we prove that we can run queries against SciDB from plain 'ol Python
(via raco).
We still want the "Federated" language inside raco, unless there is a
better suggestion on how specifically to implement this.   The
MyriaLanguage should not be polluted with AQL support.  (But I can imagine
other ways -- just need to be concrete, make a plan, and see it through)
The new v1 (returning results to the web) is indeed new, and important --
let's add it as V0.15 maybe?
The new v2 depends on having something like the existing V0.3, so we
can't skip that.
The new v2 is harder than the existing versions, as it involves a join
across systems instead of an explicit move.  Maybe replace the existing
V0.4 with this.  But I'd recommend doing the simplest possible thing before
doing an implicit cross-system join.

If this all makes sense, I can edit the checklist.
On Tue, Jun 17, 2014 at 5:51 PM, Andrew Whitaker notifications@github.com
wrote:

Some notes from today's meeting with @7andrew@, @mbalazin
https://github.com/mbalazin, @ryanmaas https://github.com/ryanmaas,
@jakevdp https://github.com/jakevdp, and Emad. Overall, this doesn't
radically diverge from Bill's vision described above.
We proposed a very simple version 0. The goal here is to run a single
query that combines SQL with AQL in a very simplistic way. In particular,
the SQL and AQL don't interact at all. As an example:
X = %sql select * from foo where foo.x > 30;
Y = %aql select * from bar where bar.y < 25;
store(X, andrew:myprogram:SqlOut);
store(Y, andrew:myprogram:AqlOut);
Deliverable: run this query from the web UI. Validate by hand that the
results appear in the appropriate backend databases.. This is even dumber
than the v0 proposed by @billhowe https://github.com/billhowe .
v1: Show AQL datasets and query results from the web UI. Create a python
API for querying scidb's catalogue and for fetching results. Expose these
results via the web UI.
Deliverable: Expose scidb results via a new tab on the web UI. Strip away
all tabs that are not relevant for our demo.
v2: Create an integrated example, where the myrial/sql query interacts
with an aql query. Andrew proposed a syntax like this:
X = %sql select * from foo where foo.x > 30;
Y = %aql select * from bar where bar.y < 25;
YR = AsRelation(Y, x:int, y:int, z:int);
J = %sql select X.x, YR.y where X.x=YR.y;
Store(J, andrew:myprogram:Output);
There are various implementation options here, and we didn't decide on a
single right answer. There is a research task to explore the set of export
API options that scidb supports.
—
Reply to this email directly or view it on GitHub
https://github.com/uwescience/raco/issues/216#issuecomment-46384267.
            
        
            
            
                7andrew7
                commented
                 10 years ago            
            
                Yes, sql is myrial.  We can support both tags/labels.

The new v0 is analogous to the original V0.1. But we still need to start

with V0.0, which does not require any interaction on the web at all. In
particular, we don't want to spend time changing the parser or the web UI
until we prove that we can run queries against SciDB from plain 'ol Python
(via raco).
Currently, raco doesn't connect to anything -- it's a language
parser/compiler.  For example, we don't connect to myria directly from
raco.    raco is a standalone entity, and I'm not wild about incorporating
back-end specific connection logic.




We still want the "Federated" language inside raco, unless there is a
better suggestion on how specifically to implement this. The
MyriaLanguage should not be polluted with AQL support. (But I can imagine
other ways -- just need to be concrete, make a plan, and see it through)


I'm vague on how this will actually work.  I was assuming that we wanted
some common algebra where relational operators and AQL operators / queries
co-exist.  Otherwise, how will we ever express a program that combines both
back ends?
To be more precise, what is the output of compiling a program with a
combination of sql/myrial and aql?  I was envisioning a single top-level
sequence or parallel operator.  The top-level operator would have a
combination of relational operators and scidb operators.  The latter is
just a wrapper around an AQL query since we don't currently know how to
parse this language.
Input program (combination of Myrial + AQL) ==> Common algebra
I suspect you have some other mental model in mind.

The new v2 is harder than the existing versions, as it involves a join

across systems instead of an explicit move. Maybe replace the existing
V0.4 with this. But I'd recommend doing the simplest possible thing before
doing an implicit cross-system join.


Fair enough.  Honestly, I'm not sure how to implement even the simplest
possible query that involves coordination across backends.  Where does the
coordination happen?  It can't happen on appengine given its timeout
restrictions.  So, there may need to be some sort of meta-coordinator that
takes a single query and decomposes it across the federation.  I'm worried
about lots of hidden complexity here.
-- andrew
            
        
            
            
                billhowe
                commented
                 10 years ago            
            
                Here's what I had in mind:
https://github.com/uwescience/raco/tree/bigdog
More clarifications below.
On Tue, Jun 17, 2014 at 10:23 PM, Andrew Whitaker notifications@github.com
wrote:

Yes, sql is myrial. We can support both tags/labels.

The new v0 is analogous to the original V0.1. But we still need to start

with V0.0, which does not require any interaction on the web at all. In
particular, we don't want to spend time changing the parser or the web UI
until we prove that we can run queries against SciDB from plain 'ol
Python
(via raco).
Currently, raco doesn't connect to anything -- it's a language
parser/compiler. For example, we don't connect to myria directly from
raco. raco is a standalone entity, and I'm not wild about incorporating
back-end specific connection logic.



I'm not sure where you're proposing this logic reside.  MyriaWeb only?
I think the Federated algebra may need to be aware of the systems it is
federating.
So why an algebra?  We want to reason about federated plans the same way we
reason about other plans.  There are operators to move data back and forth,
run queries, etc.



We still want the "Federated" language inside raco, unless there is a
better suggestion on how specifically to implement this. The
MyriaLanguage should not be polluted with AQL support. (But I can imagine
other ways -- just need to be concrete, make a plan, and see it through)


I'm vague on how this will actually work. I was assuming that we wanted
some common algebra where relational operators and AQL operators / queries
co-exist. Otherwise, how will we ever express a program that combines both
back ends?

https://github.com/uwescience/raco/tree/bigdog

To be more precise, what is the output of compiling a program with a
combination of sql/myrial and aql?

Doesn't necessarily need to be compiled to an ascii string -- just need
some object that we can execute.

I was envisioning a single top-level
sequence or parallel operator. The top-level operator would have a
combination of relational operators and scidb operators. The latter is
just a wrapper around an AQL query since we don't currently know how to
parse this language.

Exactly.  But we need to map the logical plan down to a physical plan
before running it.  That's what the Federated algebra is for -- a target
for this mapping.

Input program (combination of Myrial + AQL) ==> Common algebra
Eventually, but this is not on the immediate critical path.
I suspect you have some other mental model in mind.

I think we mostly agree.


The new v2 is harder than the existing versions, as it involves a join

across systems instead of an explicit move. Maybe replace the existing
V0.4 with this. But I'd recommend doing the simplest possible thing
before
doing an implicit cross-system join.


Fair enough. Honestly, I'm not sure how to implement even the simplest
possible query that involves coordination across backends. Where does the
coordination happen? It can't happen on appengine given its timeout
restrictions. So, there may need to be some sort of meta-coordinator that
takes a single query and decomposes it across the federation. I'm worried
about lots of hidden complexity here.

Don't worry about appengine restrictions. Let's get it working from Python
first; that's why V0.1 exists.
You can poll from the client in RESTful situations.
(However, it could be a problem is SciDB only allows synchronous
communication through its http interface.  @jakevdp or @esoroush ?)

-- andrew
—
Reply to this email directly or view it on GitHub
https://github.com/uwescience/raco/issues/216#issuecomment-46397175.
            
        
            
            
                7andrew7
                commented
                 10 years ago            
            
                Yes, I was proposing to embed the query dispatch logic in myria-web.  This is how we currently execute queries, and it seemed natural to keep the same basic structure.  Another alternative is to create a bigdog repository that comprises raco, scidb, myria-python, etc.  I would prefer this to forcing raco to take on a dependency for every backend.
Even if the logic resides within raco, I don't like the idea of overloading the optimization framework to execute queries.  Also, this approach will conflict with the outstanding changes of @dhalperi to the optimization framework in #248.  Dan changes the interface to optimize to accept a single top-level operator (sequence or parallel).
We basically agree on the raco changes, and I can incorporate those without too much effort.
            
        
            
            
                billhowe
                commented
                 10 years ago            
            
                On Wed, Jun 18, 2014 at 7:53 PM, Andrew Whitaker notifications@github.com
wrote:

Yes, I was proposing to embed the query dispatch logic in myria-web. This
is how we currently execute queries, and it seemed natural to keep the same
basic structure. Another alternative is to create a bigdog repository that
comprises raco, scidb, myria-python, etc. I would prefer this to forcing
raco to take on a dependency for every backend.
We generate code that only Myria/Grappa understands.  I figure we've
already eaten a dependency.
Even if the logic resides within raco, I don't like the idea of
overloading the optimization framework to execute queries. Also, this
approach will conflict with the outstanding changes of @dhalperi
https://github.com/dhalperi to the optimization framework in #248
https://github.com/uwescience/raco/pull/248. Dan changes the interface
to optimize to accept a single top-level operator (sequence or parallel).
I don't see a conflict; #248 seems great.

Somewhere we have to implement the cross-system Sequence operator that says
"Do X in Myria, then do Y in SciDB."
I think this logic should be in a new "Federated" algebra.
I don't think it makes sense to do this in MyriaWeb, because we want to run
queries sourced from somewhere other than the browser.
But in any case, let's get something working and then refactor.

We basically agree on the raco changes, and I can incorporate those
without too much effort.
—
Reply to this email directly or view it on GitHub
https://github.com/uwescience/raco/issues/216#issuecomment-46518610.
            
        
            
            
                7andrew7
                commented
                 10 years ago            
            
                As a compromise, I created a new repository to house the mimic demo logic.  Below is a script that parses a Myrial/AQL program and extracts the AQL fragments.  I'm working with @jakevdp and @esoroush to make this actually connect to scidb.
https://github.com/uwescience/mimic/blob/master/run.py
            
        
            
            
                7andrew7
                commented
                 10 years ago            
            
                @billhowe  @jakevdp  I hacked Bill's bigdog branch to actually connect to scidb; I think this completes the v0.0 test.  Caveats: the "AQL" language is actually "AFL", as per our previous discussion.
https://github.com/uwescience/raco/compare/bigdog
            
        
            
            
                billhowe
                commented
                 10 years ago            
            
                GREAT!!!!
I think you also have most of 0.1 done, as well?
(And one of the undone items in 0.1 probably shouldn't even be there -- the
"MoveToX" operators don't need to exist quite yet.)
I checked off the items  that I think are complete on v0.1.
I think the next step, after SIGMOD demos, is to get an example of AFL and
MyriaL, together, working from the web.   We can use Dan's separate
deployment onc we turn it back on (again, after SIGMOD)
On Sun, Jun 22, 2014 at 9:43 AM, Andrew Whitaker notifications@github.com
wrote:

@billhowe https://github.com/billhowe @jakevdp
https://github.com/jakevdp I hacked Bill's bigdog branch to actually
connect to scidb; I think this completes the v0.0 test. Caveats: the "AQL"
language is actually "AFL", as per our previous discussion.
https://github.com/uwescience/raco/compare/bigdog
—
Reply to this email directly or view it on GitHub
https://github.com/uwescience/raco/issues/216#issuecomment-46785910.
            
        
            
            
                dhalperi
                commented
                 10 years ago            
            
                Cool stuff!
            
        
    
    
            

    
        
            ©  Githubissues.
            Githubissues is a development platform for aggregating issues.