uwescience / myria

Myria is a scalable Analytics-as-a-Service platform based on relational algebra.
myria.cs.washington.edu

refactor the TupleWriter(s) for distributed use #705

Closed dhalperi closed 8 years ago

dhalperi commented 9 years ago

We use TupleWriter for format conversion of data during download. Currently, it works like this:

  1. All workers send all tuples to the master
  2. The master uses the TupleWriter to stringify everything.

The problem with this approach is that stringification is slow and all the burden is on the master.

My proposal is to expose "stringify one row" as a function on each worker. Then we can do something like this:

  1. each worker stringifies its rows, and sends the strings to the master
  2. the master combines all the rows, adding only the beginning and end tokens ([ and ] for JSON, or a column-header line and nothing for CSV/TSV), and writes the result to the output stream

This will better spread out the load and (dramatically?) improve download speeds.
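A minimal sketch of the split, assuming a hypothetical per-worker helper plus master-side framing (none of these names exist in the current TupleWriter API; they just illustrate where the work would move):

```java
import java.util.List;
import java.util.stream.Collectors;

/**
 * Hypothetical helper: workers call stringifyRowCsv() for every tuple they
 * hold, and the master only adds format framing before writing to the stream.
 */
public final class RowStringifier {

  /** Worker side: serialize one row's column values as a CSV line. */
  public static String stringifyRowCsv(final List<Object> columnValues) {
    return columnValues.stream().map(String::valueOf).collect(Collectors.joining(","));
  }

  /** Master side: wrap pre-stringified JSON objects in "[" ... "]". */
  public static String frameJson(final List<String> jsonRows) {
    return "[" + String.join(",", jsonRows) + "]";
  }

  /** Master side: prepend the CSV header; rows arrive already serialized. */
  public static String frameCsv(final String header, final List<String> csvRows) {
    return header + "\n" + String.join("\n", csvRows) + "\n";
  }
}
```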


dhalperi commented 9 years ago

Note that an advanced form of this idea would support ordered downloads. Suppose a table with columns x, y, z and the query SELECT * FROM table ORDER BY y ASC, z DESC. The plan might look like this:

on workers:

  1. Scan(table)
  2. OrderBy(y+, z-)
  3. Apply(y, z, s=stringify(x,y,z)) -- drop x
  4. Collect at master

at master:

  1. Merge based on y+,z- (depends on #348)
  2. Apply(s) -- drop y and z
  3. Finish stringify using refactored TupleWriter
  4. Write to output stream back to client.
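A rough sketch of the master-side merge step under a few assumptions: each worker ships an already-sorted stream of rows carrying the sort keys y, z plus the pre-stringified payload s, and #348 would supply the real merge operator. The class and field names here are illustrative only.

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Sketch of a k-way merge over worker streams sorted on (y ASC, z DESC). */
public final class OrderedMergeSketch {

  /** A row as shipped by a worker: the sort keys plus the pre-stringified payload s. */
  static final class Row {
    final long y;
    final long z;
    final String s;

    Row(final long y, final long z, final String s) {
      this.y = y;
      this.z = z;
      this.s = s;
    }
  }

  /** Wraps one sorted worker stream; ordered in the heap by its current head row. */
  static final class Head {
    Row row;
    final Iterator<Row> source;

    Head(final Iterator<Row> source) {
      this.source = source;
      this.row = source.next();
    }
  }

  /** Merge the already-sorted streams and emit only the payload column, in global order. */
  static void merge(final List<Iterator<Row>> workerStreams, final StringBuilder out) {
    final Comparator<Row> order =
        Comparator.<Row>comparingLong(r -> r.y)
            .thenComparing(Comparator.<Row>comparingLong(r -> r.z).reversed());
    final PriorityQueue<Head> heap = new PriorityQueue<>((a, b) -> order.compare(a.row, b.row));
    for (final Iterator<Row> stream : workerStreams) {
      if (stream.hasNext()) {
        heap.add(new Head(stream));
      }
    }
    while (!heap.isEmpty()) {
      final Head head = heap.poll();
      out.append(head.row.s).append('\n'); // Apply(s): drop y and z
      if (head.source.hasNext()) {
        head.row = head.source.next();
        heap.add(head);
      }
    }
  }
}
```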
dhalperi commented 9 years ago

FYI @bmyerz – this is the description of the prerequisite to parallel write-to-disk. Right now all serialization happens at the master; if we make it happen at the workers, then it will be easy to dump to HDFS in parallel.

senderista commented 8 years ago

I question the present relevance of this approach given its complexity and the fact that we are close to implementing parallel export to S3. When that feature is finished, datasets currently too large for our HTTP download API can be exported in parallel to S3 and downloaded in parallel from S3 (using the AWS CLI or similar tools). I don't see a good use case for streaming downloads of large datasets anyway.

@jingjingwang @jortiz16 any comments?

jortiz16 commented 8 years ago

I don't see a good use case at the moment either.

With respect to exporting to S3, we already have that capability in MyriaX, but we're not exposing it anywhere. If the user launches a Myria cluster with an S3 role, we should provide an easy way to export (perhaps through MyriaL).

senderista commented 8 years ago

S3 export capability is implemented in UriSink. To expose it to users, I guess we need to annotate DataSink with the JSON subtype Uri -> UriSink and extend MyriaL with the new URI-friendly EXPORT syntax (https://github.com/uwescience/raco/issues/496). Note that exporting to an S3 URI would require S3 credentials on the coordinator, which ideally should correspond to a least-privilege role (like the myria-cluster role under the Myria account).
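For illustration, a hedged sketch of the kind of Jackson subtype registration this would involve, assuming DataSink is deserialized polymorphically via a type-discriminator property; the property name "dataType", the field names, and the class bodies are assumptions, not the actual Myria definitions:

```java
import com.fasterxml.jackson.annotation.JsonSubTypes;
import com.fasterxml.jackson.annotation.JsonTypeInfo;

// Sketch only: the real DataSink/UriSink shapes differ; the discriminator
// property name "dataType" and the field names are assumptions.
@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, include = JsonTypeInfo.As.PROPERTY, property = "dataType")
@JsonSubTypes({
  @JsonSubTypes.Type(value = DataSink.UriSink.class, name = "Uri")
  // ...the existing sink subtypes would stay listed here as well
})
public interface DataSink {

  /**
   * Stub standing in for the existing UriSink; a payload like
   * {"dataType": "Uri", "uri": "s3://bucket/key"} would map to it.
   */
  final class UriSink implements DataSink {
    public String uri;
  }
}
```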

senderista commented 8 years ago

Re: the parallel export from Postgres to HDFS mentioned above, we could look at Sqoop.

senderista commented 8 years ago

Based on the discussion above, closing this issue until someone points out a reason to revive it.