Closed dhalperi closed 8 years ago
Note that an advanced form of this idea would support downloading data in order. Suppose a table with columns x, y, z, and I want `SELECT * FROM table ORDER BY y ASC, z DESC`. The plan might look like this:
on workers:
at master:
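One plausible reading of this plan, sketched below: each worker emits its partition already sorted by the same comparator, and the master does a streaming k-way merge so rows come out in global order without a full re-sort. `OrderedMergeSketch`, its method names, and the `{x, y, z}` row encoding are illustrative, not Myria operators:

```java
import java.util.*;

/** Sketch: workers pre-sort by (y ASC, z DESC); the master k-way merges. */
public class OrderedMergeSketch {

  // Comparator matching ORDER BY y ASC, z DESC; a row is int[] {x, y, z}.
  static final Comparator<int[]> ORDER =
      Comparator.<int[]>comparingInt(t -> t[1])
          .thenComparing(Comparator.<int[]>comparingInt(t -> t[2]).reversed());

  /** Would run at the master: merge per-worker sorted runs in one pass. */
  static List<int[]> merge(List<List<int[]>> runs) {
    // Heap entries are {runIndex, position}, keyed by the row they point at.
    PriorityQueue<int[]> heads = new PriorityQueue<>(
        Comparator.comparing((int[] e) -> runs.get(e[0]).get(e[1]), ORDER));
    for (int i = 0; i < runs.size(); i++) {
      if (!runs.get(i).isEmpty()) {
        heads.add(new int[] {i, 0});
      }
    }
    List<int[]> out = new ArrayList<>();
    while (!heads.isEmpty()) {
      int[] e = heads.poll();
      List<int[]> run = runs.get(e[0]);
      out.add(run.get(e[1]));
      if (e[1] + 1 < run.size()) {
        heads.add(new int[] {e[0], e[1] + 1});  // advance within the same run
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<List<int[]>> runs = List.of(
        List.of(new int[] {1, 1, 5}, new int[] {2, 2, 3}),   // worker 1, pre-sorted
        List.of(new int[] {3, 1, 9}, new int[] {4, 3, 0}));  // worker 2, pre-sorted
    for (int[] row : merge(runs)) {
      System.out.println(Arrays.toString(row));  // {3,1,9} first: z DESC breaks the y tie
    }
  }
}
```

The merge is O(n log k) for k workers and touches each row once, so the master's extra work stays small even when the workers also do the stringification.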
FYI @bmyerz – this is the description of the prerequisite to parallel write-to-disk. Right now all serialization happens at the master; if we make it happen at the workers, then it will be easy to do a parallel dump to HDFS.
I question the present relevance of this approach given its complexity and the fact that we are close to implementing parallel export to S3. When that feature is finished, datasets currently too large for our HTTP download API can be exported in parallel to S3 and downloaded in parallel from S3 (using the AWS CLI or similar tools). I don't see a good use case for streaming downloads of large datasets anyway.
@jingjingwang @jortiz16 any comments?
I don't see a good use case at the moment either.
With respect to exporting to S3, we already have that capability in MyriaX but we're not exposing it anywhere. If the user launches a myria cluster with an S3 role, we should enable some easy way to export (perhaps through MyriaL).
S3 export capability is implemented in `UriSink`. To expose it to users, I guess we need to annotate `DataSink` with the JSON subtype `Uri` -> `UriSink` and extend MyriaL with the new URI-friendly `EXPORT` syntax (https://github.com/uwescience/raco/issues/496). Note that exporting to an S3 URI would require S3 credentials on the coordinator, which ideally should correspond to a least-privilege role (like the `myria-cluster` role under the Myria account).
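Assuming Myria's REST layer deserializes `DataSink` subtypes via Jackson's polymorphic type handling, the subtype registration would look roughly like the annotation fragment below. The `dataType` discriminator property name is a guess, and the class body is elided:

```java
import com.fasterxml.jackson.annotation.JsonSubTypes;
import com.fasterxml.jackson.annotation.JsonTypeInfo;

// Assumption: DataSink is deserialized polymorphically by Jackson and the
// discriminator property is named "dataType"; verify against the actual class.
@JsonTypeInfo(use = JsonTypeInfo.Id.NAME,
              include = JsonTypeInfo.As.PROPERTY,
              property = "dataType")
@JsonSubTypes({
  @JsonSubTypes.Type(name = "Uri", value = UriSink.class)  // the new subtype
})
public abstract class DataSink { /* existing fields unchanged */ }
```

With that in place, a request body like `{"dataType": "Uri", ...}` would deserialize to a `UriSink`.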
Re: parallel export from Postgres to HDFS mentioned above, we could look at Sqoop.
Based on discussion above, closing this issue until someone points out a reason to revive it.
We use `TupleWriter` for format conversion of data during download. Currently it works like this: the workers ship tuples to the master, and the master uses `TupleWriter` to stringify everything. The problem with this approach is that stringification is slow and all the burden is on the master.
My proposal is to expose "stringify one row" as a function on each worker. Then we can do something like this: each worker stringifies its own rows and ships the resulting strings to the master; the master only adds the surrounding decoration (`[` and `]` for JSON, or column headers and nothing for CSV/TSV), and writes it to the stream. This will better spread out the load and (dramatically?) improve download speeds.
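The split of work in this proposal can be sketched for the CSV case as follows; `stringifyCsvRow` and `assembleCsv` are hypothetical names for illustration, not Myria API:

```java
import java.util.List;
import java.util.StringJoiner;

/** Sketch of the proposal: workers stringify rows, the master only decorates. */
public class RowSerializationSketch {

  /** Would run on each worker: convert one tuple into a CSV line. */
  static String stringifyCsvRow(Object[] tuple) {
    StringJoiner row = new StringJoiner(",");
    for (Object field : tuple) {
      row.add(String.valueOf(field));
    }
    return row.toString();
  }

  /** Would run at the master: emit the column headers, then pass the
   *  worker-produced lines through to the output stream unchanged. */
  static String assembleCsv(String header, List<String> workerRows) {
    StringBuilder out = new StringBuilder(header).append('\n');
    for (String line : workerRows) {
      out.append(line).append('\n');
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // Each line below would be produced by a different worker in parallel.
    List<String> rows = List.of(
        stringifyCsvRow(new Object[] {1, "a"}),
        stringifyCsvRow(new Object[] {2, "b"}));
    System.out.print(assembleCsv("x,y", rows));  // prints: x,y  1,a  2,b (one per line)
  }
}
```

For JSON the master would instead emit `[`, join the worker-stringified rows with commas, and close with `]`; in either case the per-row CPU cost moves off the master.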