Open jieguangzhou opened 3 months ago
How does the download task get handled? Currently we have this job which looks for URIs and downloads them, before any other jobs are triggered. https://github.com/SuperDuperDB/superduperdb/blob/fc9184c41920a04e4370333d710eb6c10bc866ae/superduperdb/base/datalayer.py#L717

Another thing - this won't work if the datatypes are `Encodable` or `Artifact`. This seems to be in the paradigm of `File`.
> How does the download task get handled? Currently we have this job which looks for URIs and downloads them, before any other jobs are triggered.

It is no longer treated as a task. For example, when you use `data = HttpPredownload("https://superduperdb.com/xxx").dict()`, the data has already been downloaded. We then get the real file path by calling `data['x']`, or get it directly by calling `HttpPredownload.encode_data("https://superduperdb.com/xxx")`.
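The eager-download behaviour described above can be sketched roughly as follows. `HttpPredownload`, `.dict()`, and `encode_data()` come from the comment; everything inside them (the cache directory, the hashing scheme, the use of `urlretrieve`) is an assumption for illustration, not the actual superduperdb implementation.

```python
# Sketch of an eager-download datatype: the download happens at
# construction/encoding time, not in a later background task.
import hashlib
import os
import tempfile
import urllib.request


class HttpPredownload:
    # Assumed cache location; the real implementation may differ.
    CACHE_DIR = os.path.join(tempfile.gettempdir(), "predownload")

    def __init__(self, uri: str):
        self.uri = uri

    @classmethod
    def encode_data(cls, uri: str) -> str:
        """Download `uri` immediately and return the local file path."""
        os.makedirs(cls.CACHE_DIR, exist_ok=True)
        local = os.path.join(
            cls.CACHE_DIR, hashlib.sha256(uri.encode()).hexdigest()
        )
        if not os.path.exists(local):
            urllib.request.urlretrieve(uri, local)
        return local

    def dict(self) -> dict:
        # By the time this returns, the data is already on disk.
        return {"x": self.encode_data(self.uri)}
```

With this shape, `HttpPredownload("https://superduperdb.com/xxx").dict()["x"]` is already a local path and no separate download job is needed.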
> Another thing - this won't work if the datatypes are `Encodable` or `Artifact`. This seems to be in the paradigm of `File`.

We can also use this logic; it depends on how we want to save the specific data, whether as a file or as binary data.
> It is no longer treated as a task. For example, when you use `data = HttpPredownload("https://superduperdb.com/xxx").dict()`, the data has already been downloaded. We then get the real file path by calling `data['x']` or `HttpPredownload.encode_data("https://superduperdb.com/xxx")`.

But then we miss all of the benefits of the current download task: the multi-threading and the multi-processing.
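For reference, the kind of batched, concurrent fetching that a dedicated download task provides (and that per-item eager downloading gives up) could look like this rough sketch; the helper names are illustrative, not the superduperdb API:

```python
# With eager per-item downloads, each insert blocks on one URI at a
# time. A download task can instead fan many URIs out over a pool.
from concurrent.futures import ThreadPoolExecutor
import urllib.request


def fetch(uri: str) -> bytes:
    """Fetch a single URI and return its raw bytes."""
    with urllib.request.urlopen(uri) as resp:
        return resp.read()


def download_all(uris, n_workers: int = 8):
    """Download many URIs concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(fetch, uris))
```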
> But then we miss all of the benefits of the current download task: the multi-threading and the multi-processing.

We would need to incorporate this into `.execute()` or into the cursor?
We create an S3 `DataType` with a parameter, `pre_download`.

The logic when `pre_download` is `True` and the data is encodable as `File`:

- During encoding: download the file from S3 and create a `FileEncodable` with the specified download path. After saving the data, the file/folder will be stored in the artifact store.
- During decoding: retrieve the file/folder from the artifact store.

This logic is similar to the file-encodable logic that retrieves files from artifacts.

The logic when `pre_download` is `False`:

- During encoding: return the original S3 path.
- During decoding: download the data from S3. `RemoteData` calls a download module, which provides the logic for loading remote files/URIs.
### Example
#### Pre Download
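A minimal sketch of the `pre_download=True` branch described above: encoding downloads the S3 object to a local path so it can be saved as a file artifact, and decoding just returns the artifact's local copy. The class and method names (`S3DataType`, `encode`, `decode`) are assumptions, and the S3 fetch is injected as a callable so the sketch stays runnable without boto3.

```python
import os
from typing import Callable


class S3DataType:
    """Sketch of an S3 datatype with pre_download=True."""

    def __init__(self, fetch: Callable[[str, str], None], download_dir: str):
        self.fetch = fetch            # e.g. a boto3 download_file wrapper
        self.download_dir = download_dir
        self.pre_download = True

    def encode(self, s3_uri: str) -> str:
        """Download now; the returned path is what gets stored as the artifact."""
        local = os.path.join(self.download_dir, os.path.basename(s3_uri))
        self.fetch(s3_uri, local)
        return local                  # stands in for a FileEncodable

    def decode(self, artifact_path: str) -> str:
        """The file already lives in the artifact store; just hand it back."""
        return artifact_path
```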
#### No Pre Download
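And a matching sketch of the `pre_download=False` branch: encoding stores only the original S3 URI, and decoding downloads on access, which is the role the proposal assigns to `RemoteData` and the download module. As above, the names and the injected fetcher are assumptions for illustration.

```python
import os
from typing import Callable


class RemoteData:
    """Sketch: wraps a remote URI and downloads lazily, on first access."""

    def __init__(self, uri: str, fetch: Callable[[str, str], None],
                 download_dir: str):
        self.uri = uri
        self.fetch = fetch            # stands in for the download module
        self.download_dir = download_dir

    def local_path(self) -> str:
        """Download on first access and reuse the cached local copy after."""
        local = os.path.join(self.download_dir, os.path.basename(self.uri))
        if not os.path.exists(local):
            self.fetch(self.uri, local)
        return local


def encode(s3_uri: str) -> str:
    # Nothing is downloaded at encode time; the URI itself is stored.
    return s3_uri


def decode(s3_uri: str, fetch: Callable[[str, str], None],
           download_dir: str) -> str:
    # The download happens here, when the data is actually read.
    return RemoteData(s3_uri, fetch, download_dir).local_path()
```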