mathworks-ref-arch / matlab-aws-s3

MATLAB interface for AWS S3.

Support for parallel computing #3

Open michael-pont opened 4 years ago

michael-pont commented 4 years ago

Is there support for using parallel computing? My eventual goal is to upload thousands of files in parallel using Parallel Computing Toolbox. The following code does not work:

% Create and initialize the low-level client
s3 = aws.s3.Client();
s3.initialize();

%% Parfor test
bucketLocation = bucketName + "/" + "textractTest";
files = {'secondTest.json', 'otherTest.json', 'testMatlabUpload.json'};
parfor i = 1:length(files)
    filename = files{i};
    s3.putObject(char(bucketLocation), filename, filename);
    disp(i);
end

Warning: com.amazonaws.ClientConfiguration@2016f509 is not serializable 
Warning: While loading an object of class 'Client':
Invalid default value for property 'clientConfiguration' in class 'aws.s3.Client':
Error: File: ........\mathworks-aws-support\matlab-aws-common\Software\MATLAB\app\system\+aws\@ClientConfiguration\ClientConfiguration.m Line: 28 Column: 20
The import statement 'import com.amazonaws.ClientConfiguration' cannot be found or cannot be imported. Imported names must end with '.*' or be fully qualified.

Thanks in advance.

michael-pont commented 4 years ago

I am able to use the higher-level AWS S3 connection discussed here: https://de.mathworks.com/help/matlab/import_export/work-with-remote-data.html and I can get the parallel computing to work by letting the workers know my AWS credentials; that is not a problem. However, I am only able to read .pdf and .json files through a fileDatastore, so I can download files in parallel, but as far as I know I cannot upload .pdf or .json files that way since uploading is not supported. That is why I found this lower-level GitHub repo, which I thought would solve my problem: with its API I can now upload all file types easily. However, it is necessary to do this in parallel, since uploading thousands of files one by one would take too long.
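
For reference, a minimal sketch of the high-level read path described above. The bucket name and credential values are placeholders, and fileread is only an example ReadFcn; whichever reader you use must accept remote s3:// paths in your MATLAB release (see the "Work with Remote Data" page linked above).

% Sketch (not from the thread): read JSON objects from S3 via fileDatastore,
% with credentials supplied as environment variables.
setenv('AWS_ACCESS_KEY_ID', 'YOUR_KEY_ID');
setenv('AWS_SECRET_ACCESS_KEY', 'YOUR_SECRET_KEY');
setenv('AWS_REGION', 'us-east-2');

fds = fileDatastore("s3://" + bucketName + "/textractTest", ...
    'ReadFcn', @fileread, 'FileExtensions', '.json');
docs = readall(fds);    % cell array with one entry per object read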

brownemi commented 4 years ago

You are correct that the low-level API can't be used in parallel in this way. This is documented at: https://github.com/mathworks-ref-arch/matlab-aws-s3/blob/master/Documentation/MATLABDistributed.md

The reason is that the underlying Java SDK client cannot be serialized and will have open network connections etc. that would prevent it from being moved or replicated to workers on other systems.

An approach to dealing with this would be to use an SPMD block to create and initialize a client on each of the workers, assuming they all have access to all the files and the package is installed on each worker. Then parfor can be used to automatically divide and distribute a list of the file paths to upload or download, as you suggest. If you'd like to work through this interactively, please contact us at mwlab@mathworks.com.
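
To make that concrete, here is a minimal sketch, assuming a pool is already running, that every worker can see the local files, that credentials and the package paths are available on each worker, and that bucketName and the file names from the original post stand in for your own. It keeps the upload loop inside the SPMD block and splits the file list by labindex, rather than handing off to parfor.

% Sketch: one client per worker, created inside spmd so the Java object
% never needs to be serialized or sent anywhere.
files = {'secondTest.json', 'otherTest.json', 'testMatlabUpload.json'};
bucketLocation = char(bucketName + "/" + "textractTest");

spmd
    s3 = aws.s3.Client();
    s3.initialize();

    % Each worker takes every numlabs-th file, starting at its own index.
    myFiles = files(labindex:numlabs:numel(files));
    for k = 1:numel(myFiles)
        s3.putObject(bucketLocation, myFiles{k}, myFiles{k});
    end

    s3.shutdown();
end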

michael-pont commented 4 years ago

@brownemi Thank you for responding quickly! Your advice was very helpful and I was able to get it to work with the SPMD block. To get it working, I additionally had to pass the environment variables to the workers, as described for the high-level API:

setenv("AWS_ACCESS_KEY_ID", char(creds.ACCESS_KEY));
setenv("AWS_SECRET_ACCESS_KEY", char(creds.SECRET_KEY));
setenv("AWS_REGION", "us-east-2");
envVars = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION"];
parpool("local", workers, "EnvironmentVariables", envVars);

So the AWS credential environment variables set on the system by running the startup script do not apply anymore?

brownemi commented 4 years ago

You're welcome; I'm glad it helped. The startup script sets paths rather than environment variables, so unless those paths have been configured permanently on your workers you should continue to add them by running the startup script in the SPMD block. You may want to suppress its output in due course, especially with a large number of workers. The startup script does not share credentials or propagate any environment variables.
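
For example, something along these lines; the install path below is a placeholder for wherever the package lives on the workers:

% Sketch: run the package startup on every worker, capturing its output
% with evalc so the command window is not flooded when the pool is large.
spmd
    evalc('run(''/path/to/matlab-aws-s3/Software/MATLAB/startup.m'')');
end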

As you may be aware, by default this low-level API supports what is referred to as the credentials provider chain, whereby it iterates through five different potential sources of credentials; alternatively you can override this and use a simple JSON file. This is documented in: https://github.com/mathworks-ref-arch/matlab-aws-s3/blob/master/Documentation/Authentication.md
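
As a rough, hedged sketch of the JSON-file route: the useCredentialsProviderChain property and the credentials file layout shown in the comments below are recollections rather than quotes from the thread, so verify both against Authentication.md before relying on them.

% Sketch (verify names against Authentication.md): disable the provider
% chain so the client reads a credentials JSON file found on the MATLAB path.
s3 = aws.s3.Client();
s3.useCredentialsProviderChain = false;   % assumed property name
s3.initialize();

% The credentials file is plain JSON along these (assumed) lines:
%   { "aws_access_key_id": "...", "secret_access_key": "...", "region": "us-east-2" }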

Environment variables are the first potential credential source and can be used as you demonstrate. If you have AWS credentials defined locally as environment variables, then a step to share them, like the one you show, is important, as the workers will normally be on different machines unless you are using a local pool.

I can see one potential problem with this. If you need more than one client or want to change credentials on the fly (e.g. when using short-term Security Token Service (STS) tokens that must be refreshed), then environment variables have limitations. The low-level API calls the AWS SDK for Java, and the environment variables that the JVM sees are those present when the MATLAB session hosting it was started. Changing or creating a variable with setenv after that point will therefore not take effect inside the JVM.
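
A quick way to see this from the MATLAB prompt:

% The JVM captures its environment when MATLAB starts, so a later setenv
% is visible to MATLAB functions but not to Java (and hence not to the SDK).
setenv('AWS_REGION', 'eu-west-1');
getenv('AWS_REGION')                            % MATLAB sees the new value
char(java.lang.System.getenv('AWS_REGION'))     % JVM still reports whatever was set at startup (possibly empty)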

There are also some obscure use cases whereby the JVM may not be restarted between jobs and so would not inherit the variables, but as long as you are using parpool this does not arise. In such edge cases I'd tend towards using a shared credentials file (in either format), assuming the shared file is adequately protected.

I'll review the documentation with all this in mind to see how it can be improved. I hope this answers your question.

michael-pont commented 4 years ago

@brownemi Sorry for the late response, and thanks for your input. Since the project already has a .env file, I will take your recommendation and also implement a JSON credentials file, as it doesn't come with the drawbacks you described.