sillsdev / machine

Machine is a natural language processing library for .NET that is focused on providing tools for processing resource-poor languages.
MIT License

XML upload for large files broken #128

Closed: johnml1135 closed this issue 1 year ago

johnml1135 commented 1 year ago
Build faulted (6541a3243627520aa5115890)
Aborted upload waEQ5At.V4L_nUDNdSPuhMzaFBf3cbvDXoi4ADQv7.rZtTa3_L_YKBL59T2D2olULymjFQgVziH8mkaERNg7Gdq2w0dEcAqMWF2kSEn_nw5E9KesTX3rpAQuOrEKduqL to aqua-ml-data/ext-qa/builds/6541a3243627520aa5115890/train.trg.txt
      Amazon.S3.AmazonS3Exception: The XML you provided was not well-formed or did not validate against our published schema
       ---> Amazon.Runtime.Internal.HttpErrorResponseException: Exception of type 'Amazon.Runtime.Internal.HttpErrorResponseException' was thrown.
         at Amazon.Runtime.HttpWebRequestMessage.GetResponseAsync(CancellationToken cancellationToken)
         at Amazon.Runtime.Internal.HttpHandler`1.InvokeAsync[T](IExecutionContext executionContext)
         at Amazon.Runtime.Internal.RedirectHandler.InvokeAsync[T](IExecutionContext executionContext)
         at Amazon.Runtime.Internal.Unmarshaller.InvokeAsync[T](IExecutionContext executionContext)

There appears to be an issue with the multipart upload. This may help: https://stackoverflow.com/questions/45727244/malformedxml-the-xml-you-provided-was-not-well-formed-or-did-not-validate-again.

Enkidu93 commented 1 year ago

I've encountered this error before. I think it is the error thrown when you give it no source or target files (???). I'll verify. If so, should we just wrap it and throw something more meaningful?

johnml1135 commented 1 year ago

Note that all three files failed to upload with the same error: train.src.txt, train.trg.txt and pretranslate.src.json.

Enkidu93 commented 1 year ago

Yes, this error is the result of submitting a source file that is completely empty. @johnml1135, should we catch it and throw something more descriptive, or check earlier on whether the source is empty?

ddaspit commented 1 year ago

I think we should just handle an empty file properly in S3WriteStream. On dispose, if no parts have been uploaded, abort the multipart upload and call PutObjectAsync with an empty file. Alternatively, initiate the multipart upload only on the first call to WriteAsync, which would avoid the need to abort it.
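
For illustration, here is a rough sketch of the second option (starting the multipart upload lazily). It is not the actual S3WriteStream code and does not use the AWS SDK: `FakeS3` is a hypothetical in-memory stand-in whose method names only loosely mirror the S3 operations, and `LazyMultipartWriter` and `part_size` are made-up names. The stand-in rejects completing an upload with zero parts, modelled on the failure reported in this thread.

```python
class FakeS3:
    """Hypothetical in-memory stand-in for an S3 client (assumption, not a real SDK)."""

    def __init__(self):
        self.objects = {}   # key -> stored bytes
        self.uploads = {}   # upload id -> {part number: bytes}
        self.calls = []     # names of operations, in call order
        self._counter = 0

    def create_multipart_upload(self, key):
        self._counter += 1
        upload_id = f"upload-{self._counter}"
        self.uploads[upload_id] = {}
        self.calls.append("create_multipart_upload")
        return upload_id

    def upload_part(self, key, upload_id, part_number, body):
        self.uploads[upload_id][part_number] = body
        self.calls.append("upload_part")

    def complete_multipart_upload(self, key, upload_id):
        self.calls.append("complete_multipart_upload")
        parts = self.uploads.pop(upload_id)
        if not parts:
            # Modelled on the error in this issue: no parts to complete.
            raise ValueError("MalformedXML: no parts were uploaded")
        self.objects[key] = b"".join(parts[n] for n in sorted(parts))

    def put_object(self, key, body):
        self.objects[key] = body
        self.calls.append("put_object")


class LazyMultipartWriter:
    """Starts the multipart upload only on the first non-empty write."""

    def __init__(self, client, key, part_size=5 * 1024 * 1024):
        self._client = client
        self._key = key
        self._part_size = part_size
        self._buffer = bytearray()
        self._upload_id = None   # None until the first non-empty write
        self._next_part = 1

    def write(self, data):
        if not data:
            return
        if self._upload_id is None:
            self._upload_id = self._client.create_multipart_upload(self._key)
        self._buffer.extend(data)
        # Send every full part that has accumulated in the buffer.
        while len(self._buffer) >= self._part_size:
            self._flush_part(self._part_size)

    def _flush_part(self, size):
        chunk = bytes(self._buffer[:size])
        del self._buffer[:size]
        self._client.upload_part(
            self._key, self._upload_id, self._next_part, chunk
        )
        self._next_part += 1

    def close(self):
        if self._upload_id is None:
            # Nothing was ever written: store an empty object with a plain put,
            # so there is no multipart upload to abort or complete.
            self._client.put_object(self._key, b"")
            return
        if self._buffer:
            self._flush_part(len(self._buffer))
        self._client.complete_multipart_upload(self._key, self._upload_id)
```

With this shape, closing a writer that never received data results in a single `put_object` call for an empty object, while a non-empty stream goes through the usual create, upload parts, complete sequence.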