jjjake / internetarchive

A Python and Command-Line Interface to Archive.org
GNU Affero General Public License v3.0

upload_file() calculates checksums even if checksum isn't used. #77

Closed konklone closed 4 years ago

konklone commented 9 years ago

I'm trying to upload a large file, and the extreme slowness before the uploader begins sending data suggests to me that it's either reading the whole file into memory, or at least reading over it once before beginning.

It seems like part of the problem might be here, where a size hint is calculated for the API:

body.seek(0, os.SEEK_END)
size = body.tell()
body.seek(0, os.SEEK_SET)

Python's os module has ways of fetching the size of a given file path that can read it directly from filesystem metadata, rather than seeking through the file.
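For example (a minimal sketch; the path is hypothetical), either of these reads the size from stat metadata without scanning the contents:

import os

# Both read the size from filesystem metadata; the file contents are never touched.
size = os.path.getsize('/path/to/large_file.iso')
size = os.stat('/path/to/large_file.iso').st_size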

It also looks like a problem here, where an MD5 is calculated over the entire file:

md5_sum = utils.get_md5(body)

Providing a way to disable the MD5 check, or multipart uploads where an MD5 only has to be taken per-chunk, would help mitigate this.

It looks like multipart uploads got ditched in the switch from boto to requests. Is restoring support for multipart uploads a planned feature?

aaroncohen commented 9 years ago

Out of curiosity, I took a peek at this.

Seeking to the end of a file is unlikely to be a time consuming operation. It also appears to be necessary, as it's just the file body handle that's being passed around...I'd have to dive deeper, but I think the lib doesn't make the assumption that it has to be a file on disk.

The md5_sum is definitely reading the whole file, but the implementation used is such that we're likely to be disk-bound, not CPU-bound. There's not much room to improve it, short of disabling the MD5 as you suggested.
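For reference, a chunked MD5 pass typically looks something like this (a sketch, not necessarily the library's exact implementation):

import hashlib

def md5_of_file(fileobj, chunk_size=1024 * 1024):
    # Read in fixed-size chunks; hashing is cheap, so a single full pass over
    # the file is dominated by disk reads.
    digest = hashlib.md5()
    for chunk in iter(lambda: fileobj.read(chunk_size), b''):
        digest.update(chunk)
    fileobj.seek(0)  # rewind so the subsequent upload starts at the beginning
    return digest.hexdigest()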

On the other hand, you're far more likely to run into this issue: https://github.com/kennethreitz/requests/issues/2181

The requests library, it seems, has some performance problems dealing with large files.

konklone commented 9 years ago

On the other hand, you're far more likely to run into this issue: kennethreitz/requests#2181

I'm sure that does slow down my upload, but that wasn't the issue I observed. What surprised me was the extensive delay before the file started uploading.

Seeking to the end of a file is unlikely to be a time consuming operation. It also appears to be necessary, as it's just the file body handle that's being passed around...I'd have to dive deeper, but I think the lib doesn't make the assumption that it has to be a file on disk.

Makes sense to me, as long as it's not time consuming for a 34GB file.

The md5_sum is definitely reading the whole file, but the implementation used is such that we're likely to be disk-bound, not CPU-bound. There's not much room to improve it, short of disabling the MD5 as you suggested.

The Internet Archive API supports the S3 Multipart Upload API specification, and one of the main purposes of that API is to avoid ever having to read the entire origin file before beginning to upload. This IA library should be able to replicate that, so that if I tell internetarchive to upload a 36GB file, the upload begins immediately.
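For illustration, the standard S3 multipart flow looks roughly like this against the archive.org S3 endpoint (a hedged sketch only: the item name, file name, and 100 MB part size are made up, and it assumes archive.org mirrors the usual AWS initiate/upload-part/complete semantics):

import requests

ENDPOINT = 'http://s3.us.archive.org/some-item/large_file.iso'  # hypothetical item and file
HEADERS = {'authorization': 'LOW accesskey:secretkey'}          # placeholder credentials
PART_SIZE = 100 * 1024 * 1024

# 1. Initiate the multipart upload; the XML response carries an UploadId.
r = requests.post(ENDPOINT + '?uploads', headers=HEADERS)
upload_id = r.text.split('<UploadId>')[1].split('</UploadId>')[0]

# 2. Stream the file one part at a time. Only the current part is held in memory,
#    and any per-part MD5 would only ever cover PART_SIZE bytes.
etags = []
with open('large_file.iso', 'rb') as f:
    part_number = 1
    while True:
        part = f.read(PART_SIZE)
        if not part:
            break
        resp = requests.put('%s?partNumber=%d&uploadId=%s' % (ENDPOINT, part_number, upload_id),
                            headers=HEADERS, data=part)
        etags.append((part_number, resp.headers['ETag']))
        part_number += 1

# 3. Tell the server which parts make up the finished object.
parts_xml = ''.join('<Part><PartNumber>%d</PartNumber><ETag>%s</ETag></Part>' % (n, e)
                    for n, e in etags)
requests.post('%s?uploadId=%s' % (ENDPOINT, upload_id), headers=HEADERS,
              data='<CompleteMultipartUpload>%s</CompleteMultipartUpload>' % parts_xml)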

saper commented 9 years ago

Yes, it should be possible to rewrite upload to work even with a non-seekable stream. MD5 is not mandatory; we could add a parameter to supply the content MD5 if it is known (if you agree to change the API), and make it mandatory for verify and maybe delete. The MD5 can also be calculated during the upload and used to verify the metadata after the upload.
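One way to do that without a second pass over the data is a small file-like wrapper that hashes the body as the upload reads it (illustrative only, not existing library code):

import hashlib

class HashingReader(object):
    """Wraps a file object and updates an MD5 as the body is read out."""
    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.md5 = hashlib.md5()

    def read(self, size=-1):
        chunk = self.fileobj.read(size)
        if chunk:
            self.md5.update(chunk)
        return chunk

# After the request body has been fully consumed, HashingReader.md5.hexdigest()
# can be compared against the metadata archive.org reports for the uploaded file.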

I also think we don't need to prepare a new request every time we retry, so we can avoid seeking on the body altogether.

konklone commented 8 years ago

@jjjake Did this get addressed elsewhere? (I haven't followed the whole repo closely.)

saper commented 8 years ago

Not quite... I can start working on this if that's okay...

jjjake commented 8 years ago

@konklone @saper -- My apologies, I misunderstood the issue here.

@konklone you are correct. upload_file() is calculating a checksum for the file even if checksum is False. This should be an easy fix. utils.get_md5(body) should not be called if checksum, verify and delete are all False.
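In other words, something along these lines (illustrative; variable names follow the upload_file() parameters mentioned above):

md5_sum = None
if checksum or verify or delete:
    # Only pay for a full read of the file when the digest is actually needed.
    md5_sum = utils.get_md5(body)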

@saper please feel free to work on this. I'm happy to work on this as well if you don't have the time.

Thanks for following up @saper and @konklone, I appreciate it! : )

saper commented 8 years ago

Thanks! I'm interested in uploading some very large files at some point, so I'd love to be able to stream them into ia upload (from stdin, for example), in which case seek will not be available. I've started reading the code to get an idea of how to fix this, but I'm not sure I understand everything; for example, I don't know how the Content-MD5 header changes things, or what responses to expect from the API when it is specified. The MD5 could also be specified on the command line when we are streaming the file and need to verify (if I understand correctly how verify is supposed to work).

jjjake commented 8 years ago

@saper The Content-MD5 header should be set to the md5 checksum of the file being uploaded (if provided). Once all of the data has been handed off to archive.org, archive.org calculates an md5 checksum of the data and checks whether it matches the checksum in the Content-MD5 header. If it does, archive.org returns a 200 and your item gets created. If they do not match, a 400 status is returned and no data is saved. For example:

» curl --location --header 'x-amz-auto-make-bucket:1' \
       --header 'Content-MD5:bad md5' \
       --header 'x-archive-meta-mediatype:texts' \
       --header 'x-archive-meta-collection:test_collection' \
       --header 'x-archive-meta-language:eng' \
       --header "authorization: LOW $access:$secret" \
       --upload-file ./1866_Muller_A928.pdf \
       http://s3.us.archive.org/jake-s3-test-2016-01-11/1866_Muller_A928.pdf
<?xml version='1.0' encoding='UTF-8'?>
<Error><Code>BadDigest</Code><Message>The Content-MD5 you specified did not match what we received.</Message><Resource>content-md5 submitted with PUT: bad md5 != recieved data md5: 9a021f2e21b24a9641a35758cb6f83f5also not equal to base64 version: mgIfLiGySpZBo1dYy2+D9Q==</Resource><RequestId>e1c3950a-752b-4905-9cdf-bd28c1e6702a</RequestId></Error>

See RFC 1864 for more details on the Content-MD5 header.
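For comparison, a rough requests-based equivalent of the curl call above (the file name, item name, and credentials are placeholders; judging from the error message above, archive.org appears to accept the hex digest as well as the base64 form required by RFC 1864):

import hashlib
import requests

url = 'http://s3.us.archive.org/jake-s3-test-2016-01-11/1866_Muller_A928.pdf'
headers = {
    'x-amz-auto-make-bucket': '1',
    'x-archive-meta-mediatype': 'texts',
    'x-archive-meta-collection': 'test_collection',
    'x-archive-meta-language': 'eng',
    'authorization': 'LOW accesskey:secretkey',
}

with open('1866_Muller_A928.pdf', 'rb') as f:
    headers['Content-MD5'] = hashlib.md5(f.read()).hexdigest()

with open('1866_Muller_A928.pdf', 'rb') as f:
    r = requests.put(url, headers=headers, data=f)

# 200: the digest matched and the item was created; 400 BadDigest: nothing was saved.
print(r.status_code, r.text)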

I would love to get cat data | ia upload - --remote-name file.txt working properly (i.e. stream from stdin rather than read it all into memory). I have not worked with or looked at the upload-from-stdin code in a while, so if you'd like to work on this @saper that would be great! I also love the idea of being able to specify an md5 via an arg. I think on the Python side we should check for a Content-MD5 header in upload() and, if one exists, use that for the md5 rather than calculating one. The CLI should have an --md5 option that is handed off to upload() as a Content-MD5 header.
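On the Python side, that check could look roughly like this (a sketch; the exact parameter names may differ):

headers = headers or {}
# Prefer a digest supplied by the caller (e.g. via a CLI --md5 option) over
# computing one by reading the entire file up front.
if 'Content-MD5' not in headers and (checksum or verify or delete):
    headers['Content-MD5'] = utils.get_md5(body)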

JustAnotherArchivist commented 4 years ago

@jjjake, looks like 5ba8acb2 fixed the original issue of calculating a hash even if it's not needed. Streaming uploads would also be great (and this is certainly an important step towards it) but should in my opinion be filed as a separate issue, so I'd suggest closing this.