uskudnik / amazon-glacier-cmd-interface

Command line interface for Amazon Glacier
MIT License
374 stars 100 forks source link

Fixing issue #94 by getting rid of mmap. Now we just seek for required... #99

Closed stigger closed 11 years ago

stigger commented 11 years ago

... data and read it, without having to read the whole file at once or mmap it

wvmarle commented 11 years ago

This is the wrong solution for this issue.

This mitigates the very reason mmap was introduced: not having to read complete parts of the file in memory. This will fail badly as soon as your part size is greater than the amount of memory available, and in all other situations uses more memory than strictly necessary.

The solution lies in only mmapping the required part of the file.

stigger commented 11 years ago

The code that can upload large files by small parts is better, than code that can't upload large files at all.

The existing mmap implementation doesn't work for me at all -- I couldn't upload a 4G file with 2G RAM, exactly because it tried to mmap a file larger than available memory. There were no quick solutions for #94, so I implemented the quickest and easiest fix that came to mind. Also, it's very unlikely that I'm going to upload parts larger than amount of available RAM.

This might not be an ideal solution, but it's not "wrong" -- it provides the only functional (in my configuration) version of glacier-cmd. I'm sure that I'm not the only who needs to upload files larger than RAM.

wvmarle commented 11 years ago

The bug is trying to mmap the complete file in one go elsewhere which fails fro files > 2GB irrespective of physical memory.

stigger commented 11 years ago

OK, but that's not the point. The point is that code in /master is unusable, and this commit makes it usable.

wvmarle commented 11 years ago

mmap was introduced because the existing code was unusable for many. Actually the older versions managed to load like five times a part in memory (with at the time a fixed 128 MB size that was a big issue as you can imagine), that was really crazy. Without mmap there will be issues on systems with little RAM such as SAN servers.

stigger commented 11 years ago

Give me an example of situation in which current /master code will work, and my will fail. I can think of exactly one: file is less than 2gigs and part sizes are larger than available memory. And even in that case my code will fail, but it's still possible to make it work by lowering part sizes.

Look, all I wanted to do is upload my video archive to Glacier. I took the code from /master (which is supposed to be the latest stable version with applied hotfixes), and couldn't use it until I implemented 5 fixes.

If you have the proper solution for this problem — great, let's merge it into /master, if you don't — this, in my opinion, is a compromise that will work for most of the people. I understand that it might not work very well with large part sizes, but 128M is not the default anymore, so probably we shouldn't worry about that.

echamberlain commented 11 years ago

Stigger,

I'm also trying to use glacier to archive time lapse 4K footage (17MB per frame) and my uploads are in the gigabytes range.

I agree there are annoying bugs to work out (exceptions that require upload restarts), but lets not get bogged down in source management issues. This is a community open source project, expecting master to be the latest stable version is expecting a bit much, especially when it is a 0.2-dev release. Personally, I am thankful the developers saved me a weekend of coding and I also appreciate your efforts to fix bugs.

stigger commented 11 years ago

Every properly managed community open source project has the latest stable version in /master. "Stable" here means "without in-development features, but with hotfixes applied".

Source management issues are not just source management issues, they produce other issues. For instance, I checked out the /master, found a bug. After a little bit of looking around, I've found out the following: 1) Last commit into this repository was almost two weeks ago 2) There are no bugfix pull requests

So, I went ahead and fixed the bug (#101), only to find out that it's already has been fixed 3 weeks ago! (de95be6) Only it's been buried under a ton of other commits. The result: the issue is known, a fix is implemented 3 weeks ago, but the /master branch still has a bug in it because of poor source management approach. And what if the project maintainer decides not to merge in that feature branch?…

It all comes down to what the developers are doing here -- if they just develop a tool for themselves and publish the source code here, then my arguments here are unreasonable, but if they are doing a community open source project that's being used by many people and involves community developers, then they have to also think about community, make it convenient to use their code and contribute by other people.

The bottom line:

offlinehacker commented 11 years ago

Tomorrow me and @uskudnik will have sprint and will be probably migrating upload code on boto implementation and possibly upgrading boto with our ideas(like mmap).

Then tests will be written and tested and release will be made.

We will also write live tests. I will probably try to emulate upload/download aspects of glacier and inject random errors, but it looks kinda difficult.

Please wait for around 2 weeks for a relese. Until then code is still considered in development stage and if it works for someone and does not for somebody else it's not good. On Nov 6, 2012 8:37 PM, "Vyacheslav Karpukhin" notifications@github.com wrote:

Every properly managed community open source project has the latest stable version in /master. "Stable" here means "without in-development features, but with hotfixes applied".

Source management issues are not just source management issues, they produce other issues. For instance, I checked out the /master, found a bug. After a little bit of looking around, I've found out the following: 1) Last commit into this repository was almost two weeks ago 2) There are no bugfix pull requests

So, I went ahead and fixed the bug (#101https://github.com/uskudnik/amazon-glacier-cmd-interface/issues/101), only to find out that it's already has been fixed 3 weeks ago! (de95be6https://github.com/uskudnik/amazon-glacier-cmd-interface/commit/de95be6) Only it's been buried under a ton of other commits. The result: the issue is known, a fix is implemented 3 weeks ago, but the /master branch still has a bug in it because of poor source management approach. And what if the project maintainer decides not to merge in that feature branch?…

It all comes down to what the developers are doing here -- if they just develop a tool for themselves and publish the source code here, then my arguments here are unreasonable, but if they are doing a community open source project that's being used by many people and involves community developers, then they have to also think about community, make it convenient to use their code and contribute by other people.

The bottom line:

  • /master should contain all implemented fixes
  • source management is extremely important if you want other (not core ) developers to contribute.

    — Reply to this email directly or view it on GitHubhttps://github.com/uskudnik/amazon-glacier-cmd-interface/pull/99#issuecomment-10124442.

uskudnik commented 11 years ago

A short report:

Well, we haven't gotten to release yet - @offlinehacker hacked around mmap but IIRC still hasn't completed everything.

I have SNS support almost done, shouldn't be long before it's finished and I make a pull request - RFC https://github.com/uskudnik/amazon-glacier-cmd-interface/issues/104.

We checked tests a bit but haven't made any progress on that front - boto does a bit of testing, but we have no method to test files that cause problems (multi gigabit) on an automated scale.

offlinehacker commented 11 years ago

Well i realized a good written, tested and parallel upload routine has top priority, so i will have it ready till the end of the week. On Nov 9, 2012 2:26 AM, "Urban Škudnik" notifications@github.com wrote:

A short report:

Well, we haven't gotten to release yet - @offlinehackerhttps://github.com/offlinehackerhacked around mmap but IIRC still hasn't completed everything.

I have SNS support almost done, shouldn't be long before it's finished and I make a pull request - RFC uskudnik/amazon-glacier-cmd-interface#104https://github.com/uskudnik/amazon-glacier-cmd-interface/issues/104 .

We checked tests a bit but haven't made any progress on that front - boto does a bit of testing, but we have no method to test files that cause problems (multi gigabit) on an automated scale.

— Reply to this email directly or view it on GitHubhttps://github.com/uskudnik/amazon-glacier-cmd-interface/pull/99#issuecomment-10212477.

offlinehacker commented 11 years ago

Btw @uskudnik i think you can merge this request if it's fixing uploads, it's better than having version that crashes. My version is major rewrite and reorganization anyway. On Nov 9, 2012 9:02 AM, "Jaka Hudoklin" jakahudoklin@gmail.com wrote:

Well i realized a good written, tested and parallel upload routine has top priority, so i will have it ready till the end of the week. On Nov 9, 2012 2:26 AM, "Urban Škudnik" notifications@github.com wrote:

A short report:

Well, we haven't gotten to release yet - @offlinehackerhttps://github.com/offlinehackerhacked around mmap but IIRC still hasn't completed everything.

I have SNS support almost done, shouldn't be long before it's finished and I make a pull request - RFC uskudnik/amazon-glacier-cmd-interface#104https://github.com/uskudnik/amazon-glacier-cmd-interface/issues/104 .

We checked tests a bit but haven't made any progress on that front - boto does a bit of testing, but we have no method to test files that cause problems (multi gigabit) on an automated scale.

— Reply to this email directly or view it on GitHubhttps://github.com/uskudnik/amazon-glacier-cmd-interface/pull/99#issuecomment-10212477.

uskudnik commented 11 years ago

Indeed - since you have it so far already I merged this one in for the time being.