rickmcgeer / TransCloud

Software for the TransCloud project

Build a file cache. #16

Closed cmatthew closed 11 years ago

cmatthew commented 11 years ago

Make a local file cache so that files are not downloaded more often than necessary. The file cache should not fill the disk completely; it should be able to check the integrity of files, since downloads are sometimes corrupted and we don't want those in the cache; and it should have some smart eviction policy (LRU?). It also needs to be accessible by more than one process, so that two workers can run on the same machine at the same time.
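For the multi-process requirement, one option is a POSIX advisory lock on a file inside the cache directory. This is a minimal sketch, not anything in the repo; the helper name `with_cache_lock` and the `.lock` file are assumptions:

```python
import fcntl
import os

def with_cache_lock(cache_dir, fn):
    """Run fn() while holding an exclusive lock on the cache directory.

    Hypothetical helper: serializes cache mutations across worker
    processes on the same machine via a POSIX advisory lock file.
    """
    lock_path = os.path.join(cache_dir, ".lock")
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            return fn()
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

Advisory locks only work if every worker goes through the same helper, but that is easy to guarantee if all cache access funnels through one module.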

One thing we might want to look at is storing MD5 sums (or something similar) for each file in the database, so we can verify the downloaded files.
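The verification step could look something like this; a sketch assuming the hex digest is what gets stored alongside each file's database record:

```python
import hashlib

def checksum_ok(path, expected_md5):
    """Compare a file's MD5 against the sum recorded in the database.

    expected_md5 is assumed to be the hex digest stored with the file's
    database record; returns False for corrupted downloads.
    """
    h = hashlib.md5()
    with open(path, "rb") as f:
        # read in 1 MiB chunks so large files don't blow up memory
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected_md5
```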

This should be easy to build and test separately from the rest of the system!

cmatthew commented 11 years ago

This is probably also a good task for thinking about minimum VM size, since the VM size will directly impact the cache.

rickmcgeer commented 11 years ago

I was thinking about this, but missed the corruption part. It seems to me we check that on the download. The pseudo-code looks something like this:

    if file in cache:
        touch file  # update the time for LRU
        return
    else:
        ok = False
        while not ok:
            download file
            ok = checksum_ok(file)
        used = du -s cache directory
        if used >= max_size:
            sort files in cache by modified time
            (or just pick the oldest -- O(n) vs O(n log n))
            toss out the oldest
        put the new file in the cache
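A Python sketch of the pseudo-code above. The `download` and `checksum_ok` callables are assumed to be supplied by the caller; this uses the O(n) pick-the-oldest variant rather than a full sort:

```python
import os

def cache_get(cache_dir, name, download, checksum_ok, max_size):
    """Return the cached path for name, downloading and evicting as needed.

    download(path) fetches the file; checksum_ok(path) verifies it.
    Both are hypothetical callables standing in for the real system.
    """
    path = os.path.join(cache_dir, name)
    if os.path.exists(path):
        os.utime(path, None)            # touch: update mtime for LRU
        return path
    ok = False
    while not ok:                       # re-download until the checksum passes
        download(path)
        ok = checksum_ok(path)
    entries = [os.path.join(cache_dir, f) for f in os.listdir(cache_dir)]
    used = sum(os.path.getsize(p) for p in entries)
    while used > max_size and len(entries) > 1:
        oldest = min(entries, key=os.path.getmtime)  # O(n) scan, no sort
        if oldest == path:              # never evict the file we just fetched
            break
        used -= os.path.getsize(oldest)
        os.remove(oldest)
        entries.remove(oldest)
    return path
```

Note the pseudo-code's `==` comparison against `max_size` becomes `>` here, since a single large file can push usage past the limit in one step.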

cmatthew commented 11 years ago

That looks good to me. I'm sure we won't have more than a few hundred files, probably more like tens on the smaller nodes, so sorting won't be too bad.

rickmcgeer commented 11 years ago

Actually, if we keep the list in sorted order in some persistent structure, all we will have to do is update it; that said, this is an optimization that probably won't matter. Sorting is cheap compared to the download and the file read, and, for that matter, to stat-ing the directory to get the modified times.

On the subject of other optimizations, we can actually predict with 100% accuracy which files we'll need by looking at the message queue; we issue the work orders long before they'll get done. This means we could hit the theoretical optimum, not just LRU. Again, let's put this in our back pocket.
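Since the queue tells us the future access sequence, the optimal (Belady-style) policy is to evict the cached file whose next use is farthest away. A sketch, assuming `upcoming` is the ordered list of filenames the queued work orders will need:

```python
def pick_victim(cached, upcoming):
    """Belady-style eviction sketch: given the filenames in the cache and
    the ordered list of files the queued jobs will need, return the cached
    file whose next use is farthest in the future (or never comes).
    """
    def next_use(name):
        try:
            return upcoming.index(name)
        except ValueError:
            return len(upcoming)   # never needed again: the best victim
    return max(cached, key=next_use)
```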

I'll take a stab at some of the code on the flight. Hampered by no Internet access (aka, no manuals, Google, w3schools, OpenLayers tutorial...very 20th century).

cmatthew commented 11 years ago

Actually, I was thinking that we should sort the jobs in the message queues so that they match up with WSR2 zones; that way, back-to-back jobs would almost always use the same image. I will make a new ticket for this.
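The ordering idea amounts to a sort-then-group over the queue. A sketch, where `zone_of` is an assumed callable mapping a job to its zone id:

```python
from itertools import groupby

def order_jobs_by_zone(jobs, zone_of):
    """Sort queued jobs so that jobs in the same WSR2 zone run back to
    back and can reuse the cached image. zone_of is a hypothetical
    callable mapping a job to its zone id."""
    return sorted(jobs, key=zone_of)

def zone_runs(jobs, zone_of):
    """Group the ordered jobs into consecutive runs, one per zone."""
    ordered = order_jobs_by_zone(jobs, zone_of)
    return [(zone, list(run)) for zone, run in groupby(ordered, key=zone_of)]
```

A plain sort is stable in Python, so jobs within a zone keep their original queue order.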

rickmcgeer commented 11 years ago

This is now implemented in gcswift.py, in the FileCache and FileManager classes. Integrated and tested.