Closed cmatthew closed 11 years ago
This is probably also a good task for thinking about minimum VM size, since the cache will directly impact it.
I was thinking about this, but missed the corruption part. It seems to me we check that on download. The pseudo-code looks something like this:
```
if file in cache:
    touch file                      # update the time for LRU
    return
ok = False
while not ok:
    download file
    ok = checksum_ok(file)
used = du -s cache_directory
while used + size(file) > max_size:
    # sort files in cache by modified time, or just pick the oldest
    # (O(n log n) vs O(n))
    toss out the oldest
    used = du -s cache_directory
put the new file in the cache
```
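Fleshed out, the pseudo-code above could look roughly like this (a sketch, not the real implementation; `download` and `checksum_ok` are stand-ins for whatever we end up using, and the cache path and size cap are made up):

```python
import os

CACHE_DIR = "/var/cache/transcloud"   # hypothetical location
MAX_SIZE = 10 * 1024**3               # 10 GB cap, for illustration

def _entries(cache_dir):
    """Cached files, ignoring in-progress .part downloads."""
    return [f for f in os.listdir(cache_dir) if not f.endswith(".part")]

def cache_size(cache_dir):
    """Total bytes used by the cache (the `du -s` step)."""
    return sum(os.path.getsize(os.path.join(cache_dir, f))
               for f in _entries(cache_dir))

def evict_oldest(cache_dir):
    """Toss the file with the oldest mtime (least recently used)."""
    oldest = min(_entries(cache_dir),
                 key=lambda f: os.path.getmtime(os.path.join(cache_dir, f)))
    os.remove(os.path.join(cache_dir, oldest))

def get_file(name, download, checksum_ok,
             cache_dir=CACHE_DIR, max_size=MAX_SIZE):
    """Return the cached path for `name`, downloading and verifying if needed."""
    path = os.path.join(cache_dir, name)
    if os.path.exists(path):
        os.utime(path, None)          # touch: update mtime for LRU
        return path
    tmp = path + ".part"
    ok = False
    while not ok:                     # re-download until the checksum passes
        download(name, tmp)
        ok = checksum_ok(tmp)
    # evict until the new file fits under the cap
    while cache_size(cache_dir) + os.path.getsize(tmp) > max_size:
        evict_oldest(cache_dir)
    os.rename(tmp, path)              # atomic move into the cache
    return path
```

Downloading to a `.part` file and renaming at the end means a half-finished download never looks like a valid cache entry.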
On Wed, Feb 13, 2013 at 11:31 PM, Chris Matthews notifications@github.com wrote:
Make a local file cache so that files are not downloaded more often than they need to be. The cache should not fill the disk completely; it should be able to check the integrity of its files, since downloads are sometimes corrupted and we don't want those in the cache; and it should have some smart eviction policy (LRU?). It also needs to be accessible by more than one process, so that two workers can run on the same machine at the same time.
One thing we might want to look at is storing MD5 sums (or something similar) for each file in the database, so we can verify the downloaded files.
This should be easy to build and test separately from the rest of the system!
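If the database did hold an MD5 sum per file, the verification step would only be a few lines (a sketch; `expected_md5` stands in for whatever value the database returns):

```python
import hashlib

def md5sum(path, chunk_size=1 << 20):
    """MD5 of a file, read in chunks so large images don't load into RAM."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def checksum_ok(path, expected_md5):
    """True if the downloaded file matches the sum recorded in the database."""
    return md5sum(path) == expected_md5
```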
— Reply to this email directly or view it on GitHub: https://github.com/rickmcgeer/TransCloud/issues/16
That looks good to me. I'm sure we won't have more than a few hundred files, probably more like tens on the smaller nodes, so sorting won't be too bad.
Chris Matthews http://www.christophermatthews.ca
Actually, if we keep the list in sorted order in some persistent structure, all we have to do is update it, but that's an optimization that can't matter: sorting is cheap compared to the download and the file read, and, for that matter, the directory stat to get the modified times.
On the subject of other optimizations, we can actually predict with 100% accuracy which files we'll need by looking at the message queue; we issue the work orders long before they get done. That means we could hit the theoretical optimum, not just LRU. Again, let's put this in our back pocket.
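That "theoretical optimum" is Belady's rule: evict the file whose next use is farthest in the future (or that is never used again). With the pending work orders visible, picking the victim is short (a sketch; `queue` is assumed to be the list of file names in pending work orders, in order):

```python
def evict_candidate(cached, queue):
    """Pick the cached file to evict under Belady's rule: the one whose
    next use in the pending work orders is farthest away, or never."""
    def next_use(name):
        try:
            return queue.index(name)      # position of the next pending use
        except ValueError:
            return float("inf")           # never needed again: ideal victim
    return max(cached, key=next_use)
```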
I'll take a stab at some of the code on the flight. Hampered by no Internet access (i.e., no manuals, Google, w3schools, OpenLayers tutorials... very 20th century).
-- Rick
Actually, I was thinking that we should sort the jobs in the message queues so that they match up to WSR2 zones; that way, back-to-back jobs would almost always use the same image. I will make a new ticket for this.
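The grouping could be as simple as a stable sort on the zone key (a sketch; the job dicts and the `zone` field are hypothetical):

```python
def order_by_zone(jobs):
    """Stable-sort work orders by WSR2 zone so back-to-back jobs
    tend to hit the same cached image."""
    return sorted(jobs, key=lambda job: job["zone"])
```

Because Python's sort is stable, jobs within a zone keep their original submission order.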
Now in gcswift.py, in the FileCache and FileManager classes. Integrated and tested.
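One way to satisfy the multi-process requirement from the original issue is an advisory lock held around cache operations; a minimal sketch using `fcntl.flock` (the `.lock` file name is a placeholder, and this is not necessarily how gcswift.py does it):

```python
import fcntl
import os
from contextlib import contextmanager

@contextmanager
def cache_lock(cache_dir):
    """Hold an exclusive advisory lock while touching the cache, so two
    workers on the same machine don't download or evict concurrently."""
    lock_path = os.path.join(cache_dir, ".lock")
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)   # blocks until available
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

Usage would be `with cache_lock(cache_dir): ...` around the download/evict sequence; `flock` is advisory and POSIX-only, which is fine for these nodes.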