satazor / js-spark-md5

Lightning fast normal and incremental md5 for javascript
Do What The F*ck You Want To Public License

Incremental updates using existing md5 string? #24

Closed. silverbucket closed this issue 9 years ago.

silverbucket commented 9 years ago

I see examples for incrementally updating an md5 based on incoming chunks of a file, using append. I was wondering if there is a way to give SparkMD5 an existing md5 string of the file up to the point where you will provide the next chunk of data.

For example, suppose you are resuming a previously aborted operation and you know you have computed the first 10 of 15 chunks. If you have the md5 of those first 10 chunks stored in the browser cache, you should be able to give SparkMD5 that md5 string, then send it the 11th chunk, and so on.

Is that possible?

silverbucket commented 9 years ago

ping @satazor ... is this possible to do with SparkMD5?

satazor commented 9 years ago

Yes, you can use buffers or strings; check this spec for an example with a string: https://github.com/satazor/SparkMD5/blob/master/test/specs.js#L61

satazor commented 9 years ago

Oh sorry, I didn't quite understand your question at first. I'm afraid it's not possible at the moment, but it's definitely possible with some changes to the code and a way to get and set the internal state.

silverbucket commented 9 years ago

@satazor any suggestions on how I could achieve this? I'm willing to give it a shot and submit a pull request if you aren't able to at the moment, but some advice from someone familiar with the code would help get me going in the right direction.

satazor commented 9 years ago

Hey @silverbucket

I've made it possible in the state branch.

There are two new functions: getState() => state and setState(state). Please test them and see how they work. At the moment I'm short on free time, so I would appreciate it if you could add tests and documentation to the README on that feature branch.

Will wait for your feedback.
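
For illustration, a minimal sketch of how those two functions might be used within a single session (hypothetical usage; the chunk variables are placeholders):

// Hypothetical usage of the getState()/setState() pair described above.
var spark = new SparkMD5.ArrayBuffer();
spark.append(chunk1);         // hash the first chunk

var state = spark.getState(); // capture the internal state

var resumed = new SparkMD5.ArrayBuffer();
resumed.setState(state);      // restore the captured state
resumed.append(chunk2);       // continue where we left off

console.log(resumed.end());   // final md5 of chunk1 + chunk2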

silverbucket commented 9 years ago

@satazor awesome! I will have a look at this today and let you know how it goes. I'm happy to add tests and docs assuming everything goes well.

silverbucket commented 9 years ago

Hi @satazor - I had a look at this, and perhaps there's still a misunderstanding of what I'm looking for. The functionality you added requires a binary blob as one of its parameters, which means the entire file must be resident in memory.

What I was asking about is providing the md5 string of the data so far, and computing the next piece without needing the entire file.

Let's say you have 1 file split into 5 chunks, and you want to continue to generate your md5 sum as the incoming chunks arrive.

Now, let's say at this point the transfer terminates, the page reloads, or something else happens that aborts the file transfer. My question is: could we continue to generate the final md5 sum when the page reloads and we resume the transfer, by providing the latest md5 sum that we have (b79276e973a320c6b53cb09c69beb587)?

So the final md5 checksum (d85bdd9bc25a76431a7a17c13bdbc9fa) would match the checksum we'd get by running md5 over the completed downloaded file as a whole.

I've read this is possible (incremental checksumming), but it appears that currently SparkMD5 keeps the entire file within its object, meaning we have two copies of the file being downloaded resident in memory (one for the file transfer itself, the other from SparkMD5's .append() behavior). Is this correct? It's also quite possible I misunderstand the capabilities of incremental checksumming.

silverbucket commented 9 years ago

I've heard people refer to Adler/Fletcher checksums as doing this, mostly for error detection of corrupt packets, etc.

satazor commented 9 years ago

> Hi @satazor - I had a look at this, and perhaps there's still a misunderstanding of what I'm looking for. The functionality you added requires a binary blob as one of its parameters, which means the entire file must be resident in memory.

No, the buffer is only 64 bits maximum. MD5 works by computing 64 bits at a time. Every time .append() is called, it concatenates what you pass in onto a buffer. Then, if the buffer holds at least 64 bits, spark computes one 64-bit block per cycle and removes that chunk from the buffer (it repeats until the buffer has fewer than 64 bits left to consume). This means the buffer will contain residual bits that will be used in the next append(). See: https://github.com/satazor/SparkMD5/blob/state/spark-md5.js#L344

I understood what you are asking for, but spark cannot compute the md5 after each chunk, because the tail of the whole input must be computed differently (that's what end() does).

For this to work as you want, you must call .getState() and store it somewhere, perhaps in local storage, and then call .setState() to resume from the previously known state. Again, there's no problem storing it in local storage, because it will contain a maximum of 64 bits, plus a bit more because the object also contains the length and the current hash.
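
In pseudocode, the buffering behaviour described two paragraphs up is roughly the following (a simplified sketch, not the actual spark-md5 source; md5cycle stands in for the per-block computation, and the buffer is treated as a plain array for clarity):

// Simplified sketch of append()'s buffering behaviour.
function append(state, incoming) {
  state.buff = state.buff.concat(incoming); // queue the new data

  // Consume full blocks; anything left stays for the next append().
  while (state.buff.length >= 64) {
    md5cycle(state.hash, state.buff.slice(0, 64)); // compute one block
    state.buff = state.buff.slice(64);             // drop the consumed block
  }
  state.length += incoming.length; // total input length, needed by end()
}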

satazor commented 9 years ago

If you are having problems understanding what I'm trying to say, I can make you a quick example of getState() and setState(), plus local storage.

silverbucket commented 9 years ago

Aha, I see. So the binary object passed in during setState() is just the latest 64kb of the file, along with the length and array values?


satazor commented 9 years ago

Exactly, and it's not 64kb, it's 64 bits, which is really small.

silverbucket commented 9 years ago

@satazor aha, OK, now I understand much better. And if the chunk is larger than 64 bits, does anything break, or will SparkMD5 just ignore the previous data?

satazor commented 9 years ago

If you call append with a chunk of 16000000 bits (2 megabytes), it will execute the for loop 250000 times and leave behind a buffer of 0 bits.

If you call append with a chunk of 16000020 bits (2 megabytes plus 20 bits), it will execute the for loop 250000 times and leave behind a buffer of 20 bits.

SparkMD5 cannot ignore data, otherwise the computed hash won't be correct.
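
The arithmetic can be sanity-checked directly (a throwaway snippet, using the 64-unit block size from the explanation above):

var chunk = 16000020;                // units appended
var cycles = Math.floor(chunk / 64); // 250000 full blocks consumed
var leftover = chunk % 64;           // 20 left sitting in the buffer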

silverbucket commented 9 years ago

OK, so I'll make sure to slice it to exactly the latest 64 bits. Thanks! I'm still working on my own integration and testing; I'll submit a PR for tests and docs when I'm certain it's working.

satazor commented 9 years ago

Yes, that will always leave you with an empty buffer. Can I ask why it's an issue to store the buffer?

satazor commented 9 years ago

You should be able to just do JSON.stringify(spark.getState()) and store it in local storage, and then resume with spark.setState(JSON.parse(storedvalue)).
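
Spelled out, that round trip might look like this (a sketch assuming the state branch API; the chunk variables and storage key are placeholders):

var spark = new SparkMD5.ArrayBuffer();
spark.append(firstChunk); // hash what has arrived so far

// Persist the hasher state so it survives a page reload.
localStorage.setItem('md5state', JSON.stringify(spark.getState()));

// ...after the reload, pick up where we left off...
var resumed = new SparkMD5.ArrayBuffer();
resumed.setState(JSON.parse(localStorage.getItem('md5state')));
resumed.append(nextChunk);
var hex = resumed.end(); // final md5 once all chunks are appended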

silverbucket commented 9 years ago

That's true; if I already have to store the other two properties, I may as well store the buffer as well, since it only replicates the latest 64 bits. The reason I didn't previously is that I incorrectly assumed the buffer needed to hold all the existing file data.

So, I will need to store the getState() output along with the chunk number attributed to the latest state. That way, during page load, if more chunks are stored than I have recorded md5 progress for, I can get the missing chunks out of IndexedDB and catch up.
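
That catch-up flow might look roughly like this (a sketch; the helper and field names are hypothetical, and it assumes chunks can be fetched from IndexedDB by index):

// Re-hash any chunks stored after the saved state was captured.
async function resumeHash(saved, storedChunkCount) {
  var spark = new SparkMD5.ArrayBuffer();
  spark.setState(saved.state); // saved = { state, chunkIndex } (hypothetical shape)

  for (var i = saved.chunkIndex + 1; i < storedChunkCount; i++) {
    spark.append(await getChunkFromIndexedDB(i)); // hypothetical fetch
  }
  return spark; // ready to append the next incoming chunk
}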

satazor commented 9 years ago

Yep. Let me know how it goes.

silverbucket commented 9 years ago

OK, so I gave this a shot, and when I load in previous state data via setState I get the error Uncaught RangeError: Source is too large. It points to https://github.com/satazor/SparkMD5/blob/state/spark-md5.js#L594. The state I'm loading looks like this:

{
  buff: {
    0: 152,
    1: 236,
    2: 242,
    3: 228,
    4: 227,
    5: 122,
    6: 9,
    7: 54,
    8: 225,
    9: 69,
    10: 88,
    11: 21,
    12: 19,
    13: 78,
    14: 133,
    15: 255,
    16: 0,
    17: 7,
    18: 255,
    19: 217
  },
  hash: [
    1518490340,
    1206410578,
    -1075573721,
    1028990893
  ],
  length: 22356
}

satazor commented 9 years ago

Hmm, can you provide a working example where I can reproduce and fix it?

satazor commented 9 years ago

@silverbucket I've updated the branch, can you try it? It should be OK now; I've added tests for getState and setState and they are all passing.

Btw, I made a mistake in the explanation above: it's not 64 bits, it's 64 bytes (512 bits).

The README is also updated. I will wait for your feedback before merging.

silverbucket commented 9 years ago

That fixed it, thanks! I'm still getting mismatched md5 checksums upon reload, but this could be an issue with my code. I'll keep working through it, and if I still have issues I'll try to create a simplified example; since your tests are passing, I assume the problem is on my end. You've already done the tests and updated the README, so let me know if there's anything else I can do to help.

satazor commented 9 years ago

Ok, I will be waiting. There's nothing important to be done, thanks!

satazor commented 9 years ago

Bump.

silverbucket commented 9 years ago

I literally just finished confirming that everything is working now (I had a long weekend vacation). The issue I was having turned out to be cases where the md5 state wasn't being reset correctly (an error in my code), so subsequent downloads of the same file started off with stale data. I fixed all that and can confirm that the state additions you made are working perfectly, so I think it's good to merge. Thanks again for your work!

satazor commented 9 years ago

Great!