Add count to chunk_header

winksaville commented 6 years ago

Chunk header is now: "^<seqnum>/<count><space> " and paperrestore.sh validates that seqnum is monotonically increasing and end with count as its final value. If any error is found the missing seqnum's are reported.

intra2net commented 6 years ago

Thanks for sending the pull request.

Did you experience any problems during restore or was the motivation behind writing the patch to just improve user experience without having suffered from finding a missing chunk yourself?

There is a comment mentioning ^,, but it actually is ^/ in your code. Also please modify the Readme.md where the sequence is mentioned too.

I don't like the creation of the temporary files $raw_file $zero_file $sorted_file in the current directory. We don't expect huge amounts of data being restored this way, so there shouldn't be a performance problem using variables in ram for this.

One goal of the program was to make the restore script as simple as possible, so that it would be easy to adapt it to a far-future system, like a Linux from for example 25 or more years in the future. While your additions give better user feedback, they also increase the complexity considerably.

So maybe you could make the restore more simple? I think just telling the user that a chunk is missing would help. This could probably be implemented simply by extracting the total chunk count from the ^-sequence and comparing it to the output of something like tr -cd '^' | wc -c, right?

If that simple comparison finds a problem, you could jump to a function which simply does a integer increment and greps for each ^-sequence. Execution time should not matter in that case, I prefer code simplicity for that error case.

intra2net commented 6 years ago

One more thing just came to my mind: A new version of paperrestore should preserve reverse compatibility with all existing paperbackups out there.

Changing the ^-sequence style and recognizing both versions increases the complexity of the code. How about just adding an end marker with the count, like ^zzz:37 for 37 chunks?

If you find the end marker, you can check if all chunks are there, if you don't find it, it either got lost or it is an old version of paperbackup. I guess the probability that the end marker is lost and most of the other chunks are there is low enough that a more simple code makes that small loss of detailed error report worthwhile.

winksaville commented 6 years ago

I'm on vacation until May 17th, so I won't be able to make any changes till I get back. To answer your first question, yes I was unable to restore a file and had to manually search for the missing chunk, it was definitely doable but I thought it could be made easier and came up with what you see.

Backwards compatibility would be nice, but for me personally it isn't necessary and I wasn't sure you were still supporting the code as the other PR had no comments. So I did what you see as it was relatively straight forward and improved the robustness.

I'm sure my code could be made sure simpler and, of course, you're welcome to come take all, none or some of the ideas/implementation of what I've done. But, I definitely feel something should be provided to help the user years from know that specifically there was a problem restoring and where the problem was.

As I said I won't be able to do any work until after I get back from vacation, but maybe between now and then we can agree on a course of action.

--wink

intra2net commented 6 years ago

Ok, I understand. I think providing the check for missing chunks is a good idea.

I'd really like to provide backwards compatibility, because I think there are a bunch of paperbackups currently out there, at least I have a thick stack. I add a new one every few months, so there would be old and new ones mixed. Having to check out different versions of paperrestore for them would really make accessing them more inconveniant than it already is, especially for someone else who did not develop it.

About the course of action: do you agree on my idea with the end marker and having the chunk count only there?

No worries about time, I see no reason to hurry this.

Enjoy your vacation!

winksaville commented 6 years ago

No, having a single entry is vulernable to getting lost.

Basically we need additional meta datathe technique I implented here was to have the meta data in every record. This maximizing reduncy and the easiest thing to do would be to add a version number in addition to the count, but that isn't very scalable

How about this idea, we reserve sequence number zero for meta data whose content is json, yml or just name value pairs. Initially we define two fields, "ver" and "cnt".

In addition we allow multiple copies thus giving the reduncy needed, I suggest at least 3 to 5 copies of identical data. When we sort we easily remove the redundant entries.

This provides for scalability and reduncy without over duplication. I suggest not all duplicates be at the end, but one at the beginning one at the end and the others randomally distributed amongst the other data.

Your thoughts?

intra2net commented 6 years ago

No, having a single entry is vulernable to getting lost.

But you don't lose data in this case, just the conveniance of instantly getting feedback about the missing chunk numbers. If most of the cunks are missing, the point is moot and you always end up getting the data into shape manually. The new code just helps you if few chunks are missing.

I don't like the idea with the json-chunks, as this adds additional complexity. Parsing it requires additional tools like jq in paperrestore, or a complete reimplementation of paperrestore in python. Complete language stacks like python tend to have a much shorter lifespan than a Bourne-compatible shell and some standard shell utils like sed and sort. I'd really like to keep paperrestore as simple as possible.

winksaville commented 6 years ago

So we can just use line seperated name value pairs like:

ver=1 cnt=109

But do feel the reduncy is important to maximize the detection of errors, of course YMMV.

tuxlifan commented 6 years ago

How about adding an optional argument to paperrestore.sh (or a new command, see below) which allows the human to specify the total number of bar codes? In the printout above each barcode we already specify "key.asc (n/M)" so it should easily survive in an analog context.

Another possibility I see would be to introduce a third program paperfix or paperrestore-fix which could be more complex and provide good feedback and even ways to fix missing chunks (at the cost of in case of paperfix 'polluting' the $PATH a bit more). i.e. if paperrestore.sh badscan.pdf fails then paperfix --total=27 badscan.pdfcould provide additional information about the failure location and maybe even ask the user to type in specific lines or partial lines from the plaintext printout to recreate missing chunk(s). For example:

In the plaintext printout find line 8 [starting with 'BPsqEJ'] and type in everything in the left column after '706S7nK8' until 'S8igyTV1' in line 10 [starting with 'J8HHE0'], ending with an empty line:"

In general I agree with @intra2net that paperrestore.sh should stay very simple so that it is possible to easily type out all the commands by hand if need be.

To (manually) detect errors in the restored file the checksum of the original should be included in the printout (I have an example with b2sum in my reportlab branch which I just pushed, together with a script (using pdftotext) to verify that the generated PDF is good to begin with.)

Regarding redundancy: I consider the plaintext printout already sufficient for my use case since it is a full duplicate, albeit (at the moment) very user un-friendly.

intra2net commented 6 years ago

I like the idea with a separate tool which helps in case of errors. This tool can then be more complex than paperrestore and could also be written in python. Also supplying the chunk count on the commandline is easy and even without it, paperrestore-fix could detect missing chunks in between.

To (manually) detect errors in the restored file the checksum of the original should be included in the printout (I have an example with b2sum in my reportlab branch which I just pushed, together with a script (using pdftotext) to verify that the generated PDF is good to begin with.)

How does the b2sum help? Every line in the printout already has it's own md5 sum at the end, so testing if a line is valid or not should be easy. Or do I miss something here?

winksaville commented 6 years ago

I guess maintaining separate programs seems like a greater maintanance problem as changes need to be coordinated in two programs. But it would certainly work.

My other comment is that from my perspective errors aren't that rare as I could not run a successful backup and restore the first time I used it. And that was without scanning, just restoring the generated PDF failed. Of course I was probably just unluck, but it forces me to lean against two programs and drove me to create this PR.

intra2net commented 6 years ago

I guess maintaining separate programs seems like a greater maintanance problem as changes need to be coordinated in two programs.

The concept proposed by @tuxlifan doesn't need any change in the data format.

Tough we could extend it with the end marker I proposed above. paperrestore-fix would then be able to tell you e.g. "chunks 27 and above missing", without the user specifying any kind of chunk counter and the end marker missing.

The necessary change in paperrestore.sh to filter out the end marker would be minimal.

And as long as I stay picky about changes like in this thread ;) I don't see a lot of increased complexity in paperrestore.sh, resulting in minimal maintenance required.

My other comment is that from my perspective errors aren't that rare as I could not run a successful backup and restore the first time I used it. And that was without scanning, just restoring the generated PDF failed.

Are you able to reproduce this problem with some kind of special data pattern?

I always test the produced pdf with paperrestore and diff (like shown in the readme) and never had this problem. I also never heard of it before and I guess there are several other users of paperbackup out there.

winksaville commented 6 years ago

I'll see if I still have failing data when I get back. And It failed conistently using the technique described in the readme.

I'll go ahead and close this PR and if I do have the data I'll create an issue, txs.

intra2net commented 6 years ago

Maybe it is an issue with specific versions of zbar or LaTex/PyX. So you could try this data with other versions of them, e.g. by using a live cd, docker container or similar of another distro.

tuxlifan commented 6 years ago

@intra2net

To (manually) detect errors in the restored file the checksum of the original should be included in the printout.

How does the b2sum help? Every line in the printout already has it's own md5 sum at the end, so testing if a line is valid or not should be easy. Or do I miss something here?

Sorry, I think I wasn't clearly enough changing the subject from paperrestore-fix and partially restoring some junks to talking about the full output, I.e. the b2sum would provide a means for the user to verify (by running b2sum restored.asc and visually comparing the output to the printed b2sum) that the restored data created from reading the bar codes is complete; paperrestore.sh (at least the version in master with all the pipes after zbarimg) will happily output all the data it can read but would not recognize missing/mangled/etc. chunks - so only when trying to import back into gpg one would get a CRC error.

@winksaville I also experienced such failure when test restoring a generated PDF once which prompted me to include the b2sum, though strangely when generating from the same file again it came out perfect and unfortunately I didn't think soon enough to save the bad example.

PS: I'll see if I can put in a different PR tomorrow with the b2sum stuff to play with.

intra2net / paperbackup

Add count to chunk_header #3