Add hashing for verifying correct input of code

Chillee commented 5 years ago

See https://github.com/kth-competitive-programming/kactl/issues/63#issuecomment-485977991

I don't think that hashing sections is worth it. MIT hashes sections in 8 snippets: LCT, LinearRecurrence, Simplex.h, Polynomial, CycleCounting, GraphDominator, and both suffix arrays.

I would split that into:

- "Should be split into different sections": Polynomial, CycleCounting
- "Trying to avoid hashing the typedef": LinearRecurrence
- "Has parts that you don't always want": both suffix arrays (i.e., you don't always need LCP)
- "Not sure": LCT, Simplex, and GraphDominator (I don't know enough about the algorithms to understand whether you pretty much always want all functions)

That's a maximum of 4 snippets where it might be advantageous to have section-wise hashing.

The other argument for hashing sections is that if the hash fails, then you need to look at less of your code. I haven't done many offline contests with a TCR, but from my experience, knowing that you have a mistype in 50 lines of code is only marginally better than knowing you have a mistype in 100 lines of code. Both of these are massively better than not knowing whether you have a mistype or a logic error.

If we were to hash by section, I would propose having some kind of lightweight syntax (like a //<-- ) to demarcate sections, and then putting the hashes (truncated to 5 characters) in the header.

Like so:

Another question with hashing is how we deal with things like typedefs, especially typedefs that are likely to be typed multiple times (for example, typedef vector<ll> Poly). I don't think it's a big deal; I would suggest just getting used to typing them in for the purpose of hashing.

My biggest problem with avoiding them automatically is ambiguity with what hashes represent. "We hash everything that's printed" is obvious. "We hash everything after the typedefs" is less obvious.

simonlindholm commented 5 years ago

There are actually a fair number of cases where you might/will type in only parts of the code: Treap, FastSubsetTransform (that one's weird), euclid, chinese, 2sat, TreePower, HLD (on the chopping block), sideOf, Angle, KMP, SuffixTree, Hashing, AhoCorasick, IntervalContainer. And in several more I can imagine that the 100->50 line reduction is handy. So if we could come up with some slick UI for indicating sections I'd be all for it. I agree with your comment about ambiguity, though, and I think we can start simple.

ecnerwala commented 5 years ago

Just a note: I updated the hash script in our book to include the -dD flag, which preserves macro definitions. It's now cpp -dD -P -fpreprocessed | tr -d '[:space:]' | md5sum -

ecnerwala commented 5 years ago

> The other argument for hashing sections is that if the hash fails, then you need to look at less of your code. I haven't done many offline contests with a TCR, but from my experience, knowing that you have a mistype in 50 lines of code is only marginally better than knowing you have a mistype in 100 lines of code. Both of these are massively better than not knowing whether you have a mistype or a logic error.

I think knowing you have a mistype in 50 vs 100 lines of code is actually linearly (~2x) better for finding the bug, which amounts to maybe 5 minutes of time (and feeling a lot happier).

ecnerwala commented 5 years ago

Also, I'll note that we would've hashed sections in more files if we used them more/weren't too lazy to add the annotations. Honestly, we mostly used kactl for the stuff we added (which we broke into sections) and the geometry (which is short to begin with).

simonlindholm commented 5 years ago

Thanks for the note, I've made that change: https://github.com/kth-competitive-programming/kactl/commit/dcdc34aeb59dc8e52eafdf20f4b6f6926078d378 (note also the golfed vimrc: ca Hash w !cpp -dD -P -fpreprocessed \| tr -d '[:space:]' \| md5sum \| cut -c-6)

lrvideckis commented 5 months ago

Hi, I want to propose an idea for "partial hashes", which was passed on to me by https://codeforces.com/profile/camc

Let's say you want a struct:

struct LCA {
...
    LCA(vector<vi>& C) : time(sz(C)), rmq((dfs(C,0,-1), ret)) {}
    void dfs(vector<vi>& C, int v, int par) {
...
    }

    int lca(int a, int b) {
        if (a == b) return a;
        tie(a, b) = minmax(time[a], time[b]);
        return path[rmq.query(a, b)];
    }
    int dist(int a, int b) {return depth[a] + depth[b] - 2*depth[lca(a,b)];}
    bool inSubtree(int a, int b) {return time[a] <= time[b] && time[b] < timeOut[a];}
    int nodeOnPath(int u, int v, int w) {...}
...
};

You can split it up like this:

LCA.h:

struct LCA {
...
    LCA(vector<vi>& C) : time(sz(C)), rmq((dfs(C,0,-1), ret)) {}
    void dfs(vector<vi>& C, int v, int par) {
...
    }
#include "lcaFunc.h"
#include "dist.h"
#include "inSubtree.h"
#include "nodeOnPath.h"
};

lcaFunc.h:

#pragma once
    int lca(int a, int b) {
        if (a == b) return a;
        tie(a, b) = minmax(time[a], time[b]);
        return path[rmq.query(a, b)];
    }

dist.h:

#pragma once
    int dist(int a, int b) {return depth[a] + depth[b] - 2*depth[lca(a,b)];}

inSubtree.h:

#pragma once
    bool inSubtree(int a, int b) {return time[a] <= time[b] && time[b] < timeOut[a];}

... etc

Now each member function is in its own file and thus has its own hash. Furthermore, you type exactly what you need: if you only need the lca function, you type only that, verify its hash, then copy it into the struct.

If you need lca, dist, and inSubtree, you type all three, verify all their hashes, then copy them into the struct.

Furthermore, the include statements tell you exactly where to put the member functions.

lrvideckis commented 5 months ago

Now you don't want to force the user to type those include statements, so for me, when I generate the .pdf, I have this in a script:

contest/hash.sh:

tr -d '[:space:]' | md5sum | cut -c-6

generate_pdf.sh:

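shopt -s globstar  # let ** match nested directories
for header in ../content/**/*.h; do
    # Hash the file without its #include lines, so nobody has to type them.
    hash=$(sed '/^#include/d' "$header" | cpp -dD -P -fpreprocessed | ./../contest/hash.sh)
    # Prepend the hash as a comment on line 1 (GNU sed syntax).
    sed --in-place "1i //hash: $hash" "$header"
done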
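One caveat with the loop above is that running it twice leaves two //hash: lines in each header (the value itself stays the same, since cpp strips the comment before hashing). A possible variation, sketched below with a made-up stamp_header function and a throwaway directory, removes any earlier hash line first. It assumes bash, GCC's cpp, and GNU coreutils, and is not taken from the actual script.

```shell
#!/usr/bin/env bash
# Hypothetical variation of the stamping loop: safe to run more than once.
hash6() { cpp -dD -P -fpreprocessed | tr -d '[:space:]' | md5sum | cut -c-6; }

stamp_header() {
  local header=$1 body hash
  # Drop a previous "//hash:" line so repeated runs do not stack them up.
  body=$(sed '/^\/\/hash: /d' "$header")
  # Hash without the #include lines, as in the loop above.
  hash=$(printf '%s\n' "$body" | sed '/^#include/d' | hash6)
  # Rewrite the file with the fresh hash comment on line 1.
  printf '//hash: %s\n%s\n' "$hash" "$body" > "$header"
}

# Demo on a temporary file (made-up content).
dir=$(mktemp -d)
cat > "$dir/LCA.h" <<'EOF'
struct LCA {
#include "lcaFunc.h"
#include "dist.h"
};
EOF
stamp_header "$dir/LCA.h"
stamp_header "$dir/LCA.h"   # second run replaces the line instead of adding one
cat "$dir/LCA.h"
```

Since the #include lines are removed before hashing, the stamped value matches what someone gets after typing only the struct body, regardless of which member-function files they choose to include.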

lrvideckis commented 5 months ago

Furthermore, if you use something like the expander script for Codeforces rounds, where you can copy-paste, this method should still work.

lrvideckis commented 5 months ago

For example, you can split out the Fenwick tree lower bound https://github.com/kth-competitive-programming/kactl/blob/main/content/data-structures/FenwickTree.h#L24 since you rarely need that function.

Another example is CompressTree:

https://github.com/kth-competitive-programming/kactl/blob/main/content/graph/CompressTree.h#L18

There, LCA& lca is passed in as a parameter. Instead, you could make compressTree a member function of LCA and split up the files using this trick. Then there is no need to pass in lca as a parameter, and the call syntax changes from lca.lca(a, b) to lca(a, b).

Add hashing for verifying correct input of code #72