tallforasmurf / CoBro

A minimal web browser for reading web comics

Too many comics fail hash test #13

Closed tallforasmurf closed 11 years ago

tallforasmurf commented 11 years ago

Many comics fail the hash test even when the image itself is not new.

Test that the hash is actually implemented right.

Gregor, I know, implements its "Random" link by serving a different link target each time the page is loaded, instead of having an unchanging javascript call e.g.

Others are probably varying the ads? Investigate. Is there a pattern? Could part of a comic's page be skipped for the hash test by some type of regular expression?
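One way the RE idea might look, as a minimal sketch only: blank out anything matching a list of "volatile" patterns before hashing. The patterns, the function name `hash_ignoring`, and the choice of SHA-1 are all assumptions for illustration, not what CoBro does; real patterns would have to come from inspecting each comic's page source.

```python
import hashlib
import re

# Hypothetical patterns for markup that changes on every load.
VOLATILE = [
    re.compile(r"Dynamic page generated in [\d.]+ seconds"),
    re.compile(r'href="[^"]*random[^"]*"', re.IGNORECASE),
]

def hash_ignoring(page_text, patterns=VOLATILE):
    """Hash page_text after deleting every match of the given patterns."""
    for pattern in patterns:
        page_text = pattern.sub("", page_text)
    # SHA-1 is an assumption; any stable digest would serve here.
    return hashlib.sha1(page_text.encode("utf-8")).hexdigest()
```

The weakness of this approach is that every comic may need its own patterns, which is why finding a general pattern matters.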

tallforasmurf commented 11 years ago

By way of study, write a test function that reads the same URL twice and hashes each read. Do the hashes differ even when the reads are back to back?
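A rough sketch of such a test function, assuming plain `urllib.request` for the reads and SHA-1 for the hash (both are assumptions; the names `fetch` and `reads_differ` are made up here). The fetcher is passed in so the check can be exercised without touching the network.

```python
import hashlib
import urllib.request

def fetch(url):
    """Return the raw bytes served at url."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()

def reads_differ(url, fetcher=fetch):
    """Fetch url twice in quick succession; True if the two hashes differ."""
    first = hashlib.sha1(fetcher(url)).hexdigest()
    second = hashlib.sha1(fetcher(url)).hexdigest()
    return first != second
```

Calling `reads_differ(url)` for each comic in the list would show which pages change between two immediate reads.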

If so, consider building a bit mask of the lines that don't vary between two quick reads. Save that and use it to filter the lines fed to the hash?
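The bit mask idea could be sketched roughly like this. The function names and SHA-1 are assumptions for illustration, and the sketch assumes the page keeps the same number of lines from read to read; lines beyond the end of the mask are simply ignored.

```python
import hashlib

def stable_mask(read_a, read_b):
    """Return one bool per line: True where two quick reads agree."""
    lines_a = read_a.splitlines()
    lines_b = read_b.splitlines()
    return [a == b for a, b in zip(lines_a, lines_b)]

def masked_hash(page_text, mask):
    """Hash only the lines of page_text that the saved mask marks stable."""
    hasher = hashlib.sha1()
    for line, keep in zip(page_text.splitlines(), mask):
        if keep:
            hasher.update(line.encode("utf-8"))
    return hasher.hexdigest()
```

The mask would be computed once per comic and saved; later reads are hashed through it. If a page gains or loses a line, the mask no longer lines up and would need to be rebuilt.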

tallforasmurf commented 11 years ago

Good news and bad news. For A Multiverse, this is the consistent comparison from successive reads:

Nonmatching Lines (File amv2.txt; Line 13:14; File amv1.txt; Line 13:14)
Nonmatching Lines (File amv2.txt; Line 228; File amv1.txt; Line 228)
Nonmatching Lines (File amv2.txt; Line 276; File amv1.txt; Line 276)
Nonmatching Lines (File amv2.txt; Line 282:283; File amv1.txt; Line 282:283)

So, the bad news: the differences are not just at the top of the file, so forget doing head truncation. The good news: for this comic at least, it's always the same lines that differ, with tiny differences in javascript code and crap like "Dynamic page generated in 0.524 seconds".

tallforasmurf commented 11 years ago

Gregor: 1 line, implements random comic
Happle tea: one stupid line
Megacynics: two stupid lines like that
Toothpaste for dinner: one line, not clear what it is doing
Hark a vagrant: two lines, each an instance of the random comic button
Robbie & Bobbie: 1 line, some kind of email protection hash code
SMBC: one time came up with diffs and then didn't
Questionable Content: 2 lines, random comic
Dinosaur comics: 2 lines, related to random things on the page
Candorville: 1 stupid line like Happle

All others came back identical on two reads.

tallforasmurf commented 11 years ago

So: read a URL twice into separate strings, then open the strings as files and use readline to inspect them line by line. (Ooh! Open two requests and read them by lines??)

Where lines are identical, shove one into the hash machine. At the end, take that signature.
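The plan above might be sketched as follows, with `io.StringIO` standing in for "open the strings as files". The name `stable_signature` and the use of SHA-1 are assumptions, and the two reads are taken as already-decoded strings.

```python
import hashlib
import io

def stable_signature(read_a, read_b):
    """Hash only the lines that are identical in two reads of one page."""
    file_a = io.StringIO(read_a)
    file_b = io.StringIO(read_b)
    hasher = hashlib.sha1()
    while True:
        line_a = file_a.readline()
        line_b = file_b.readline()
        # readline() returns '' only at end of file, so test for that
        # before stripping, or a blank line would look like the end.
        if not line_a or not line_b:
            break
        line_a = line_a.strip()
        line_b = line_b.strip()
        if line_a == line_b:
            hasher.update(line_a.encode("utf-8"))
    return hasher.hexdigest()
```

Because the varying lines never reach the hasher, two pairs of reads of an unchanged comic should yield the same signature, while a new image line that appears in both reads changes it.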

tallforasmurf commented 11 years ago

Moved the strip of each line to after the test for end of file, duh.