robx / puzzledb

Puzzle archive / search engine

Web crawler should strip invalid characters from the url #23

Open edderiofer opened 4 years ago

edderiofer commented 4 years ago

As per the title. These three Geradeweg puzzles aren't being marked as solved in the DB, even after solving them. The DB currently shows zero solves for each of them, despite plenty of people having solved them on PuzSQ via the puzz.link interface.

http://puzsq.sakura.ne.jp/main/puzzle_play.php?pid=3527
http://puzsq.sakura.ne.jp/main/puzzle_play.php?pid=3528
http://puzsq.sakura.ne.jp/main/puzzle_play.php?pid=3312

edderiofer commented 4 years ago

Same is true of this Nagare puzzle:

http://puzzleblog542.blog.fc2.com/blog-date-20160214.html

robx commented 4 years ago

The nagare link has broken HTML (no opening quote on the href, but a trailing one):

<a href=http://pzv.jp/p.html?nagare/10/10/a4j7b81a7b7b8b5n7b82a7b7b8b8n9b7b7b5a16b6l"

but I'm not sure how that would cause this issue.

robx commented 4 years ago

The geradewegs have a trailing space:

<a href="https://puzz.link/p?geradeweg/10/10/h1g1g1g1q1k2h1g2g11h2j2g1j11i1m2h1g2g1g2n1i1j1g2j1g1g "

robx commented 4 years ago

I've fixed these (and a couple more instances) in the db now, but the crawler still needs to be fixed to deal with this.

robx commented 4 years ago

For future reference: the invalid characters in these cases were trailing double quotes and trailing spaces.
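
The cleanup for those two cases could be sketched roughly as below. This is only an illustration in Python, not the crawler's actual code: `clean_href` is a hypothetical helper name, and stripping single quotes and other whitespace as well is an assumption beyond what was observed here.

```python
def clean_href(raw: str) -> str:
    """Strip stray quotes and whitespace from both ends of an href value.

    Hypothetical helper: the double quote and the space are the characters
    observed in this issue; tab, newline and single quote are added as an
    assumption about what else broken HTML might leave behind.
    """
    return raw.strip(" \t\r\n\"'")


if __name__ == "__main__":
    # The geradeweg link from above, with its trailing space.
    broken = (
        "https://puzz.link/p?geradeweg/10/10/"
        "h1g1g1g1q1k2h1g2g11h2j2g1j11i1m2h1g2g1g2n1i1j1g2j1g1g "
    )
    print(repr(clean_href(broken)))
```

Only the ends of the string are touched, so characters inside the puzzle data are left alone.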

edderiofer commented 4 years ago

Here's another case: fullwidth forms in the URL: https://puzsq.jp/main/puzzle_play.php?pid=2833
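
One possible way to handle fullwidth forms, sketched in Python under the assumption that folding them to their ASCII equivalents is acceptable: Unicode NFKC normalization maps fullwidth letters, digits and punctuation to the ordinary characters. `normalize_fullwidth` is a hypothetical name, and the sample string is made up for illustration rather than taken from the linked page.

```python
import unicodedata


def normalize_fullwidth(url: str) -> str:
    """Fold fullwidth (and other compatibility) characters via NFKC.

    Hypothetical helper. Note that NFKC is a blunt tool: it also rewrites
    other compatibility characters such as ligatures and superscripts,
    which may or may not be wanted for puzzle URLs.
    """
    return unicodedata.normalize("NFKC", url)


if __name__ == "__main__":
    # Made-up example written entirely in fullwidth forms.
    sample = "ｈｔｔｐｓ：／／ｐｕｚｚ．ｌｉｎｋ／ｐ？ｎｏｒｉｎｏｒｉ"
    print(normalize_fullwidth(sample))
```

Plain ASCII URLs pass through unchanged, so this could run on every extracted link before any other cleanup.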

edderiofer commented 3 years ago

This one also isn’t working for some reason (hyphen in URL?): http://kurotento.blog.fc2.com/blog-entry-156.html

https://puzz.link/p?norinori/10/10/g1i89g5f5u3sn0t4g07svvu8bt1o1ubs7ufv

Wait, no, this one seems to have fixed itself now. Odd.

edderiofer commented 1 year ago

Another puzzle with 0 solves that seems to be anomalous: http://blog.livedoor.jp/bachelor_seal-puzzle/archives/86923975.html

robx commented 1 year ago

> Another puzzle with 0 solves that seems to be anomalous: http://blog.livedoor.jp/bachelor_seal-puzzle/archives/86923975.html

Seems fine?

(Also this would be a separate issue, since new bachelor seal puzzles come in through the blog feed, not the inactive web crawler.)

edderiofer commented 1 year ago

Yeah, I guess I was probably just too hasty and caught the puzzle when it was at 0 solves.