net-lisias-ksp / KSP-Forum-Preservation-Project

Efforts to preserve https://forum.kerbalspaceprogram.com/ for posterity if the worst happens. We are hoping for the best, but expecting the worst.
https://forum.kerbalspaceprogram.com/topic/225368-ksp-forums-archival-options/

A lot of pages have some links "screwed". We need a filter to hot-fix these somehow. #15

Open Lisias opened 1 day ago

Lisias commented 1 day ago

I found these two URLs in my "ALL" report this month (which doesn't mean they weren't there before, I just noticed them today):

https://forum.kerbalspaceprogram.com/%7B___base_url___%7D/index.php?/profile/128696-killashley/
https://forum.kerbalspaceprogram.com/%7B___base_url___%7D/index.php?/profile/42312-alexsheff/

Note the %7B___base_url___%7D substring, which decodes to {___base_url___}. Almost surely it's a missing $ after the opening curly brace.
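For anyone who wants to check the decoding themselves, here is a minimal sketch using only the Python standard library. The variable names are mine; the URL is the first one listed above.

```python
from urllib.parse import unquote

# First broken URL from the report above.
url = (
    "https://forum.kerbalspaceprogram.com/"
    "%7B___base_url___%7D/index.php?/profile/128696-killashley/"
)

# unquote() reverses percent-encoding: %7B -> "{" and %7D -> "}".
decoded = unquote(url)
print(decoded)
# https://forum.kerbalspaceprogram.com/{___base_url___}/index.php?/profile/128696-killashley/
```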

Curious about the issue, and knowing that this kind of problem reproduces like rabbits :P, I coded a quick report of all the occurrences in the current (and WIP) WARCs, and boy, I found a lot (note: the file is in CSV format, ignore anything starting with #): [Uploading url_weirdities.csv…]()

The lowest thread id with the problem is 278, and the highest is 209425.

`cat url_weirdities.csv | grep -Eo 'https://forum\.kerbalspaceprogram\.com/index\.php\?/topic/[0-9]+-' | sed -E 's/^https:\/\/forum\.kerbalspaceprogram\.com\/index\.php\?\/topic\/([0-9]+)-$/\1/' | sort -n | uniq`
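The same extraction can be sketched in Python, in case it ends up living next to the rest of the tooling. This is only an approximation of the pipeline above: `topic_ids` is a name I made up, the sample lines are invented for illustration (the real CSV layout is not shown here), and unlike the shell one-liner it also skips the lines starting with #.

```python
import re

# Same pattern as the grep above: capture the numeric topic id.
TOPIC_RE = re.compile(
    r"https://forum\.kerbalspaceprogram\.com/index\.php\?/topic/([0-9]+)-"
)

def topic_ids(lines):
    """Return the sorted, de-duplicated topic ids found in the given lines."""
    ids = set()
    for line in lines:
        if line.startswith("#"):  # comment lines in the CSV are ignored
            continue
        ids.update(int(m) for m in TOPIC_RE.findall(line))
    return sorted(ids)

# Made-up sample lines, NOT taken from the real report.
sample = [
    "# this line is a comment",
    "https://forum.kerbalspaceprogram.com/index.php?/topic/209425-some-thread/",
    "https://forum.kerbalspaceprogram.com/index.php?/topic/278-another-thread/",
    "https://forum.kerbalspaceprogram.com/index.php?/topic/278-another-thread/",
]
ids = topic_ids(sample)
print(ids[0], ids[-1])
```

With the real file, something like `topic_ids(open("url_weirdities.csv", encoding="utf-8"))` should give the same list as the shell pipeline, minus anything that only appears in comment lines.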

Fixing the problem in the WARC files is out of the question (they need to stay exactly as I fetched them), so we need to find a way to work around these problems.

A filter on the playback machine to detect and fix these will do, but we will need a cache to keep the thing responsive - Python is not exactly the fastest cookie in the jar.
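A rough sketch of what such a filter could look like, using `functools.lru_cache` from the standard library as the cache. Everything here is an assumption on my part: the name `fix_url`, the cache size, and above all the guess that the right fix is simply dropping the unexpanded placeholder segment (that is, that the base URL was meant to be the site root). How this would be hooked into the playback software is not covered.

```python
import re
from functools import lru_cache

# Matches the unexpanded placeholder as one path segment, either
# percent-encoded (%7B...%7D, any letter case) or with raw braces.
_PLACEHOLDER = re.compile(
    r"/(?:%7B|\{)___base_url___(?:%7D|\})", re.IGNORECASE
)

@lru_cache(maxsize=65536)  # cache size is an arbitrary guess
def fix_url(url: str) -> str:
    """Drop the broken placeholder segment; other URLs pass through unchanged.

    ASSUMPTION: the placeholder was meant to expand to the site root,
    so removing the segment yields the intended URL.
    """
    return _PLACEHOLDER.sub("", url, count=1)

broken = (
    "https://forum.kerbalspaceprogram.com/"
    "%7B___base_url___%7D/index.php?/profile/42312-alexsheff/"
)
print(fix_url(broken))
# https://forum.kerbalspaceprogram.com/index.php?/profile/42312-alexsheff/
```

Repeated lookups of the same URL are answered from the cache, so the regex only runs once per distinct URL; `fix_url.cache_info()` shows the hit and miss counts if we want to check whether the cache is earning its keep.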