s2ward / 469

The bonelord language in Tibia

Script for book overlap analysis and combination through various heuristics. #11

Closed Sopel97 closed 2 years ago

Sopel97 commented 2 years ago

First, a rough stream of my investigations and thoughts

I'm not sure how much of the internet has been scoured, but I went through most of https://torg.pl/tibia/180777-winged-helmet-quest-niemal-kompletny-spoiler.html?highlight=Winged+helmet and https://torg.pl/tibia/265100-hellgates-library-469-spoilery.html (sources in Polish), and while there's a lot of bait, numerology, and other uninteresting stuff, it's a good place to spark some thoughts if one has the time to read through the garbage.

I strongly believe that this is not a cipher (or at least not anything even slightly sophisticated, and definitely not RSA) but rather a transcription of the "spoken" 469, in which decimal numbers (of varying length) correspond to words, syllables, and individual letters, distinct from English. These are the reasons why I think so:

  1. The distribution of digits in the hellgate library books is very far from uniform. A uniform digit distribution would be characteristic of strong encryption; a weak cipher is still possible.
  2. Known sequences repeat quite obviously, which suggests that neither base conversion nor position-dependent encryption is happening (which leaves pretty much only digit-by-digit encryption, i.e. simple substitution?).
  3. Nothing changed when the English name for beholders/bonelords changed.
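To make point 1 concrete, non-uniformity can be quantified with a chi-square statistic against the uniform distribution over the digits 0-9. A minimal sketch (the sample string below is invented for illustration, not actual book text):

```python
from collections import Counter

def digit_chi2(text):
    """Chi-square statistic of the digit frequencies in `text`
    against a uniform distribution over 0-9."""
    counts = Counter(c for c in text if c.isdigit())
    n = sum(counts.values())
    expected = n / 10
    return sum((counts[str(d)] - expected) ** 2 / expected for d in range(10))

# Invented sample; real input would be loaded from the repo's .txt libraries.
sample = "469 469 11 469 1 469 469"
chi2 = digit_chi2(sample)
# With 9 degrees of freedom, values far above ~16.9 reject uniformity at the 5% level.
```

Comparing the statistic for the library books against the same statistic for randomly generated digit strings of equal length would make "very far from uniform" a testable claim.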

The two hard parts are finding out how to delimit codewords (numbers from digits), and what these codewords mean. This may be where the math bonelords talk about gets involved...


Some other points I want to bring up while I'm here

  1. Assuming it's a language, words may have many meanings and translations. For example, we know 1 means "tibia", but it may also mean anything synonymous with "creation", "everything", "good", "gods", "world", "universe", etc.
     1.1. Bonelords don't like 0, which could mean the opposite of 1: "evil", "devil", "nothingness", etc. This doesn't have to be a word that never gets used, and it actually is used in some places.
  2. Why is the text sometimes delimited by spaces and sometimes not? Could it be purely stylistic? Some (historical) languages, for example, omit vowels in written form.
  3. Can we assume that all 469 in Tibia is valid? People IRL brag about knowing it while they more than likely don't; the same might be true in-game. For example, Avar Tar.
  4. What about that single book that's copied in IoK? Could it have been translated? There's this book https://tibia.fandom.com/wiki/Selfmade_Skeletons_(Book), also in the IoK library (the best candidate there from what I found), and it is undeniably about bonelords. Though it may be considered empty... why is it empty? "By BH" probably means "By BeHolder"; it's not unthinkable that it was missed during the translation to "Bonelord". Perhaps there's a better match somewhere else?
  5. Since some books in the hellgate library are partially redundant, could it be that they were at some point (in the lore) fully redundant (say, 2x in total), and over time books disappeared, reducing the redundancy and potentially even corrupting the full message, making it even harder to reconstruct?

Now, about this script and the outputs.

This continues the idea that there are fewer logical books than physical ones. There are some attempts here to combine the books; manual work will probably be required to get this right if that's the correct direction, but some brute forcing never hurts.

The script try_combine_books.py first removes duplicate books and books fully contained in others (as a subsequence); 50 books remain after this step. (One could simulate whether this is a likely outcome of choosing random ranges of characters from one longer sequence, or from several.) Next it forms a graph in which each book is a node, an edge a->b means the suffix of book a overlaps the prefix of book b, and the weight of that edge is the overlap length (these overlaps appear quite chaotic). These graphs are rendered and can be found in the out/graphs directory. The script then takes such a graph and decomposes it into paths, each of which can be merged into a single book (the results below use all edges; it's possible, and probably reasonable, to keep only edges with at least an N-wide overlap). The decomposition is done by a greedy algorithm, so the order in which edges are tried changes the outcome. Three edge orderings are considered right now:
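The core of the overlap-graph construction can be sketched as follows. This is a minimal reimplementation of the idea, not the script's actual code, and the book strings are made up:

```python
def overlap(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def build_overlap_graph(books):
    """Directed graph as a dict: (i, j) -> overlap length, for every
    ordered pair of books with a nonzero suffix/prefix overlap."""
    edges = {}
    for i, a in enumerate(books):
        for j, b in enumerate(books):
            if i != j:
                k = overlap(a, b)
                if k > 0:
                    edges[(i, j)] = k
    return edges

books = ["46911", "91146", "14692"]   # made-up stand-ins for real books
edges = build_overlap_graph(books)
```

Note that even these three toy strings produce four edges of differing weights, which hints at why the real graphs look chaotic.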

  1. By overlap length, descending. This cuts out possibly the maximum number of digits. The result is 9 books, 4 of which are unchanged.
  2. Shuffled randomly, with weights given by a function of overlap length. This allows searching for outputs with far fewer books, but it also shows that there are many ways to get there, so how do we actually find the supposed one right configuration?
  3. Same as 2., but disallowing edges with overlap below N.
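A hypothetical sketch of such a greedy path decomposition (not the script's code; the weighted shuffle uses Efraimidis-Spirakis keys as one possible "function of overlap length", and the toy books and overlaps are hand-made):

```python
import random

def weighted_shuffle(edges, seed=0):
    """Random edge order biased toward longer overlaps (Efraimidis-Spirakis)."""
    rng = random.Random(seed)
    return sorted(edges, key=lambda e: rng.random() ** (1.0 / e[1]), reverse=True)

def greedy_paths(ordered_edges):
    """Try edges in order, giving each node at most one successor and one
    predecessor and rejecting cycles, so chosen edges form disjoint paths."""
    succ, pred = {}, {}
    for (i, j), k in ordered_edges:
        if i in succ or j in pred:
            continue
        node = j                       # walk to the end of j's path...
        while node in succ:
            node = succ[node]
        if node == i:                  # ...and reject edges closing a cycle
            continue
        succ[i], pred[j] = j, i
    return succ, pred

def merge_paths(books, succ, pred, overlaps):
    """Concatenate each path into one book, dropping overlapped prefixes."""
    merged = []
    for start in range(len(books)):
        if start in pred:              # only start from path heads
            continue
        text, node = books[start], start
        while node in succ:
            nxt = succ[node]
            text += books[nxt][overlaps[(node, nxt)]:]
            node = nxt
        merged.append(text)
    return merged

# Toy data with hand-computed overlaps, ordered by descending overlap (method 1).
books = ["46911", "91146", "14692"]
overlaps = {(0, 1): 3, (1, 2): 3, (1, 0): 2, (0, 2): 1}
ordered = sorted(overlaps.items(), key=lambda e: e[1], reverse=True)
succ, pred = greedy_paths(ordered)
merged = merge_paths(books, succ, pred, overlaps)
```

Swapping `sorted(...)` for `weighted_shuffle(...)` gives method 2; filtering `overlaps` by a minimum value before ordering gives method 3.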

Regarding 2 and 3: the best we can do while removing edges with overlap <3 is around 15 resulting books, around 5 of which are unchanged. The best we can do overall is just 2 books, one of which is unchanged; that means 49 books can be combined, via their overlaps, into one by concatenation. All of these outputs are available as .svg renders and .txt libraries (one book per line) in the out directory, and it's trivial to generate more. Note that there are books with no overlap larger than 1 with anything else (after unification).

Open question: is there a Hamiltonian path in this graph? Perhaps even a Hamiltonian cycle? Would it matter?
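With ~50 nodes an exhaustive answer is out of reach in the worst case, but a simple backtracking search can settle the question on small subgraphs, and sparse graphs often prune quickly. A sketch with an invented toy digraph (as far as I know, networkx only ships a Hamiltonian-path routine for tournaments, so something hand-rolled is needed anyway):

```python
def hamiltonian_path_exists(n, adj):
    """Backtracking search for a Hamiltonian path in a digraph with nodes
    0..n-1 and adjacency sets `adj`. Exponential in the worst case."""
    def extend(node, visited):
        if len(visited) == n:
            return True
        return any(nxt not in visited and extend(nxt, visited | {nxt})
                   for nxt in adj.get(node, ()))
    return any(extend(start, {start}) for start in range(n))

# Toy digraph: the path 0 -> 1 -> 2 visits every node.
adj = {0: {1, 2}, 1: {2}}
```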

Ultimately I think that if there's anything to the combining of these books, it will have to be done mostly manually, or at least be supervised by a human. I hope the script will at least be helpful for some statistical work or similar.

The script requires graphviz (the executable and the Python package) and networkx.

s2ward commented 2 years ago

Hi,

I think you bring some very good points around the language, and it seems like we've come to similar conclusions!

The Selfmade Skeletons book is very interesting, as it fits that a bonelord could've written it, except for the detail that it's in English (this could be explained by the original book having been translated). It anyhow seems like BH = BeHolder; IIRC we used to abbreviate beholder to BH a long time ago.

I don't think Avar Tar has valid 469, but around 2020 or so, tibia.org used to display the Avar Tar poem with one number sequence exchanged. So it's actually difficult to tell whether all 469 is valid 469.

Combining the books seems possible, but the end is quite difficult, as you kind of need to make decisions on how to unscramble some books. And all books might not even be related to a single book: there could be two different books that everything combines into, which I've seen some indication of but have not tried.

Whenever we have a scrambled book and 3 other books that are identical yet contain all the sequences in the scrambled book, we can assign a high 'weight' to the 3 identical ones and assume the lone book should be unscrambled to match them, rather than the other way around. I think this is the way to unscramble and get one or two huge lines of actual 469. The difficult part is identifying the scrambled book and its place, because if you get that wrong, all work thereafter will probably be wrong too.

Your attempts at brute-forcing are very interesting, and I'm confident they would help immensely in combining everything into a single book, if that's the right direction. It will take some time to analyze still, but this is great work.

Sopel97 commented 2 years ago

Considering the output of the script, I'm fairly certain there is no way to combine these books into 1 UNLESS there are mistakes that need to be corrected (and I think I've seen at least one digit mentioned to be off in similar strings? That would be from one of the .txts in this repo; I can't find it now). So one approach could be: modify books -> try brute force again -> see what's spit out. To get anything meaningful, it's probably necessary to restrict the minimum overlap to at least 3 (or 4) (see comments in the code), because if we allow smaller overlaps there's a good probability of combining books by random chance, which could completely malform the rest of the text (depending on the formula that determines word/syllable/letter splits). Right now, using a minimum of 3 still leaves a sizable number of books... But maybe not everything is actually relevant (like, maybe there's one big book plus some minor unrelated stuff).
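The "combined by random chance" worry can be quantified with a quick simulation (a sketch; the string length, trial count, and seed are arbitrary choices). For two independent random digit strings, an overlap of at least 1 happens about 10% of the time, while requiring at least 3 drops this to roughly 0.1%:

```python
import random

def overlap(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def chance_overlap_rate(min_overlap, trials=20000, length=30, seed=469):
    """Fraction of random digit-string pairs with overlap >= min_overlap."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = "".join(rng.choice("0123456789") for _ in range(length))
        b = "".join(rng.choice("0123456789") for _ in range(length))
        if overlap(a, b) >= min_overlap:
            hits += 1
    return hits / trials
```

So a minimum overlap of 3-4 really does suppress most accidental joins, which supports restricting the edges that way.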

I might do some manual work on this when I have time again, but first I need to think about a better tool for it (and actually make it...), because a text editor is painful.