Decoding test - Githubissues

void4 commented 3 years ago

Has anyone ever tested whether the cosmicos message can be successfully decoded by a single person or group (possibly of experts, like computer scientists), with no prior knowledge of its structure or content?

paulfitz commented 3 years ago

Not as far as I know! It would be a fun test though. Other messages in the past have been tried out informally in newsgroups and the like. There was a proposal a while back to get a bit more organized about this https://arxiv.org/pdf/1101.4968.pdf (section 5). I enjoyed decoding that team's message (a bit hard to find online now? I archived it at https://github.com/paulfitz/dearet/blob/master/Test_Message_2p1.txt) though I definitely had to go revise some more physics to do it.

void4 commented 3 years ago

I have been purposely not reading much about the cosmicos message and try to find structure in the raw data sometimes, and it turns out you may have to "idiot-proof" messages sent to ~~stupid~~ less intelligent civilizations (consisting solely of the likes of me for example) :smile:

It might also be worth looking into Natural Semantic Metalanguage, which, while it may not be a true theory, might help create structures that describe and differentiate semantically more complex concepts using simpler ones.

Btw, is the base-4 message in the second tab on the site really equivalent to the one using the symbols in the first tab?

Pavgran commented 2 years ago

I'm in the process of decoding the message. Unfortunately, I do have prior knowledge as I've read a bit about the structure and contents before trying to decode.

I've started with the wrapped.txt file and first tried to analyze the frequency of individual symbols, of all pairs, all triplets and all fours. That didn't give much. After that, I've tried to spot some patterns in the message itself. The first thing I noted is a big chunk consisting almost entirely of 0s and 1s. I couldn't decipher anything from it, and I decided to take a close look at the start.

After a bit of searching I've spotted the pattern of repeating 321s with something in between. I wrote a little program to tell me what are the lengths of runs of 321. I've spotted that some of the runs are of increasing length. After splitting the message by lines with a newline after every run, I saw a clear structure of small messages about numbers. There were a few undecipherable messages but the ones about numbers were first about sequential numbers, then about squares, then about primes.

After that I've tried to find how can I tell these messages apart without the help of numbers. After a bit of shuffling prefixes and suffixes of messages around, I've spotted the 32233 pattern that looked like a separator. After splitting the whole message by that, I saw that the idea was probably right.

I then tried to align same parts of different messages to identify the structure. After a bit of fiddling, I've spotted that chunks of numbers are preceded by 21001032xxxxxx, and that all the messages about numbers start by 210111132xxxxxx. After that, I've spotted the general pattern of 2xxxxx, the x's being 0s or 1s. It looked like these 2xxxxx were kind of labels, and so I decided to try to attach some meaningful names to them.

I've quickly spotted that 210010 probably meant 'define', 2110000 probably meant 'is a number', 2110001 'is a square' and 2110010 'is a prime'.

The next parts were having different overall structure but the numbers were the same. Here I thought that probably the correct prefix for number was 2111 that meant 'number'. Through the still-cryptic structure I managed to see that the next messages were about equality and order relations so I quickly assigned the meaning to symbols 211, 2100 and 210. The next part had a clear structure of 3 messages, the first one being the repeat of one of the previous messages and the second two were substituting the relation with another one and attaching a 2101 label before that. I though that it may be a 'not'.

I kinda blazed through the part about +, - and * as there were a lot of examples that just confirmed my first thought.

After that something strange appeared. It looked like symbols were equated to numbers. I've somewhat quickly realized after reverting back to 2xxxx representation that that was actually equating numbers to their binary representation. A bit further there was even an arithmetic on binary numbers and that was confusing as some of the numbers looked like they were actual binary numbers and some of the numbers were symbols.

I've tried to decipher the parts of the structure I've skipped before but there was no clear explanations. It looked like 2 preceded symbols and 3 was some kind of separator, but there were some weird things like there being a 2 that was not followed by binary, binary 1 not preceded by 2 and other strange things.

I've decided to go on and faced a wall of some hard-to-understand arithmetic examples. They started simple with = 6 6, but quickly escalated to some cryptic structure although the overall meaning was kinda clear.

I don't know how exactly I've come up with the idea of 2 and 3 being parentheses. Maybe it was influenced a lot by the prior knowledge, maybe it wasn't, but after I replaced the 2 with '(' and 3 with ')' I immediately saw that that was the right idea. The structure looked sensible apart from out-of-place '1()' sometimes. And that's when I realized that the message separator was not 32233 but 2233 meaning '(())'.

I plan to continue to decode the message after that.

TL;DR I tried to decode with some prior knowledge. The message allowed for iterative deciphering with simultaneously advancing through new messages and fixing the holes in previous descriptions. I wouldn't say that was easy, although it was doable. But I've deciphered only the very beginning of the message so far.

Final thoughts: The more structure there is, the better. Very regular beginning with the numbers going up one by one helped a lot. Regular structure with grouping different kinds of facts about numbers helped in mentally separating different blocks. Very regular introduction of comparison operators and multiplication helped in quickly formulating the first though and confirming it was correct. Introducing 'not' with one true statement and two false statements about the same numbers was crucial to understand what's going on. I think there is not enough introduction on the parenthesis structure of the message. There are some cryptic constructions like '1()' that are very hard to understand when you've not even understood that 2 and 3 are actually parentheses. Probably it's a good idea to hint about parentheses with, for example, the first few Catalan numbers and/or more examples (and more structure in examples) in the 'alternative groupings' section. I don't know if the next sections are still easy to comprehend without the understanding of the parenthesis structure of the message.

paulfitz commented 2 years ago

Yay @Pavgran, thanks for trying, and the detailed post! Good point about parentheses. There's some needlessly cryptic constructions, I've been meaning to remap the message into a larger "symbol space" so component symbols are more distinct.

joha2 commented 2 years ago

@Pavgran nice comment on decoding cosmicos! @paulfitz sorry for chiming in: the real problem seems not to be the small alphabet (two delimiter symbols + two data symbols), but the strange syntax sometimes. I.e. the less syntax elements, the better.

At the end the main problem is, that you only have a linear message string and want to transport multi dimensional data (lists of lists of lists of ...) in a first run and in a second run code in the former structure. So, for the syntax maybe it is best to stay within the Lisp/Scheme philosophy "code as data" and use S-expressions for which one variant of the message exists already (this means, we also need the two delimiter symbols). I'm not sure how to convince ET that the two symbols are in fact delimiters as @Pavgran wrote.

Another thing is the ambiguity of binary representation of symbols and numbers, but I think this is no question of the alphabet, but of the encoding of the message data (maybe the prefixfree codes in #30 could help? Or one could use the binary equivalent of a random UUID during code compilation?). Does this make sense? What do you all think?

paulfitz / cosmicos

Decoding test #31