jspoto / UnknownArtifact

Some of my findings on this one #8

Open jlouis opened 9 years ago

jlouis commented 9 years ago

I wrote a rather long thing on this one. I'll keep it as an issue, but it coincides well with your work, albeit takes a different path:

Working on the Unknown Artifact sound source

The unknown artifact makes many sounds, but we quickly found an odd "horn" sound in the recording when we increased the playback speed to 200% of its original speed (a speedup factor of 2.00).

Real space probes, like Voyager 1 and 2, have the problems of data travel time and data noise. On the internet, 1 light-second covers all of the Earth with no trouble whatsoever, but a deep space probe is not as lucky. One of the Voyager probes is so far out that its signal takes hours to reach us. Hence, all space probes need to make sure the data they send is not lost if a few bits get flipped or erased in transit.

To that end, they all use so-called error-correcting codes. That is, they send more bits than the message strictly needs, and we can then use those extra bits to correct errors in the signal, or at least detect that the transfer was incorrect. In principle, we could just send the same message 5 times and then vote on each bit, but this is not an efficient way to use the available bandwidth. Hence more complicated encoding schemes were born.
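A minimal sketch of the "send it 5 times and vote" idea, just to make it concrete. The function names `vote_bit` and `decode` are made up for this example, and the received copies are invented test data, not anything from the artifact.

```ocaml
(* Repetition code: the same message is sent several times and the receiver
   takes a majority vote on every bit position. *)

(* Majority vote over the bits found at position [i] in every received copy. *)
let vote_bit (copies : string list) (i : int) : char =
  let ones =
    List.fold_left (fun acc c -> if c.[i] = '1' then acc + 1 else acc) 0 copies
  in
  if 2 * ones > List.length copies then '1' else '0'

(* Decode by voting on each position; all copies must have the same length. *)
let decode (copies : string list) : string =
  match copies with
  | [] -> ""
  | first :: _ -> String.init (String.length first) (vote_bit copies)

let () =
  (* Five copies of "0101011"; three of them arrive with one flipped bit. *)
  let received = [ "0101011"; "0111011"; "0101011"; "0101001"; "1101011" ] in
  print_endline (decode received)
```

The vote recovers the message here, but only by spending five times the bandwidth, which is the inefficiency mentioned above.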

They have names such as Golay codes, Reed-Solomon codes (which are also used in mobile phones), convolutional codes with Viterbi decoding, Hamming codes, and so on. I've been looking through these, but I have not found any code that makes sense for the signal yet, so I'm close to ruling this out as a viable option. The reason is that every artifact found until now emits a certain interesting pattern, but none of those patterns match typical coding schemes. I would expect unknown aliens to use a code pattern of some kind if they were communicating like the golden record on the Voyager/Pioneer space probes, or its cover.

We have collected signals from at least 5 different artifacts:

http://digitalscream.org.uk/audio/

and people have been listening and writing down the high/low tones (I'm using an 'x' where people disagree, a '.' where things were inaudible and a '*' where I think there is a missing entry).

UA1: 010011 0101011 1011001 0100110 0100101 1001010 1001001 0101011 x001001 1100110 1010110
UA2: *11011  100100  011011  110010  010110  010010  100110 0110101 0101100 0110010 1100110
UA3: 101100  101001 0101100 0011011 1011001 0010010 0010100 0100110 0110010 1001011 1010110
UA4: *00110 0011001  001100 0011010 1010110 0010011 0110110 1001101 1001100 1001011 101001.
UA5: 101101  011010  010101 0101011 1001001 0110010 1010110 0101010 0110011 0101001 1101001

This coding scheme has two peculiar observations:

Thus, in the above, the 'x' in UA1 has to be a 1 by the no-3-consecutive rule, since the two bits after it are 00. It acts like a crude error-correcting code like the ones in the space probes, but it certainly doesn't pack as well as a proper block and/or convolutional code like those used in the real space probes. The idea here is that we can sometimes reconstruct a failing signal by using this rule to figure out the bit that must have come before it.
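Here is a small sketch of that reconstruction, assuming the only rule in play is "never three identical bits in a row". The helper names `valid` and `candidates` are hypothetical; the word `x001001` is the one from UA1.

```ocaml
(* True when the word has no three identical symbols in a row. *)
let valid (w : string) : bool =
  let n = String.length w in
  let rec go i =
    i + 2 >= n
    || (not (w.[i] = w.[i + 1] && w.[i + 1] = w.[i + 2]) && go (i + 1))
  in
  go 0

(* Try both bits at the unknown position [i] and keep the valid results. *)
let candidates (w : string) (i : int) : string list =
  List.filter valid
    (List.map
       (fun b -> String.mapi (fun j c -> if j = i then b else c) w)
       [ '0'; '1' ])

let () =
  (* The disputed first tone of the ninth word in UA1. *)
  List.iter print_endline (candidates "x001001" 0)
```

When only one candidate survives, the bit is recovered. When both survive (for example an unknown bit followed by 10), the rule tells us nothing.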

The distinct lack of signal randomness means we have a signal with information in it, and that it has not been encrypted, as encryption would have made the signal appear random. There is at least one structural layer we can pick out. Furthermore, pauses in the signal suggest there are 10-11 7-bit words in the signal, perhaps with some bits left out of the above reconstructions.

The signal also has repeated words. This suggests that there is an even deeper structure to the signal.

We can build up some rules based on the rules observed above:

Otherwise, we break the rule of no-3-consecutive. This immediately raises the question: "how many such 7 digit codes are there?". Lo and behold, an OCaml program later:

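(* Elite Dangerous unknown artifact decoding tools *)
open Core  (* originally Core.Std; newer Core releases use plain Core *)

module CodeCount = struct

  (* Enumerate every 0/1 list of length x with no three equal bits in a row.
     Words are built by prepending, so the head of t is the newest bit. *)
  let code_count x =
    let rec cc t =
      if (List.length t) = x then [t]
      else
        match t with
        | [] -> List.append (cc [0]) (cc [1])
        | [e] -> List.append (cc [0;e]) (cc [1;e])
        (* Two equal bits at the head force the next bit to be the opposite. *)
        | 1 :: 1 :: _ as ts -> cc(0 :: ts)
        | 0 :: 0 :: _ as ts -> cc(1 :: ts)
        | _ :: _ :: _ as ts -> List.append (cc (0 :: ts)) (cc (1 :: ts))
    in
      cc []

end

(* Print the number of valid 6-bit and 7-bit code words. *)
let () =
  let cc6 = List.length (CodeCount.code_count 6) in
  let cc7 = List.length (CodeCount.code_count 7) in
    printf "6: %i\n7: %i\n" cc6 cc7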

This program builds up a list of the possible 0/1 combinations that are valid and then counts them. There are 26 code words of length 6 and 42 code words of length 7. The count for 6 bits is suspicious, since 26 is the number of letters in the Latin alphabet. The count of 42 is interesting, but far from anything we know.
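As a cross-check that does not need Core, the same counts can be reached without enumerating anything. This is only a sketch with a made-up name (`count_valid`): it tracks how many valid words end in a run of length 1 and how many end in a run of length 2, and extends them one bit at a time.

```ocaml
(* Count bit strings of length [n] with no three identical bits in a row.
   run1 = words ending in a run of length 1, run2 = ending in a run of 2.
   Appending the opposite bit always gives a run of 1; appending the same
   bit is only allowed after a run of 1 and gives a run of 2. *)
let count_valid (n : int) : int =
  if n < 1 then invalid_arg "count_valid: n must be >= 1"
  else
    let rec go k run1 run2 =
      if k = n then run1 + run2 else go (k + 1) (run1 + run2) run1
    in
    go 1 2 0

let () =
  Printf.printf "6: %d\n7: %d\n" (count_valid 6) (count_valid 7)
```

This prints 26 and 42 again. Each total is the sum of the previous two (2, 4, 6, 10, 16, 26, 42), so the counts follow a Fibonacci-style recurrence.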

A frequency count on the code words for all UAs 1 through 5 of length 7 yields a count-map which could perhaps be English text (but don't count on it!). There might be an avenue here by guessing this is really a substitution cipher, where a subset of the 42 possible characters in the alphabet is used to write down English text. The most common sequences, each preceded by its count, are:

((2(0 1 0 0 1 1 0))(3(0 1 0 1 0 1 1))(2(0 1 0 1 1 0 0))(2(0 1 1 0 0 1 0))(2(0 1 1 0 1 1))(4(1 0 0 1 0 0 1))(2(1 0 0 1 0 1 1))(2(1 0 0 1 1 0))(2(1 0 0 1 1 0 0))(2(1 0 1 0 0 1))(4(1 0 1 0 1 1 0))(2(1 0 1 1 0 0 1)))

Every other word occurs exactly once. There is definitely something to try here. I tried this, but nothing came of it.
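For anyone who wants to redo the frequency count, a minimal sketch follows. `frequencies` is a name I made up for the example, and it is run only on the UA1 words from the table above (with the 'x' read as 1), so it will not reproduce the full count-map.

```ocaml
(* Count how often each code word occurs. Returns (word, count) pairs,
   most frequent first, ties broken by comparing the words as strings. *)
let frequencies (words : string list) : (string * int) list =
  let table = Hashtbl.create 64 in
  List.iter
    (fun w ->
      let seen =
        match Hashtbl.find_opt table w with Some n -> n | None -> 0
      in
      Hashtbl.replace table w (seen + 1))
    words;
  Hashtbl.fold (fun w n acc -> (w, n) :: acc) table []
  |> List.sort (fun (w1, n1) (w2, n2) ->
         if n1 <> n2 then compare n2 n1 else compare w1 w2)

let () =
  let ua1 =
    [ "010011"; "0101011"; "1011001"; "0100110"; "0100101"; "1001010";
      "1001001"; "0101011"; "1001001"; "1100110"; "1010110" ]
  in
  (* Only show the words that repeat. *)
  List.iter
    (fun (w, n) -> if n > 1 then Printf.printf "%s %d\n" w n)
    (frequencies ua1)
```

Feeding in the words of all five UAs gives the input for a substitution-cipher guess.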

General Structure:

Each UA will output 11 "words", where we know each word to be either 6 or 7 bits in length. If we assume that the 6-bit words have each lost a bit to the fuel scooping noise, this yields 11 x 7 = 77 bits of information before we scoop up the UA again.

If we take each "word" and convert it into a number (the numbers below appear to be assigned in order of first appearance, starting with UA5), we get the following:

UA1: (34 3 23 26 35 36 4 3 4 37 6)
UA2: (28 4 28 29 30 31 11 32 21 5 33)
UA3: (20 19 21 22 23 24 25 26 27 18 6)
UA4: (11 12 13 14 6 15 16 17 13 18 19)
UA5: (0 1 2 3 4 5 6 7 8 9 10)

which shows there are only a few repetitions. Most words are "newly" encountered when we scan through the list. Of course, a problem here might be our ability to figure out what the tones are, so a mistake in tone counting yields a mistake in the above.
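One way to produce this kind of numbering is to give every word the index of its first appearance. The sketch below assumes that scheme and uses a hypothetical function `label`; it starts counting at UA1 only, so the numbers differ from the table above.

```ocaml
(* Label each word with the index of its first appearance in the stream. *)
let label (words : string list) : int list =
  let ids = Hashtbl.create 64 in
  let id_of w =
    match Hashtbl.find_opt ids w with
    | Some id -> id
    | None ->
        (* A new word gets the next unused number. *)
        let id = Hashtbl.length ids in
        Hashtbl.add ids w id;
        id
  in
  (* fold_left keeps the left-to-right order explicit. *)
  List.rev (List.fold_left (fun acc w -> id_of w :: acc) [] words)

let () =
  let ua1 =
    [ "010011"; "0101011"; "1011001"; "0100110"; "0100101"; "1001010";
      "1001001"; "0101011"; "1001001"; "1100110"; "1010110" ]
  in
  print_endline (String.concat " " (List.map string_of_int (label ua1)))
```

Repeated numbers in the output mark repeated words, which makes a mistake in tone counting show up as an unexpected new number.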

Word structure:

Words almost have a peculiar cyclic structure, which is common in cyclic codes: if we "shift" a bit-pattern, it is still valid. Suppose we have "0101011" and we take the '1' from the end and cycle it to the front, giving "1010101". This is also a valid code. Many error-correcting codes are cyclic in nature, so this is indeed an interesting property of the numbers. There are some important counter-examples to this though:

0101100 cycles to 1011000
0010010 cycles to 0001001

Neither result is valid, because each has 3 consecutive 0s. So this is probably moot.
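The rotations above are easy to check mechanically. A small sketch, with hypothetical helpers `rotate_left`, `rotate_right` and `valid` (the no-3-consecutive test):

```ocaml
(* True when the word has no three identical symbols in a row. *)
let valid (w : string) : bool =
  let n = String.length w in
  let rec go i =
    i + 2 >= n
    || (not (w.[i] = w.[i + 1] && w.[i + 1] = w.[i + 2]) && go (i + 1))
  in
  go 0

(* Move the first bit to the end. *)
let rotate_left (w : string) : string =
  let n = String.length w in
  if n < 2 then w else (String.sub w 1 (n - 1) ^ String.sub w 0 1)

(* Move the last bit to the front. *)
let rotate_right (w : string) : string =
  let n = String.length w in
  if n < 2 then w else (String.sub w (n - 1) 1 ^ String.sub w 0 (n - 1))

let report (before : string) (after : string) : unit =
  Printf.printf "%s cycles to %s (%s)\n" before after
    (if valid after then "valid" else "invalid")

let () =
  report "0101011" (rotate_right "0101011");
  report "0101100" (rotate_left "0101100");
  report "0010010" (rotate_right "0010010")
```

The first rotation stays valid, the other two produce 000 and are rejected, matching the counter-examples listed above.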

Words commonly have three 0s and four 1s, or vice versa. But there are important counter-examples to this as well:

0010010, 0010100 (from UA3)

both have a 2,5 split (two 1s and five 0s). The rule is very close to holding, though, so I advise relistening to the UA3 recording to make sure we get the number of 1s right on that recording, especially because the numbers 3 and 4 are known to be important to Thargoids according to history.
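A quick sketch for finding the words that break the 3/4 split, so we know which tones to relisten to. The names `ones` and `off_split` are invented for the example; the input is the UA3 row from the table above.

```ocaml
(* Number of '1' symbols in a word. *)
let ones (w : string) : int =
  let n = ref 0 in
  String.iter (fun c -> if c = '1' then incr n) w;
  !n

(* 7-bit words whose 1/0 split is neither 3/4 nor 4/3. *)
let off_split (words : string list) : string list =
  List.filter
    (fun w -> String.length w = 7 && ones w <> 3 && ones w <> 4)
    words

let () =
  let ua3 =
    [ "101100"; "101001"; "0101100"; "0011011"; "1011001"; "0010010";
      "0010100"; "0100110"; "0110010"; "1001011"; "1010110" ]
  in
  List.iter
    (fun w ->
      Printf.printf "%s has %d ones and %d zeros\n" w (ones w) (7 - ones w))
    (off_split ua3)
```

On the UA3 row this reports only 0010010 and 0010100. The 6-bit words are skipped, since a missing bit makes their split unknowable.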

CMDR jlouis

jlouis commented 9 years ago

Spectral analysis

Plotting pairs of points which are next to each other, hoping for patterns, but nothing showed up:

[image: 2d-spectrum]

jspoto commented 9 years ago

Interesting, and some great info.

The parity bit you describe above is essentially a NAND of the two leading bits, for anyone looking to implement such a scheme. I hadn't thought about it until you mentioned it, but indeed there could be a parity bit that conforms to the "no runs of 3 or more" rule.
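To illustrate the NAND idea, here is a sketch that keeps appending the NAND of the two most recent bits. Treating "the two leading bits" as the two bits just before the new one is my assumption, and `nand_bit` and `extend` are made-up names.

```ocaml
(* NAND on '0'/'1' characters: the result is '0' only when both are '1'. *)
let nand_bit (a : char) (b : char) : char =
  if a = '1' && b = '1' then '0' else '1'

(* Append [k] bits, each the NAND of the two bits before it.
   After 11 this forces a 0 and after 00 it forces a 1, so no run of
   three can form; after 01 or 10 it simply picks 1. *)
let rec extend (s : string) (k : int) : string =
  let n = String.length s in
  if k = 0 || n < 2 then s
  else extend (s ^ String.make 1 (nand_bit s.[n - 2] s.[n - 1])) (k - 1)

let () = print_endline (extend "00" 5)
```

Note that the NAND only matches the observed rule in the forced cases (00 and 11). After a mixed pair the signal can go either way, so NAND is one valid choice, not the only one.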

jlouis commented 9 years ago

http://www.reddit.com/r/EliteDangerous/comments/35c4oc/possible_uknown_artifact_decoding_breakthrough/

You may want to read this. It is a possible idea which juuust adds up enough for it to be interesting :)

jspoto commented 9 years ago

Hi Jlouis-

I saw that post, but didn't have time to comment. I thought there were some really good insights into the Fib sequence stuff...

One issue with such an encoding, however, is that encodings are not unique: there are many ways to represent a value.

Interestingly, barcode formats may rely on a similar principle, for example Interleaved 2 of 5 (http://en.wikipedia.org/wiki/Interleaved_2_of_5), which even encodes numbers in pairs using the whitespace (complement). There are parallels with several things you call out.

Given this, I think the Fib connection was a mathematical insight involving the combinatorial possibilities, but perhaps that's all. But who knows, it may pay to keep those ideas handy...

I've upgraded the script (and data), and have begun a simple sequencing module as well, to help find repeating patterns. More data will help... we'll see what we find.

One thing I've found interesting is that the "glyph" distribution doesn't appear to be even. But our sample may just be too small at present... If this continues with more data, then that would be another piece of evidence indicating a non-random signal.

Ultimately, we need to find a repeating pattern if possible. If not, then it means that the signal may have structure, but not meaning. Or that it's encrypted, which is wholly possible, but would add a virtually unassailable level of difficulty in cracking this.

jlouis commented 9 years ago

Yeah, it didn't pan out too well however. It matches, but this is probably not the encoding. I've also considered Manchester encoding, but it kind of breaks down since there are 7 bits in each group, and if we take it as a stream, we have a 000 in there, which kind of messes that idea up.
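A sketch of why Manchester does not fit, assuming the IEEE 802.3 style convention where the pair "01" carries a 1 and "10" carries a 0 (the opposite convention also exists). `manchester_decode` is a made-up name, and the decoder assumes the stream starts on a pair boundary.

```ocaml
(* Decode a Manchester-coded bit string, two symbols per data bit.
   Returns None for an odd-length stream or for a "00"/"11" pair. *)
let manchester_decode (s : string) : string option =
  let n = String.length s in
  if n mod 2 <> 0 then None
  else
    let buf = Buffer.create (n / 2) in
    let rec go i =
      if i >= n then Some (Buffer.contents buf)
      else
        match (s.[i], s.[i + 1]) with
        | '0', '1' -> Buffer.add_char buf '1'; go (i + 2)
        | '1', '0' -> Buffer.add_char buf '0'; go (i + 2)
        | _ -> None
    in
    go 0

let show (s : string) : unit =
  match manchester_decode s with
  | Some bits -> Printf.printf "%s -> %s\n" s bits
  | None -> Printf.printf "%s -> not valid Manchester\n" s

let () =
  show "011001";   (* pairs 01 10 01 *)
  show "0101011";  (* a 7-bit UA word: odd length *)
  show "00011001"  (* contains 000, so some pair must be 00 *)
```

A valid Manchester stream never holds the same level for more than two symbols, which is why a 000 anywhere in the stream rules it out regardless of where the pair boundaries fall.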

I think I'll try some basic bitwise analysis instead to see if the data looks random at that level.