muccc / iridium-toolkit

A set of tools to parse Iridium frames

Non ascii characters in SBD mode #86

Closed alphapats closed 1 year ago

alphapats commented 2 years ago

In 'sbd' mode, reassembler.py prints bytes in the ASCII range as characters and encodes everything else as hex escapes. This leaves most SBD data garbled and unreadable. The snippet from util.py that converts integer values to ASCII characters is:

    if( c>=32 and c<127):
        str1+=chr(c)

I investigated these hex values and found that they correspond to characters from other languages such as Arabic and French. I tried replacing them by hand like this:

    str = str.replace(r'\x{e2}\x{80}\x{99}',"'")
    str = str.replace(r'\x{e2}\x{80}\x{a6}',"…")
    str = str.replace(r'\x{f4}',"ô")
    str = str.replace(r'\x{c0}','À')
    str = str.replace(r'\x{c7}',"Ç")
    str = str.replace(r'\x{ea}',"ê")
    str = str.replace(r'\x{f9}',"ù")
    str = str.replace(r'\x{80}',"€")
    str = str.replace(r'\x{20}\x{A3}',"₣")
    str = str.replace(r'\x{c2}',"Â")
    str = str.replace(r'\x{e8}',"è")
    str = str.replace(r'\x{c9}',"É")
    str = str.replace(r'\x{ca}',"Ê")

How can we modify this code to display non-ASCII characters (French or Arabic)? Replacing each non-ASCII hex value with its character by hand is very time-consuming. Is there an efficient way to convert these values to the corresponding non-English characters?

alphapats commented 2 years ago

I have modified the code in util.py to handle Arabic, French, punctuation, Roman numerals, Hindi, etc.:

    for c in data:
        if mask:
            c=c&0x7f
        if(c>=32 and c<126):
            str1+=chr(c)
        #elif( c in [128,130,132,135,136,137,138,139,145,146,147,148,149,152,153,154]):
        #    str1+=chr(c)
        elif c in [233, 224, 232, 249, 226, 234, 238, 244, 251, 231, 235, 239, 252]: #french
            str1+=chr(c)
            #print('french')
        elif(c>=8208 and c<=8231): #punctuation
            str1+=chr(c)
        elif(c>=8240 and c<=8231): #punctuation
            str1+=chr(c)
        elif(c>=8308 and c<=8334): #superscript
            str1+=chr(c)
        elif(c>=8531 and c<=8579): #roman
            str1+=chr(c)
        elif (c >= 1569 and c<=1791): #arabic
            str1+=chr(c)
        elif (c>=3840 and c<=4047): #tibetan
            str1+=chr(c)
        elif (c>=8528 and c<=8579): #number
            str1+=chr(c)
        elif (c>=4096 and c<=4185):#mynamar
            str1+=chr(c)
        elif(c>=2305 and c<=2416): #hindi
            str1+=chr(c)
        elif(c>=3584 and c<=3675): #thai
            str1+=chr(c)
        elif(c>=880 and c<=1011): #greek
            str1+=chr(c)
        elif(c>=3458 and c<=3572): #sinhala
            str1+=chr(c)
        elif(c>=8448 and c<=8506): #letterlikesymbol
            str1+=chr(c)
        else:
            if dot:
                str1+="."
            elif escape:
                if c==0x0d:
                    str1+='\\r'
                elif c==0x0a:
                    str1+='\\n'
                else:
                    str1+='\\x{%02x}'%c
            else:
                str1+="[%02x]"%c

Sec42 commented 2 years ago

Hi,

SBD data is M2M (machine-to-machine) communication, so most of it will be binary. Without knowledge of the protocol and/or the participating endpoints, it is difficult to understand.

I don't think blindly printing characters will help with understanding these protocols.

If you have concrete examples where this change helps understanding a protocol, please let me know.

alphapats commented 2 years ago

I have received a few Short Burst Data messages when using -m sbd. They contain message content sent from one terminal to another in SBD mode. If the content is ASCII, it is readable as English. If the message was sent in some other language, hex values are printed instead:

    04-06-2022T17:39:46,DL,<26:02:5b:01:00:47:96>,\x{87})C*\x{d9}#I\x{e2}€\x{99}ll check now. Yesterday was 118Q\x{01}R\x{01}U\x{d3}\x{00}\x{00}\x{01}\x{81}.\x{9e}\x{97}\x{bb}C\x{c4}\x{06}\x{13}\x{07}i\x{04}\x{83}O@\x{c4}\x{06}\x{17} pSx\x{8f}
    04-06-2022T17:44:08,DL,<26:02:5c:02:00:19:cf>,\x{87})C*\x{d9}\x{d1}145 opened 21 clicked on various links but some of those links were to Wikipedia.. so approx 15 clicked on actual trips. 2 unsubscribed. I guess we will not know the results until you can check your mailbox Q\x{01}R\x{01}U\x{d3}\x{00}\x{00}\x{01}\x{81}.\x{a2}J\x{b4}C\x{c4}\x{06}\x{13}\x{07}i\x{04}\x{83}O@\x{c4}\x{06}\x{17} pSx\x{8f}

The examples above are in English, but I also found some messages in French/Spanish. Those were readable only where the French/Spanish characters are shared with English (i.e. fall in the ASCII range); the rest showed up as hex. I replaced each hex value with its corresponding French/Spanish character and was able to recover the complete message.

PS: Out of 200-300 messages, only 5-10 contain readable text. All the rest come out as hex.

Sec42 commented 1 year ago

I understand where you're coming from. Unfortunately, without knowing the code page/encoding, mappings like these will just amount to guessing.

Case in point: most of your code references code points > 255, which can't happen since the message is parsed byte-wise.
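To illustrate the byte-wise point, here is a minimal sketch. It assumes the payload text is UTF-8 (suggested by the `\x{e2}\x{80}\x{99}` sequence quoted earlier in the thread, but not confirmed); the variable names are made up for the example.

```python
# U+2019 (RIGHT SINGLE QUOTATION MARK) has code point 8217, which lies in the
# 8208..8231 "punctuation" range tested in the modified loop.
text = "\u2019"
raw = text.encode("utf-8")      # assumption: the sender used UTF-8

# Iterating over bytes yields one integer per byte, each in 0..255.
byte_values = list(raw)

# None of the individual bytes can ever satisfy a test for values above 255.
unreachable_hits = [b for b in byte_values if 8208 <= b <= 8231]

print(byte_values)        # the three UTF-8 bytes of the character
print(unreachable_hits)   # empty: the range check never matches a single byte
```

The character only reappears once the three bytes are decoded together, which is exactly the multi-byte decoding the loop does not do.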

Your decoding of the "french" characters works more or less by accident, since the iso-8859-1 standard (which is what I guess is being used in your case) matches the first 256 characters of unicode (which is what chr() uses).
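A short check of that claim, as a sketch: the list of values is copied from the "french" branch above, and the cp1252 comparison is only an illustration of how a different code-page guess changes the result, not a statement about what the sender actually used.

```python
# chr(c) and ISO-8859-1 decoding agree for every possible byte value.
latin1_matches_chr = all(
    bytes([c]).decode("iso-8859-1") == chr(c) for c in range(256)
)

# The byte values from the "french" branch, rendered via chr().
french_values = [233, 224, 232, 249, 226, 234, 238, 244, 251, 231, 235, 239, 252]
french_text = "".join(chr(c) for c in french_values)

# A different guess gives a different answer for the same byte:
# 0x80 is the euro sign in Windows-1252 but a control character in ISO-8859-1.
euro_cp1252 = bytes([0x80]).decode("cp1252")
ctrl_latin1 = bytes([0x80]).decode("iso-8859-1")

print(latin1_matches_chr, french_text, repr(euro_cp1252), repr(ctrl_latin1))
```

The `\x{80}` to "€" replacement in the first post therefore implies Windows-1252 rather than ISO-8859-1, which shows how quickly such mappings turn into guesswork.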

I guess decoding/displaying the accented characters of iso-8859-1 would not do much harm, and just be mildly confusing. I'll test it for a bit & see how I feel about it.

However, implementing speculative decoding of UTF-8 (or other multi-byte encodings) is definitely out of scope here.
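For anyone who wants to experiment outside the toolkit, the escaped output can be post-processed instead of patching the decoder. The following is a rough sketch, not part of iridium-toolkit: `unescape_to_bytes` and `decode_guess` are hypothetical helpers, and it assumes unmodified escape-mode output containing only printable ASCII plus the `\x{hh}`, `\r` and `\n` escapes shown in this thread. A literal backslash in the payload would be ambiguous under that assumption.

```python
import re

# Escape forms seen in this thread: \x{hh}, \r and \n.
_ESCAPE = re.compile(r"\\x\{([0-9a-fA-F]{2})\}|\\([rn])")


def unescape_to_bytes(text: str) -> bytes:
    """Turn escaped text back into raw bytes (hypothetical helper)."""
    out = bytearray()
    pos = 0
    for m in _ESCAPE.finditer(text):
        # Unescaped stretches are assumed to be plain printable ASCII.
        out += text[pos:m.start()].encode("ascii")
        if m.group(1) is not None:
            out.append(int(m.group(1), 16))
        else:
            out.append(0x0D if m.group(2) == "r" else 0x0A)
        pos = m.end()
    out += text[pos:].encode("ascii")
    return bytes(out)


def decode_guess(raw: bytes) -> str:
    """Try strict UTF-8 first, then fall back to ISO-8859-1 (a guess)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")


fragment = unescape_to_bytes(r"I\x{e2}\x{80}\x{99}ll check now")
print(decode_guess(fragment))
```

This only makes sense on a fragment already known to be text. A whole SBD message with binary framing around it will normally fail the UTF-8 attempt and fall back to ISO-8859-1, so the text portion has to be isolated first, and the result remains a guess at the encoding.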