joweich / chat-miner

Parsers and visualizations for chats
MIT License
567 stars 56 forks source link

Refactor telegram parser #97

Closed KianKhadempour closed 1 year ago

KianKhadempour commented 1 year ago

I am currently adding typing to the files to make it easier to use chat-miner. While doing this I read (for the first time) the code for the Telegram parser. The _read_raw_messages_from_file method is really good (other than a potential spot for a tqdm) but the _parse_message method is littered with assert and isinstance. In addition, I don't really understand what the code is doing.

For reference, here is the current code:

def _parse_message(self, mess: dict[str, Any]):
        if "from" in mess and "text" in mess:
            if isinstance(mess["text"], str):
                body = mess["text"]
            elif isinstance(mess["text"], list):
                assert all(
                    [
                        (isinstance(m, dict) and "text" in m) or isinstance(m, str)
                        for m in mess["text"]
                    ]
                )
                body = " ".join(
                    map(
                        lambda m: m["text"] if isinstance(m, dict) else m,
                        mess["text"],
                    )
                )
            else:
                raise ValueError(f"Unable to parse type {type(mess['text'])} in {mess}")

            time = dt.datetime.fromtimestamp(int(mess["date_unixtime"]))
            author = mess["from"]
            return ParsedMessage(time, author, body)
        # NOTE: I changed the default return value to None instead of False to standardize the failure case
        return None

I propose that someone who knows how the Telegram JSON file is structured tries to rewrite this method (perhaps @galatolofederico) so that it is more readable, type-safe (not actually but it makes it harder to make mistakes), and pythonic.

joweich commented 1 year ago

@BlueishTint I agree that this needs some refactoring. The available formats of telegram exports are referenced in #26

joweich commented 1 year ago

@BlueishTint fyi I refactored the telegram parser just now.

KianKhadempour commented 11 months ago

I didn't see this until now but thanks! It looks much better.