knadh / tg-archive

A tool for exporting Telegram group chats into static websites like mailing list archives.
MIT License
834 stars, 124 forks

Store message content as HTML #50

Open faraazb opened 2 years ago

faraazb commented 2 years ago

Fixes #43

Messages are stored in the database as HTML. This preserves formatting such as bold, italic, underline, strikethrough, monospace and inline links. Telegram links in a message to other messages (t.me/group/message_id) are replaced with their archival site version. For example, t.me/example_group/12 becomes example_group/site/2022-02.html#12.
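As a rough illustration of the link rewriting described above, here is a minimal sketch. The function name `rewrite_group_links`, the `message_dates` lookup (message ID to timestamp) and the regular expression are assumptions for the example, not the PR's actual code. Only the target URL shape (`example_group/site/2022-02.html#12`) comes from the description.

```python
import re
from datetime import datetime

# Matches t.me/<group>/<message_id>, with or without a scheme.
TME_LINK = re.compile(
    r"(?:https?://)?t\.me/(?P<group>[A-Za-z0-9_]+)/(?P<msg_id>\d+)"
)


def rewrite_group_links(html, group, message_dates):
    """Replace links to messages in `group` with archive-site links.

    `message_dates` is a hypothetical mapping of message ID -> datetime,
    needed because the archive page is named after the message's month.
    """
    def _sub(match):
        if match.group("group") != group:
            return match.group(0)  # link to another group: leave as-is
        msg_id = int(match.group("msg_id"))
        date = message_dates.get(msg_id)
        if date is None:
            return match.group(0)  # message not archived: leave as-is
        return f"{group}/site/{date:%Y-%m}.html#{msg_id}"

    return TME_LINK.sub(_sub, html)


dates = {12: datetime(2022, 2, 5)}
print(rewrite_group_links("see t.me/example_group/12", "example_group", dates))
```

Links to other groups, or to messages that are not in the lookup, are left untouched in this sketch.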

knadh commented 2 years ago

Thanks. Will test this soon.

Farzat07 commented 2 years ago

I tried this and it works, but I think the HTML template file should be updated to reflect the changes, as the HTML elements are not rendered.

The rss template though seems to be working just fine for now.

Farzat07 commented 2 years ago

Actually, never mind - I was using the old template for the HTML website. The new template works just fine.

knadh commented 2 years ago

Sorry, just got a chance to look at this. URLs aren't being rendered as hyperlinks anymore.

Fresh site created using --new with this PR: image

Current master:

image

Farzat07 commented 2 years ago

Are you sure you deleted the database and then synced again? Because otherwise you would be just applying the new code/template on the old raw text messages.

knadh commented 2 years ago

I used an existing database, but that shouldn't break existing links on existing installations. Re-syncing large channels may be impractical.

  1. replace_msg_link() can be renamed to urlize() (like in the current version) and it can continue to convert non-<a> URLs to links along with replacing Telegram group links like it is doing right now.

  2. This PR also involves changing the template, which means all existing installations will break after an upgrade, which isn't ideal. Have to come up with a way to avoid this.
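To make point 1 concrete, here is a minimal sketch of a `urlize()` that linkifies bare URLs while leaving existing `<a>` elements alone. It is an assumption-laden illustration, not the project's implementation: it uses regular expressions rather than an HTML parser, assumes the stored HTML only contains simple, well-formed tags, and does not trim trailing punctuation from URLs.

```python
import re

# Existing <a>...</a> elements and any other tags are skipped, so URLs
# inside them (link text, attributes) are never wrapped a second time.
_SKIP = re.compile(r"(<a\b[^>]*>.*?</a>|<[^>]+>)", re.I | re.S)
_URL = re.compile(r"https?://[^\s<>\"']+")


def urlize(html):
    """Wrap bare http(s) URLs in <a> tags, leaving existing links untouched."""
    # re.split with one capturing group alternates text, tag, text, tag, ...
    # so the even indices are plain text between tags.
    parts = _SKIP.split(html)
    for i in range(0, len(parts), 2):
        parts[i] = _URL.sub(
            lambda m: f'<a href="{m.group(0)}">{m.group(0)}</a>', parts[i]
        )
    return "".join(parts)


print(urlize('see https://example.com and <a href="https://t.me/x/1">this</a>'))
```

The Telegram link replacement could run in the same pass, which is what renaming `replace_msg_link()` to `urlize()` would amount to.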

faraazb commented 2 years ago

I agree, starting over with large channels seems impractical. I could be missing something, but as I understand it: raw text should not be rendered without escaping, HTML cannot be escaped, and I don't think it is possible to differentiate between raw text and HTML. So I am unable to write a generalized urlize(). This would also lead to both HTML and raw text being stored in the database, which doesn't sound nice to me. I think we can have a 'formatted-message' config which is True by default, so that new sites preserve message formatting and the existing ones do not break. I will make the change and test it out. What do you guys think?

Farzat07 commented 2 years ago

If we make such a setting, I believe it should be set to True by default in NEW configurations by adding it to the example config.yaml file. However, if the setting does not exist in the config.yaml file (i.e. the site was started with an old version), it should be assumed to be False.

knadh commented 2 years ago

Yeah, an html_messages: true which is by default turned on for new setups should be fine.
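The default behaviour agreed on here can be sketched as follows. The helper name `html_messages_enabled` and the plain-dict config are assumptions for illustration. The real project loads its settings from config.yaml, and only the `html_messages` key comes from the discussion.

```python
# What a freshly generated example config would contain (new setups).
EXAMPLE_CONFIG = {"html_messages": True}


def html_messages_enabled(config):
    """Return the setting, treating a missing key as False.

    A missing key means the config predates the option, so the existing
    site keeps its old plain-text behaviour.
    """
    return bool(config.get("html_messages", False))


print(html_messages_enabled(EXAMPLE_CONFIG))  # new setup
print(html_messages_enabled({}))              # config from an older version
```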

faraazb commented 2 years ago

Thanks for the feedback! Have made the change.

knadh commented 2 years ago

Almost works! One last quirk. Syncing with html_messages: True saves HTML in the DB. If you then set it to False and rebuild the site, the HTML tags render as plaintext.

image

Farzat07 commented 2 years ago

Well that makes sense because the setting is meant to be constant; otherwise normally all projects should be set to use the html one. The real point of the setting is to not break previous setups by preventing a mix of text and html messages.

If this behaviour is confusing, one solution would be to add a description next to the option explaining its nature and that it should not be changed after the first sync.

Another solution would be to remove the option entirely, and pull all new messages as html, regardless of the previous ones. Then, when generating the html/rss templates, check each message to see if it is text or html, and handle it accordingly. I believe a good check would be to check the datatype, but really any method should suffice.

Actually now that I think about it, the second solution makes way more sense but for some reason I didn't think of it before.

knadh commented 2 years ago

Yep, the second option is better. The True/False should only affect rendering HTML or escaped text.
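One possible reading of "should only affect rendering" is sketched below: the stored content is never changed, and the flag only decides whether it is emitted as-is or escaped at render time. The function name `render_message` is hypothetical, and the sketch does not solve the separate problem of telling old raw-text rows apart from HTML rows.

```python
import html


def render_message(content, html_messages):
    """Return the string to place in the template for one message."""
    if html_messages:
        # Stored content is trusted to be HTML and is emitted unchanged.
        return content
    # Otherwise treat it as raw text and escape it before rendering.
    return html.escape(content)


print(render_message("<b>hi</b>", True))
print(render_message("<b>hi</b>", False))
```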

faraazb commented 2 years ago

Thanks @farzat, but with the option you suggest we need to determine whether a message is raw text or HTML. Consider these example messages:

  1. Githubissues.
  2. Githubissues is a development platform for aggregating issues.