mlomb / chat-analytics

Generate interactive, beautiful and insightful chat analysis reports
https://chatanalytics.app
GNU Affero General Public License v3.0
711 stars 51 forks

[Feature request] Compress text using the letters they can contain #23

Closed hopperelec closed 1 year ago

hopperelec commented 1 year ago

I haven't looked very far into this yet, but I believe all text is stored in the database as Unicode, even though most fields can't contain every Unicode character (most only need 6 bits per character, whereas UTF-8 needs at least 8). For example, Discord user avatars can only contain lowercase letters or digits (36 characters, so 6 bits). Domain names can only contain letters, digits, hyphens and periods (special characters are produced with domain-specific codes, i.e. Punycode, so only 64 characters, or 6 bits). I believe emoji names can only contain letters, spaces and colons (54 characters, or 6 bits); the colons only appear at the start and end, so they could be dropped for further compression.
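
As a rough illustration of the idea (not code from chat-analytics), here is a minimal sketch of 6-bit packing in TypeScript. The names `ALPHABET`, `pack6` and `unpack6` are hypothetical, and the alphabet simply follows the "lowercase letters or digits" assumption above. The character count has to be stored separately, since the last byte may contain padding bits.

```typescript
// Hypothetical alphabet: lowercase letters plus digits (36 symbols, fits in 6 bits).
const ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789";

// Pack each character into 6 bits instead of the 8 bits UTF-8 uses for ASCII.
// Any alphabet of up to 64 symbols works with this scheme.
function pack6(text: string, alphabet: string = ALPHABET): Uint8Array {
    const out = new Uint8Array(Math.ceil((text.length * 6) / 8));
    let bitPos = 0;
    for (let n = 0; n < text.length; n++) {
        const code = alphabet.indexOf(text.charAt(n));
        if (code < 0) throw new Error("character not in alphabet: " + text.charAt(n));
        // Write the 6 bits of `code`, most significant bit first.
        for (let i = 5; i >= 0; i--) {
            if ((code >> i) & 1) out[bitPos >> 3] |= 0x80 >> (bitPos & 7);
            bitPos++;
        }
    }
    return out;
}

// Reverse of pack6. `length` is the number of characters that were packed.
function unpack6(bytes: Uint8Array, length: number, alphabet: string = ALPHABET): string {
    let text = "";
    let bitPos = 0;
    for (let n = 0; n < length; n++) {
        let code = 0;
        for (let i = 0; i < 6; i++) {
            const bit = (bytes[bitPos >> 3] >> (7 - (bitPos & 7))) & 1;
            code = (code << 1) | bit;
            bitPos++;
        }
        text += alphabet.charAt(code);
    }
    return text;
}
```

With this scheme a 32-character string takes 24 bytes instead of 32, which is where the "quarter" saving comes from.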

mlomb commented 1 year ago

Not worth it tbh, decompression times may be a problem and we lose the ability to peek at the generated JSON

hopperelec commented 1 year ago

If it can be implemented, I think it would be very much worth it. It looks like most of the database is strings, and this would save at least a quarter of the data used by those strings. I don't know enough about how the database is compressed to speak on decompression times, but for peeking at the generated JSON, could there not just be a dedicated class for it, with toString overridden to display the plaintext and a separate serialization function used when writing/compressing the database?
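
A minimal sketch of what such a class could look like, assuming a hypothetical `CompactString` wrapper and a lowercase domain-name alphabet (neither exists in chat-analytics). `toString` and `toJSON` expose the plaintext for inspection, while `serialize` produces the 6-bit form with a one-byte length prefix, which limits this sketch to 255 characters.

```typescript
// Hypothetical alphabet for lowercase domain names (38 symbols, fits in 6 bits).
const DOMAIN_ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-.";

class CompactString {
    private readonly text: string;
    private readonly alphabet: string;

    constructor(text: string, alphabet: string = DOMAIN_ALPHABET) {
        if (text.length > 255) throw new Error("sketch only supports up to 255 characters");
        this.text = text;
        this.alphabet = alphabet;
    }

    // Plaintext view, so logs and JSON.stringify still show readable text.
    toString(): string { return this.text; }
    toJSON(): string { return this.text; }

    // Compact form: [length byte][6 bits per character, padded to a whole byte].
    serialize(): Uint8Array {
        const out = new Uint8Array(1 + Math.ceil((this.text.length * 6) / 8));
        out[0] = this.text.length;
        let acc = 0, bits = 0, pos = 1;
        for (let n = 0; n < this.text.length; n++) {
            const code = this.alphabet.indexOf(this.text.charAt(n));
            if (code < 0) throw new Error("character not in alphabet: " + this.text.charAt(n));
            acc = (acc << 6) | code;
            bits += 6;
            while (bits >= 8) {
                bits -= 8;
                out[pos++] = (acc >> bits) & 0xff;
            }
            acc &= (1 << bits) - 1; // keep only the bits not yet written
        }
        if (bits > 0) out[pos] = (acc << (8 - bits)) & 0xff; // flush padded tail
        return out;
    }

    static deserialize(bytes: Uint8Array, alphabet: string = DOMAIN_ALPHABET): CompactString {
        if (bytes.length === 0) throw new Error("missing length prefix");
        const length = bytes[0];
        let acc = 0, bits = 0, pos = 1, text = "";
        for (let n = 0; n < length; n++) {
            while (bits < 6) {
                acc = (acc << 8) | bytes[pos++];
                bits += 8;
            }
            bits -= 6;
            text += alphabet.charAt((acc >> bits) & 0x3f);
            acc &= (1 << bits) - 1; // drop the bits just consumed
        }
        return new CompactString(text, alphabet);
    }
}
```

For example, `JSON.stringify({ domain: new CompactString("example.com") })` still yields readable JSON, while `serialize()` is what a binary writer would call. Whether the extra decoding work is acceptable is exactly the decompression-time question raised above.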