`ENCODING ERROR` - Githubissues

tech-savvy-guy commented 2 years ago

Hi. I am having an issue with FPDF2.

I have a database as shown below.

Now, I want to create a pdf using these data. However, the username field in the database has a different encoding. When the PDF is created everything is working except for the username encoding problem.

Can anyone help?

Lucas-C commented 2 years ago

Hi and welcome @tech-savvy-guy !

I can try to help, but could you provide some minimal Python code reproducing your problem please? Without knowing what methods are called with what values, I cannot do much...

tech-savvy-guy commented 2 years ago

Thanks for offering help @Lucas-C !

Actually, I am creating a Telegram Bot that uses the Firebase API as its database framework.

Now, I want to store all the user interactions with my bot as logs on my database. I have already shared the general schema of my database.

Now, the problem is that various Telegram users have various kinds of names, with different fonts (to look fancy 😅). For example: This is my username ➤ 𝓢𝓸𝓱𝓪𝓶 𝓓𝓪𝓽𝓽𝓪

The above code uploads the logs to my database.

Now the function below retrieves the logs from the database and generates a PDF FILE using the FPDF2 module.

from fpdf import FPDF

def send_logs():
    # `database` is assumed to be the Firebase database handle set up elsewhere in the bot.
    headers, data = ["NAME", "CHAT ID", "USERNAME", "COMMAND", "TIME", "DATE"], []

    # Collect one row per log entry, in the same order as the headers.
    logs = database.child("Users").child("User Logs").get()
    for log in logs.each():
        entry = log.val()
        data.append([entry["name"], str(entry["chat_id"]), entry["username"],
                     entry["command"], entry["time"], entry["date"]])

    pdf = FPDF()
    pdf.add_font("Roboto", "", "rs_normal.ttf", uni=True)
    pdf.add_font("Roboto", "B", "rs_bold.ttf", uni=True)

    pdf.add_page()
    pdf.set_font("Roboto", "B", size=10)
    line_height = pdf.font_size * 2.5
    # Column widths as percentages of the effective page width (pdf.epw).
    col_width_list = [30, 10, 25, 16, 10, 10]

    # Header row in bold.
    for index, attr in enumerate(headers):
        col_width = (col_width_list[index] * pdf.epw) // 100
        pdf.multi_cell(col_width, line_height, attr, align="C", border=1, ln=3, max_line_height=pdf.font_size)
    pdf.ln(line_height)

    pdf.set_font("Roboto", size=8)

    # Data rows in the regular weight.
    for row in data:
        for index, datum in enumerate(row):
            col_width = (col_width_list[index] * pdf.epw) // 100
            pdf.multi_cell(col_width, line_height, datum, align="C", border=1, ln=3, max_line_height=pdf.font_size)
        pdf.ln(line_height)

    pdf.output('logs.pdf')

Now, this produces a PDF as shown in my initial comment. Only problem is that the Name field is not displayed properly. I think this has something to do with the font I am using to generate the PDF File. I am using the Roboto Slab font.

I hope this information is sufficient. Do let me know, if you need anything else...

gmischler commented 2 years ago

> Now, the problem is that various Telegram users have various kinds of names, with different fonts (to look fancy 😅). For example: This is my username ➤ 𝓢𝓸𝓱𝓪𝓶 𝓓𝓪𝓽𝓽𝓪

Your name tag is composed of rather exotic unicode characters:


➤	U+27A4	Black Rightwards Arrowhead
 	U+0020	Space
𝓢	U+1D4E2	Mathematical Bold Script Capital S
𝓸	U+1D4F8	Mathematical Bold Script Small O
𝓱	U+1D4F1	Mathematical Bold Script Small H
𝓪	U+1D4EA	Mathematical Bold Script Small A
𝓶	U+1D4F6	Mathematical Bold Script Small M
 	U+0020	Space
𝓓	U+1D4D3	Mathematical Bold Script Capital D
𝓪	U+1D4EA	Mathematical Bold Script Small A
𝓽	U+1D4FD	Mathematical Bold Script Small T
𝓽	U+1D4FD	Mathematical Bold Script Small T
𝓪	U+1D4EA	Mathematical Bold Script Small A

The first one is from the Unicode subset "Dingbats", the others (besides the spaces) from "Mathematical Alphanumeric Symbols". Most normal fonts won't contain glyphs for those characters, so they are displayed as little rectangles. If you want them to display correctly, you'll have to analyze each one, find the respective fonts that can handle them, and add them to the PDF using those fonts.
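As a rough sketch of that analysis step, Python's standard `unicodedata` module can list the code point and official name of every character in a string. The helper name `describe` is made up for this illustration, and it only inspects the text; it says nothing about which glyphs a particular font contains.

```python
import unicodedata

def describe(text):
    """Return (character, code point, Unicode name) for each character in text."""
    rows = []
    for ch in text:
        code_point = f"U+{ord(ch):04X}"              # conventional hexadecimal notation
        name = unicodedata.name(ch, "<unnamed>")     # fallback for characters without a name
        rows.append((ch, code_point, name))
    return rows

for ch, code_point, name in describe("➤ 𝓢𝓸𝓱𝓪𝓶 𝓓𝓪𝓽𝓽𝓪"):
    print(f"{ch}\t{code_point}\t{name}")
```

Running something like this over the stored names shows quickly which Unicode ranges actually occur in the data.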

Modern web browsers have this functionality built in, so that you can see them correctly on this page here. The Telegram client apparently does the same. But you can't really expect that from a self-described "simple" PDF library, so you'll have to roll your own solution for that.

Oh, and if you manage to figure out a general and complete solution, maybe you could contribute that as an extension to this project? :wink:

Btw.: A possible alternative approach would be to figure out the original characters that were substituted by those symbols. As long as only Mathematical Alphanumeric Symbols are involved, this would be a relatively short lookup table. But given the huge number of writing systems supported by Unicode, finding a suitable font might be easier in the general case.
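One way such a lookup table might be built, as a sketch rather than a definitive solution: the Unicode database records, for each character in the Mathematical Alphanumeric Symbols block (U+1D400 to U+1D7FF), a `<font>` compatibility mapping to the plain character, and `unicodedata.decomposition` exposes it. The function name `build_math_alnum_table` is invented for this example.

```python
import unicodedata

def build_math_alnum_table():
    """Map each Mathematical Alphanumeric Symbol to its plain counterpart."""
    table = {}
    for cp in range(0x1D400, 0x1D800):
        # Returns e.g. "<font> 0053" for U+1D4E2, or "" for unassigned code points.
        decomp = unicodedata.decomposition(chr(cp))
        if decomp.startswith("<font>"):
            target_hex = decomp.split()[1]
            table[cp] = chr(int(target_hex, 16))
    return table

table = build_math_alnum_table()
print("➤ 𝓢𝓸𝓱𝓪𝓶 𝓓𝓪𝓽𝓽𝓪".translate(table))
```

Characters outside that block, such as the arrowhead, pass through `str.translate` unchanged, so they would still need a font that covers them.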

Lucas-C commented 2 years ago

@allcontributors please add @gmischler for question

allcontributors[bot] commented 2 years ago

@Lucas-C

I've put up a pull request to add @gmischler! :tada:

Lucas-C commented 2 years ago

This is really a perfect answer @gmischler, thank you!

I have some good news though: this feature is built-in in Python:

>>> import unicodedata
>>> name = '𝓢𝓸𝓱𝓪𝓶 𝓓𝓪𝓽𝓽𝓪'
>>> unicodedata.normalize('NFKD', name)
'Soham Datta'
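A minimal sketch of how this could be applied to the table rows before they are written to the PDF. Two things here are assumptions, not part of the original code: the helper name `clean_text` and the sample row. The sketch also swaps in `NFKC` instead of `NFKD`; both fold the fancy letters to plain ones, but `NFKC` recomposes accents afterwards, so a letter like "é" stays a single code point.

```python
import unicodedata

def clean_text(value):
    """Fold compatibility characters (e.g. fancy math letters) to plain ones."""
    return unicodedata.normalize("NFKC", value)

# Hypothetical row, shaped like the rows built in send_logs().
rows = [["𝓢𝓸𝓱𝓪𝓶 𝓓𝓪𝓽𝓽𝓪", "12345", "soham", "/start", "10:00", "01-01-2022"]]
cleaned = [[clean_text(cell) for cell in row] for row in rows]
print(cleaned[0][0])
```

Note that characters without a compatibility mapping, such as ➤, come out of normalization unchanged and still depend on the font.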

gmischler commented 2 years ago

Ah, so my conclusion at the end was only partially correct. I was aware of Python's unicodedata, but didn't remember how powerful the normalize function was. As long as your users write their names in something that is based on the Latin script, this should indeed do the trick.

As soon as someone decides to write their name in e.g. Hindi though, you'll still need to find a font containing those characters...

And now I remember something else I hadn't originally thought about: There are some open source fonts that cover a wide range of Unicode characters. A popular one is, for example, GNU Unifont. If you use both normalization and a font like that, you might have covered most of your bases, and only rarely encounter one of those replacement rectangles.

Unless you want to preserve your "fancy styles". Then you'd have to play the font game all the way...

tech-savvy-guy commented 2 years ago

Thanks for providing a solution! May I ask, what is the purpose of the "NFKD" argument in the normalize function? I looked up the online documentation here, but didn't quite understand! 😅

tech-savvy-guy commented 2 years ago

Thanks for the reply @gmischler !

I have one question: I am using the Roboto Slab font to create my PDF files. Now, isn't Roboto Slab a Google font? So, technically it should cover all the Unicode characters?

tech-savvy-guy commented 2 years ago

Actually, I was thinking of something else for this issue.

Let's assume we are creating a table in an Excel Sheet. Now, there is a default font that applies to all the cells in the document, right? But say, I want to edit a cell in particular. That particular cell can have a different font, right? Now, what if we use this idea and make all the cells under the Name column font independent? It's as if, we are using CTRL+C and CTRL+V for entering the data there. So there is no particular font that we have specified, and hence there will be no � in the PDF generated...

Can this method be implemented in any way?

gmischler commented 2 years ago

> May I ask, what is the purpose of the "NFKD" argument in the normalize function?

That controls how combined (usually accented) characters are handled. The Python manual gives a few examples.
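To make the four options concrete, here is a small sketch comparing them. In the form names, a trailing "D" means the result is left decomposed (base letter plus combining mark), a trailing "C" means it is recomposed afterwards, and the "K" adds compatibility mappings, which is what turns the fancy letters into plain ones. The helper name `normal_forms` is invented for this example.

```python
import unicodedata

def normal_forms(text):
    """Return the text in all four Unicode normalization forms."""
    return {form: unicodedata.normalize(form, text)
            for form in ("NFC", "NFD", "NFKC", "NFKD")}

def code_points(text):
    """Show a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

for sample in ("\u00e9", "\U0001D4E2"):  # "é" and "𝓢"
    for form, result in normal_forms(sample).items():
        print(f"{code_points(sample):8} {form:5} -> {code_points(result)}")
```

For "é", the D forms give two code points (U+0065 U+0301) and the C forms give one (U+00E9). For "𝓢", only the K forms change anything, yielding a plain "S".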

> I have one question: I am using the Roboto Slab font to create my PDF files. Now, isn't Roboto Slab a Google font? So, technically it should cover all the Unicode characters?

What does being a "Google font" (technically: commissioned by Google) have to do with the selection of glyphs covered? Very few fonts include dingbats and mathematical symbols, anyway. And if a character is shown as a rectangle, then it very obviously isn't included.

> Now, what if we use this idea and make all the cells under the Name column font independent?

There can be no "font independent cells", in Excel or in a PDF. Any text necessarily has to have a font assigned (the "current font" in fpdf). If you CTRL+V in Excel, then the font assignment of the originating cell is just copied together with the text. How you automate something like that in Python is up to you, though I suspect that neither Telegram nor your database will tell you which font to use. You'll have to figure that out with the help of the unicodedata library module and maybe some additional data.
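One possible shape for that "figure out which font to use" step, as a sketch under stated assumptions: given the set of code points a primary font covers, split each string into runs that the primary font can draw and runs that need another font. The function name `split_by_coverage` and the `covered` set are hypothetical; how the coverage set is obtained (the "additional data") is left open here.

```python
def split_by_coverage(text, covered):
    """Split text into (in_primary, run) pairs.

    `covered` is a set of integer code points the primary font is assumed to support.
    """
    runs = []
    for ch in text:
        in_primary = ord(ch) in covered
        if runs and runs[-1][0] == in_primary:
            # Same font as the previous character: extend the current run.
            runs[-1] = (in_primary, runs[-1][1] + ch)
        else:
            runs.append((in_primary, ch))
    return runs

# Assumption for the demo only: pretend the primary font covers printable ASCII.
ascii_only = set(range(0x20, 0x7F))
for in_primary, run in split_by_coverage("➤ Soham Datta", ascii_only):
    print("primary " if in_primary else "fallback", repr(run))
```

Each run could then be written with the current font switched accordingly, which matches the point that every piece of text needs some font assigned.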

tech-savvy-guy commented 2 years ago

Alright, thanks for clearing up my questions! 😄

py-pdf / fpdf2

`ENCODING ERROR` #299