py-pdf / fpdf2

Simple PDF generation for Python
https://py-pdf.github.io/fpdf2/
GNU Lesser General Public License v3.0

I have a html link in a paragraph that cannot be converted in PDF link #505

Closed me-suzy closed 2 years ago

me-suzy commented 2 years ago

<p class="text_obisnuit">Intr-un articol precedent, <a href="https://neculaifantanaru.com/dupa-toate-regulile-artei.html"><em>Dupa toate regulile artei</em></a>, v-am povestit despre tanarul print Hamlet

It should look like this in the PDF:

Intr-un articol precedent, Dupa toate regulile artei, v-am povestit despre tanarul print Hamlet

Instead, this is how it looks in the PDF. Also, as you can see below, the `://` characters after `href=https` disappeared in the PDF:

![image](https://user-images.githubusercontent.com/2770489/186769126-d514e2ce-dc84-4974-be84-c7ad26d2a79e.png)

Lucas-C commented 2 years ago

As already stated in this comment:

please share some fully-autonomous minimal reproducible example so that we can replicate your problem.

If you do not provide us with some minimal Python code, we won't be able to help you much.

I was able to execute the following code without reproducing the issue you mentioned:

from fpdf import fpdf, html

class PDF(fpdf.FPDF, html.HTMLMixin):
    pass

pdf = PDF()
pdf.add_page()
pdf.add_font("Kanit", fname="fonts/Kanit-Regular.ttf")
pdf.add_font("Kanit", style="I", fname="fonts/Kanit-Italic.ttf")
pdf.set_font("Kanit", size=24)
pdf.write_html('<p class="text_obisnuit">Intr-un articol precedent, <a href="https://neculaifantanaru.com/dupa-toate-regulile-artei.html"><em>Dupa toate regulile artei</em></a>, v-am povestit despre tanarul print Hamlet</p>')
pdf.output("issue_498.pdf")
me-suzy commented 2 years ago

This is the complete PYTHON code.

1. Note also that the colon (:) is lost in the PDF, along with the uppercase letter that follows it:

For example:

Leadership: Takes into account the opinions of others in order to understand...

in the PDF it looks like this:

Leadership takes into account the opinions of others in order to understand...

2. The link problem, as I showed above.

3. The tag inside the paragraph, as I showed in the previous bug report.

HTML: <p class="text_obisnuit2"><em>My Name is Prince</em></p>

In the PDF the inner tag is still there, like this:

<em>My Name is Prince</em>

Here is an example of one of my HTML pages. Copy it into an HTML file and test it. You can duplicate this HTML into as many pages as you want, because the Python code also merges the resulting PDFs (that part works great).

https://hastebin.com/puxecelivi.http

MY PYTHON CODE:

from fpdf import fpdf, html
import os
import re

from PyPDF2 import PdfFileMerger

def read_text_from_file(file_path):
    """
    Return the contents of a file.
    file_path: path to the file you want to read
    """
    with open(file_path, encoding='utf8', errors='ignore') as f:
        return f.read()

def write_to_file(text, file_path):
    """
    Write text to a file.
    text: the text you want to write
    file_path: path to the file you want to write to
    """
    with open(file_path, 'wb') as f:
        f.write(text.encode('utf8', 'ignore'))

dict_simboluri = dict()
dict_simboluri['&#259;'] = 'a'
dict_simboluri['&#226;'] = 'a'

def save_to_pdf(directory_path):
    for root, dirs, files in os.walk(directory_path):
        for file_name in files:
            if file_name.endswith(".html"):
                file_path = root + os.sep + file_name
                file_content = read_text_from_file(file_path)

                # create the PDF file
                class PDF(fpdf.FPDF, html.HTMLMixin):
                    pass

                pdf = PDF()
                pdf.add_page()
                pdf.set_font('helvetica', size=12)

                # extract the article title
                den_articol = re.search('<td><h1 class="den_articol" itemprop="name">(.*?)</h1></td>', file_content)
                if (den_articol == None):
                    print("Nu am gasit --- denumire articol --- in fisierul --- {} ---.".format(file_path))
                else:
                    den_articol = den_articol.group(1)
                    for simbol in dict_simboluri.keys():
                        den_articol = den_articol.replace(simbol, dict_simboluri[simbol])

                pdf.set_text_color(204, 0, 0) # red
                pdf.set_font('helvetica', size=14, style="B")
                pdf.multi_cell(w=190, txt=den_articol)
                pdf.ln()
                pdf.set_font('helvetica', size=12)

                # extract the date
                date = re.search('<td class="text_dreapta">(.*?), in <a', file_content)
                if (date == None):
                    print("Nu am gasit --- date --- in fisierul --- {} ---.".format(file_path))
                else:
                    date = date.group(1)

                pdf.set_text_color(0, 102, 204) # blue
                pdf.set_font('helvetica', size=8, style="B")
                pdf.cell(txt=date)
                pdf.ln()
                pdf.ln()
                pdf.ln()
                pdf.ln()
                pdf.set_text_color(0, 0, 0) # black (default)
                pdf.set_font('helvetica', size=12)

                # extract the article text
                articol = re.search('<!-- ARTICOL START -->([\s\S]*?)<!-- ARTICOL FINAL -->', file_content)
                if (articol == None):
                    print("Nu am gasit --- ARTICOL START/FINAL --- in fisierul --- {} ---.".format(file_path))
                else:
                    articol = articol.group(1)
                    articol = articol.replace("&quot;", "\"")
                    articol = articol.replace("&rsquo;", "'")

                    # paragraphs
                    par_regex = re.compile('<p class="text_obisnuit.*?">.*?</p>')
                    pars = re.findall(par_regex, articol)
                    pars_text = list()

                    if (len(pars) == 0):
                        print("Nu am gasit -- paragrafe text_obisnuit -- in fisierul --- {} ---.".format(file_path))
                    else:
                        for i in range(0, len(pars)):
                            if ('<p class="text_obisnuit">' in pars[i]):

                                # match the text_obisnuit class and grab the text
                                content = re.findall('<p class="text_obisnuit">(.*?)</p>', pars[i])
                                if (len(content) == 0):
                                    print("Nu am gasit text in paragraful {}, fisierul {}.".format(pars[i], file_path))
                                else:

                                    # put the text in a multi_cell
                                    for simbol in dict_simboluri.keys():
                                        content[0] = content[0].replace(simbol, dict_simboluri[simbol])
                                    pars_text.append(content[0])
                                    pdf.multi_cell(w=190, txt = content[0])

                                    # add a blank line between paragraphs
                                    pdf.ln()
                            elif ('<p class="text_obisnuit2">' in pars[i]):

                                # match the text_obisnuit2 class and grab the text
                                content = re.findall('<p class="text_obisnuit2">(.*?)</p>', pars[i])
                                if (len(content) == 0):
                                    print("Nu am gasit text in paragraful {}, fisierul {}.".format(pars[i], file_path))
                                else:

                                    # switch to the bold font
                                    pdf.set_font('helvetica', size=12, style="B")

                                    # put the text in a multi_cell
                                    for simbol in dict_simboluri.keys():
                                        content[0] = content[0].replace(simbol, dict_simboluri[simbol])
                                    pars_text.append(content[0])
                                    pdf.multi_cell(w=190, txt = content[0])

                                    # add a blank line between paragraphs
                                    pdf.ln()

                                    # reset the font
                                    pdf.set_font('helvetica', size=12)
                            else:
                                continue

                    # add the source link
                    pdf.ln()
                    pdf.ln()
                    pdf.set_font('helvetica', size=12, style="B")
                    pdf.cell(txt="Source:")
                    pdf.set_font('helvetica', size=12)
                    pdf.set_text_color(0, 102, 204) # blue
                    pdf.cell(w=40, txt="https://neculaifantanaru.com/{}".format(file_name), link="https://neculaifantanaru.com/{}".format(file_name))

                    den_fisier = file_path.split('.')[0] + '.pdf'
                    pdf.output(den_fisier)
                    # break;

# merge several PDF files into one
def merge_pdf_files(directory_path):
    merger = PdfFileMerger()
    for root, dirs, files in os.walk(directory_path):
        for file_name in files:
            if file_name.endswith(".pdf"):
                print("PDF: ", file_name)
                file_path = root + os.sep + file_name
                merger.append(file_path)
        merger.write(root + os.sep + "articles.pdf")
        merger.close()
        break;

save_to_pdf("c:\\Folder5\\")
merge_pdf_files("c:\\Folder5\\")
RedShy commented 2 years ago

Hi @me-suzy! If I understood correctly, the issue arises because you don't seem to use pdf.write_html(). For example, with this code

def issue():
  file_path = "./issue.html"
  file_content = read_text_from_file(file_path)

  # create the PDF file
  class PDF(fpdf.FPDF, html.HTMLMixin):
      pass

  pdf = PDF()
  pdf.add_page()

  # extract the article title
  den_articol = re.search('<td><h1 class="den_articol" itemprop="name">(.*?)</h1></td>', file_content)
  if (den_articol == None):
      print("Nu am gasit --- denumire articol --- in fisierul --- {} ---.".format(file_path))
  else:
      den_articol = den_articol.group(1)
      for simbol in dict_simboluri.keys():
          den_articol = den_articol.replace(simbol, dict_simboluri[simbol])

  pdf.set_text_color(204, 0, 0) # red
  pdf.add_font("Kanit", fname="fonts/Kanit-Regular.ttf")
  pdf.set_font('Kanit', size=14)
  pdf.multi_cell(w=190, txt=den_articol)
  pdf.ln()

  pdf.output("issue.pdf")

the header is shown like this:

image

Instead, if I change

pdf.multi_cell(w=190, txt=den_articol)

to

pdf.write_html(text=f'<h1 class="den_articol" itemprop="name">{den_articol}</h1>')

the header seems to be shown correctly:

image

Note also that with the helvetica font, fpdf2 complained that helvetica doesn't support the ş character, so I had to switch to Kanit.

Lucas-C commented 2 years ago

Thank you for jumping in with this great answer @RedShy!

@all-contributors please add @RedShy for question

allcontributors[bot] commented 2 years ago

@Lucas-C

I've put up a pull request to add @RedShy! :tada:

me-suzy commented 2 years ago

It is not about the HTML title tag. It is about the tags inside the paragraphs. See this, where I pointed out the problem:

image

me-suzy commented 2 years ago

I changed it, but the result is exactly the same. You can see the entity &#351; for the diacritic in my title, "Abracadabra, cine e&#351;ti".

This is why I created dict_simboluri at the beginning of the Python code, to automatically transform entities such as &#351; into ş.

image
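
The entity substitutions that `dict_simboluri` performs by hand can also be done with the standard library. A minimal sketch, assuming the goal is only to turn character references such as `&#351;` into the characters they stand for; the helper name `decode_entities` is hypothetical and not part of the script above:

```python
import html


def decode_entities(fragment: str) -> str:
    """Replace HTML character references with the characters they denote."""
    # html.unescape handles named (&icirc;), decimal (&#351;) and
    # hexadecimal (&#x21B;) references in one pass.
    return html.unescape(fragment)


# Title taken from the thread.
print(decode_entities("Abracadabra, cine e&#351;ti"))
```

Note that `html.unescape` also decodes `&amp;`, `&lt;` and `&gt;`, so applying it to text that is later parsed as HTML may change how that text is interpreted.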

RedShy commented 2 years ago

To render the paragraphs correctly as well, you should change the corresponding lines too. For example, in this section

# match the text_obisnuit class and grab the text
content = re.findall('<p class="text_obisnuit">(.*?)</p>', pars[i])
if (len(content) == 0):
    print("Nu am gasit text in paragraful {}, fisierul {}.".format(pars[i], file_path))
else:

    # put the text in a multi_cell
    for simbol in dict_simboluri.keys():
        content[0] = content[0].replace(simbol, dict_simboluri[simbol])
    pars_text.append(content[0])
    pdf.multi_cell(w=190, txt = content[0])

I changed

pdf.multi_cell(w=190, txt = content[0])

to

pdf.write_html(text=f'<p class="text_obisnuit">{content[0]}</p>')

and in this other section

  # match the text_obisnuit2 class and grab the text
  content = re.findall('<p class="text_obisnuit2">(.*?)</p>', pars[i])
  if (len(content) == 0):
      print("Nu am gasit text in paragraful {}, fisierul {}.".format(pars[i], file_path))
  else:

      # switch to the bold font
      pdf.set_font('Kanit', size=12, style="B")

      # put the text in a multi_cell
      for simbol in dict_simboluri.keys():
          content[0] = content[0].replace(simbol, dict_simboluri[simbol])
      pars_text.append(content[0])
      pdf.multi_cell(w=190, txt = content[0])

I changed

pdf.multi_cell(w=190, txt = content[0])

to

pdf.write_html(text=f'<p class="text_obisnuit2">{content[0]}</p>')

A segment of the resulting PDF that I obtain is this:

image

In general, if you want HTML tags to be rendered in the PDF, you have to use the pdf.write_html() function.

I used the latest version of fpdf2, installed by running pip install git+https://github.com/PyFPDF/fpdf2.git@master

If you have any more doubts, feel free to keep asking!

me-suzy commented 2 years ago

I made those two changes, and I get this error:

image

me-suzy commented 2 years ago

Also, I get a second error after changing the second line you mentioned:

image

Lucas-C commented 2 years ago

Providing a screenshot of your IDE with a line of code in red is not very helpful... A full error stacktrace would be a lot more useful.

Also, you did not provide any minimal code associated with the last errors you faced: how do you expect us to help you without sharing the underlying code triggering the problem?

Other fpdf2 contributors may have suggestions to help you, and I thank them for their patience and will to help!

As for myself, I'm sorry but I won't try to figure out what the problem is without seeing any code, nor take the time to read through all the previous 150+ lines of code you provided. The idea of writing a minimal code sample reproducing the problem is that you take the time to narrow the issue down to something "atomic", easy to analyze and reason about, before asking other people for help. You can find more information about how to proceed here: https://stackoverflow.com/help/minimal-reproducible-example

I'll be glad to help you if you take the time to provide a minimal reproducible example and the associated full stacktrace.

me-suzy commented 2 years ago
C:\Users\Castel\AppData\Roaming\Python\Python310\site-packages\fpdf\fpdf.py:1904: UserWarning: Substituting font arial by core font helvetica
  warnings.warn(
PDF:  abordarea-frontala-a-lucrurilor-neelucidate.pdf
PDF:  abracadabra-cine-esti.pdf
PDF:  accente-pronuntate-in-leadership.pdf
>>> 
*** Remote Interpreter Reinitialized ***
C:\Users\Castel\AppData\Roaming\Python\Python310\site-packages\fpdf\fpdf.py:1904: UserWarning: Substituting font arial by core font helvetica
  warnings.warn(
Traceback (most recent call last):
  File "C:\Folder5\Convert all html to PDF in a single book - BEBE.py", line 281, in <module>
    save_to_pdf("c:\\Folder5\\")
  File "C:\Folder5\Convert all html to PDF in a single book - BEBE.py", line 226, in save_to_pdf
    pdf.write_html(text=f'<p class="text_obisnuit">{content[0]}</p>')
  File "C:\Users\Castel\AppData\Roaming\Python\Python310\site-packages\fpdf\html.py", line 736, in write_html
    h2p.feed(text)
  File "C:\Program Files\Python39\lib\html\parser.py", line 110, in feed
    self.goahead(0)
  File "C:\Program Files\Python39\lib\html\parser.py", line 170, in goahead
    k = self.parse_starttag(i)
  File "C:\Program Files\Python39\lib\html\parser.py", line 344, in parse_starttag
    self.handle_starttag(tag, attrs)
  File "C:\Users\Castel\AppData\Roaming\Python\Python310\site-packages\fpdf\html.py", line 447, in handle_starttag
    self.href = attrs["href"]
KeyError: 'href'
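
The last frame of the traceback is `self.href = attrs["href"]`, which suggests that the paragraph being rendered contains an `<a>` tag with no `href` attribute (a named anchor, for instance). A minimal standard-library sketch for locating such tags before calling `write_html()`; the class name, the function name and the sample string are all hypothetical:

```python
from html.parser import HTMLParser


class AnchorAudit(HTMLParser):
    """Collect <a> start tags that carry no href attribute."""

    def __init__(self):
        super().__init__()
        self.missing_href = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; dict() makes lookup easy.
        if tag == "a" and "href" not in dict(attrs):
            self.missing_href.append(self.get_starttag_text())


def anchors_without_href(fragment):
    """Return the raw text of every <a> tag in fragment lacking href."""
    audit = AnchorAudit()
    audit.feed(fragment)
    audit.close()
    return audit.missing_href


# Made-up sample: one named anchor, one regular link.
sample = '<p>See <a name="top">here</a> and <a href="https://example.com">there</a></p>'
print(anchors_without_href(sample))
```

Running this over each extracted paragraph would show which one triggers the `KeyError`, which also narrows the problem down to a small reproducible input.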
me-suzy commented 2 years ago

So, I tried changing the font to each of Arial, Times and Kanit, and I get the same error:

fpdf.errors.FPDFUnicodeEncodingException: Character "ă" at index 45 in text is outside the range of characters supported by the font used: "helvetica". Please consider using a Unicode font.
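
The message points at a single character that the selected font cannot encode. A rough standard-library sketch for listing such characters up front, under the assumption that fpdf2's built-in core fonts (helvetica, times, courier) cover only the Latin-1 range; the function name is hypothetical:

```python
def chars_outside_latin1(text):
    """Return (index, character) pairs that cannot be encoded as Latin-1."""
    offenders = []
    for index, char in enumerate(text):
        try:
            char.encode("latin-1")
        except UnicodeEncodeError:
            offenders.append((index, char))
    return offenders


# Romanian sample text; the diacritics are outside Latin-1.
print(chars_outside_latin1("cine e\u0219ti, s\u0103"))
```

Any character this reports needs a Unicode font registered with `add_font()`, as in the Kanit example earlier in the thread.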

me-suzy commented 2 years ago

After updating my code with the new font and modifying those two lines, I get this error (I didn't have this error before the change):

*** Remote Interpreter Reinitialized ***
C:\Users\Castel\AppData\Roaming\Python\Python310\site-packages\fpdf\fpdf.py:1799: UserWarning: Core font or font already added 'kanit': doing nothing
  warnings.warn(f"Core font or font already added '{fontkey}': doing nothing")
Traceback (most recent call last):
  File "C:\Folder5\Convert all html to PDF in a single book - BEBE.py", line 175, in <module>
    save_to_pdf("c:\\Folder5\\")
  File "C:\Folder5\Convert all html to PDF in a single book - BEBE.py", line 65, in save_to_pdf
    pdf.set_font('Kanit', size=14, style="B")
  File "C:\Users\Castel\AppData\Roaming\Python\Python310\site-packages\fpdf\fpdf.py", line 1931, in set_font
    raise FPDFException(
fpdf.errors.FPDFException: Undefined font: kanitB - Use built-in fonts or FPDF.add_font() beforehand
>>> 

This is the latest version of my Python code:

from fpdf import fpdf, html
import os
import re

from PyPDF2 import PdfFileMerger

def read_text_from_file(file_path):
    """
    Return the contents of a file.
    file_path: path to the file you want to read
    """
    with open(file_path, encoding='utf8', errors='ignore') as f:
        return f.read()

def write_to_file(text, file_path):
    """
    Write text to a file.
    text: the text you want to write
    file_path: path to the file you want to write to
    """
    with open(file_path, 'wb') as f:
        f.write(text.encode('utf8', 'ignore'))

dict_simboluri = dict()
dict_simboluri['&#259;'] = 'a'
dict_simboluri['&#226;'] = 'a'
dict_simboluri['&atilde;'] = 'a'
dict_simboluri['&acirc;'] = 'a'
dict_simboluri['&#x103;'] = 'a'
dict_simboluri['&#xE2;'] = 'a'

def save_to_pdf(directory_path):
    for root, dirs, files in os.walk(directory_path):
        for file_name in files:
            if file_name.endswith(".html"):
                file_path = root + os.sep + file_name
                file_content = read_text_from_file(file_path)

                # create the PDF file
                class PDF(fpdf.FPDF, html.HTMLMixin):
                    pass

                pdf = PDF()
                pdf.add_font("Kanit", fname="fonts/Kanit-Regular.ttf")
                pdf.add_font("Kanit", fname="fonts/Kanit-Bold.ttf")
                pdf.add_font("Kanit", style="I", fname="fonts/Kanit-Italic.ttf")
                pdf.set_font("Kanit", size=24)

                # extract the article title
                den_articol = re.search('<td><h1 class="den_articol" itemprop="name">(.*?)</h1></td>', file_content)
                if (den_articol == None):
                    print("Nu am gasit --- denumire articol --- in fisierul --- {} ---.".format(file_path))
                else:
                    den_articol = den_articol.group(1)
                    for simbol in dict_simboluri.keys():
                        den_articol = den_articol.replace(simbol, dict_simboluri[simbol])

                pdf.set_text_color(204, 0, 0) # red
                pdf.set_font('Kanit', size=14, style="B")
                pdf.multi_cell(w=190, txt=den_articol)
                pdf.ln()
                pdf.set_font('Kanit', size=12)

                # extract the date
                date = re.search('<td class="text_dreapta">(.*?), in <a', file_content)
                if (date == None):
                    print("Nu am gasit --- date --- in fisierul --- {} ---.".format(file_path))
                else:
                    date = date.group(1)

                pdf.set_text_color(0, 102, 204) # blue
                pdf.set_font('Kanit', size=8, style="B")
                pdf.cell(txt=date)
                pdf.ln()
                pdf.ln()
                pdf.ln()
                pdf.ln()
                pdf.set_text_color(0, 0, 0) # black (default)
                pdf.set_font('Kanit', size=12)

                # extract the article text
                articol = re.search('<!-- ARTICOL START -->([\s\S]*?)<!-- ARTICOL FINAL -->', file_content)
                if (articol == None):
                    print("Nu am gasit --- ARTICOL START/FINAL --- in fisierul --- {} ---.".format(file_path))
                else:
                    articol = articol.group(1)
                    articol = articol.replace("&quot;", "\"")
                    articol = articol.replace("&rsquo;", "'")

                    # paragraphs
                    par_regex = re.compile('<p class="text_obisnuit.*?">.*?</p>')
                    pars = re.findall(par_regex, articol)
                    pars_text = list()

                    if (len(pars) == 0):
                        print("Nu am gasit -- paragrafe text_obisnuit -- in fisierul --- {} ---.".format(file_path))
                    else:
                        for i in range(0, len(pars)):
                            if ('<p class="text_obisnuit">' in pars[i]):

                                # match the text_obisnuit class and grab the text
                                content = re.findall('<p class="text_obisnuit">(.*?)</p>', pars[i])
                                if (len(content) == 0):
                                    print("Nu am gasit text in paragraful {}, fisierul {}.".format(pars[i], file_path))
                                else:

                                    # put the text in a multi_cell
                                    for simbol in dict_simboluri.keys():
                                        content[0] = content[0].replace(simbol, dict_simboluri[simbol])
                                    pars_text.append(content[0])
                                    # pdf.multi_cell(w=190, txt = content[0])
                                    pdf.write_html(text=f'<p class="text_obisnuit">{content[0]}</p>')

                                    # add a blank line between paragraphs
                                    pdf.ln()
                            elif ('<p class="text_obisnuit2">' in pars[i]):

                                # match the text_obisnuit2 class and grab the text
                                content = re.findall('<p class="text_obisnuit2">(.*?)</p>', pars[i])
                                if (len(content) == 0):
                                    print("Nu am gasit text in paragraful {}, fisierul {}.".format(pars[i], file_path))
                                else:

                                    # switch to the bold font
                                    pdf.set_font('Kanit', size=12, style="B")

                                    # put the text in a multi_cell
                                    for simbol in dict_simboluri.keys():
                                        content[0] = content[0].replace(simbol, dict_simboluri[simbol])
                                    pars_text.append(content[0])
                                    # pdf.multi_cell(w=190, txt = content[0])
                                    pdf.write_html(text=f'<p class="text_obisnuit2">{content[0]}</p>')

                                    # add a blank line between paragraphs
                                    pdf.ln()

                                    # reset the font
                                    pdf.set_font('Kanit', size=12)
                            else:
                                continue

                    # add the source link
                    pdf.ln()
                    pdf.ln()
                    pdf.set_font('Kanit', size=12, style="B")
                    pdf.cell(txt="Source:")
                    pdf.set_font('Kanit', size=12)
                    pdf.set_text_color(0, 102, 204) # blue
                    pdf.cell(w=40, txt="https://neculaifantanaru.com/{}".format(file_name), link="https://neculaifantanaru.com/{}".format(file_name))

                    den_fisier = file_path.split('.')[0] + '.pdf'
                    pdf.output(den_fisier)
                    # break;

# merge several PDF files into one
def merge_pdf_files(directory_path):
    merger = PdfFileMerger()
    for root, dirs, files in os.walk(directory_path):
        for file_name in files:
            if file_name.endswith(".pdf"):
                print("PDF: ", file_name)
                file_path = root + os.sep + file_name
                merger.append(file_path)
        merger.write(root + os.sep + "articles.pdf")
        merger.close()
        break;

save_to_pdf("c:\\Folder5\\")
merge_pdf_files("c:\\Folder5\\")
RedShy commented 2 years ago

When you add the bold version of a font, you also need to pass style="B", so try changing pdf.add_font("Kanit", fname="fonts/Kanit-Bold.ttf") to pdf.add_font("Kanit", style="B", fname="fonts/Kanit-Bold.ttf").

Also add pdf.add_page() right after pdf = PDF().

me-suzy commented 2 years ago

ALMOST PERFECT !!

Except one thing. The bold font does not stand out

image

In my Python code, I set <p class="text_obisnuit2"> to be bold, but it is rendered only in italic. It must be both bold and italic.

The bold font does not stand out; maybe this is because of the Kanit font itself?

In the HTML, the first line is like this:

<p class="text_obisnuit2"><em>Pentru a cunoa&#351;te realitatea un lider trebuie s&#259; de&#355;in&#259; &#351;i arta disimul&#259;rii &ndash; o arm&#259; de temut, dar eficient&#259; &icirc;n cele mai multe situa&#355;ii.</em></p>

Code version 5 (almost perfect):

from fpdf import fpdf, html
import os
import re

from PyPDF2 import PdfFileMerger

def read_text_from_file(file_path):
    """
    Return the contents of a file.
    file_path: path to the file you want to read
    """
    with open(file_path, encoding='utf8', errors='ignore') as f:
        return f.read()

def write_to_file(text, file_path):
    """
    Write text to a file.
    text: the text you want to write
    file_path: path to the file you want to write to
    """
    with open(file_path, 'wb') as f:
        f.write(text.encode('utf8', 'ignore'))

dict_simboluri = dict()
dict_simboluri['&#259;'] = 'ă'
dict_simboluri['&#226;'] = 'â'
dict_simboluri['&atilde;'] = 'ã'
dict_simboluri['&acirc;'] = 'â'
dict_simboluri['&#x103;'] = 'ă'
dict_simboluri['&#xE2;'] = 'â'

dict_simboluri['  '] = ' '

dict_simboluri['&icirc;'] = 'î'
dict_simboluri['&Icirc;'] = 'Î'
dict_simboluri['&#238;'] = 'î'
dict_simboluri['&#206;'] = 'Î'
dict_simboluri['&#xEE;'] = 'î'
dict_simboluri['&#xCE;'] = 'Î'

dict_simboluri['&nbsp;'] = ' '

dict_simboluri['&#537;'] = 'ș'
dict_simboluri['&#536;'] = 'Ș'
dict_simboluri['&#350;'] = 'Ş'
dict_simboluri['&#x219;'] = 'ș'
dict_simboluri['&#351;'] = 'ș'

dict_simboluri['&amp;'] = ''

dict_simboluri['&#539;'] = 'ț'
dict_simboluri['&#355;'] = 'ț'
dict_simboluri['&#354;'] = 'Ţ'
dict_simboluri['&#x21B;'] = 'ț'

def save_to_pdf(directory_path):
    for root, dirs, files in os.walk(directory_path):
        for file_name in files:
            if file_name.endswith(".html"):
                file_path = root + os.sep + file_name
                file_content = read_text_from_file(file_path)

                # create the PDF file
                class PDF(fpdf.FPDF, html.HTMLMixin):
                    pass

                pdf = PDF()
                pdf.add_page()
                pdf.add_font("Kanit", fname="fonts/Kanit-Regular.ttf")
                pdf.add_font("Kanit", style="B", fname="fonts/Kanit-Bold.ttf")
                pdf.add_font("Kanit", style="I", fname="fonts/Kanit-Italic.ttf")
                pdf.set_font("Kanit", size=24)

                # extract the article title
                den_articol = re.search('<td><h1 class="den_articol" itemprop="name">(.*?)</h1></td>', file_content)
                if (den_articol == None):
                    print("Nu am gasit --- denumire articol --- in fisierul --- {} ---.".format(file_path))
                else:
                    den_articol = den_articol.group(1)
                    for simbol in dict_simboluri.keys():
                        den_articol = den_articol.replace(simbol, dict_simboluri[simbol])

                pdf.set_text_color(204, 0, 0) # red
                pdf.set_font('Kanit', size=14, style="B")
                pdf.multi_cell(w=190, txt=den_articol)
                pdf.ln()
                pdf.set_font('Kanit', size=12)

                # extract the date
                date = re.search('<td class="text_dreapta">(.*?), in <a', file_content)
                if (date == None):
                    print("Nu am gasit --- date --- in fisierul --- {} ---.".format(file_path))
                else:
                    date = date.group(1)

                pdf.set_text_color(0, 102, 204) # blue
                pdf.set_font('Kanit', size=8, style="B")
                pdf.cell(txt=date)
                pdf.ln()
                pdf.ln()
                pdf.ln()
                pdf.ln()
                pdf.set_text_color(0, 0, 0) # black (default)
                pdf.set_font('Kanit', size=12)

                # extract the article text
                articol = re.search('<!-- ARTICOL START -->([\s\S]*?)<!-- ARTICOL FINAL -->', file_content)
                if (articol == None):
                    print("Nu am gasit --- ARTICOL START/FINAL --- in fisierul --- {} ---.".format(file_path))
                else:
                    articol = articol.group(1)
                    articol = articol.replace("&quot;", "\"")
                    articol = articol.replace("&rsquo;", "'")

                    # paragraphs
                    par_regex = re.compile('<p class="text_obisnuit.*?">.*?</p>')
                    pars = re.findall(par_regex, articol)
                    pars_text = list()

                    if (len(pars) == 0):
                        print("Nu am gasit -- paragrafe text_obisnuit -- in fisierul --- {} ---.".format(file_path))
                    else:
                        for i in range(0, len(pars)):
                            if ('<p class="text_obisnuit">' in pars[i]):

                                # match the text_obisnuit class and grab the text
                                content = re.findall('<p class="text_obisnuit">(.*?)</p>', pars[i])
                                if (len(content) == 0):
                                    print("Nu am gasit text in paragraful {}, fisierul {}.".format(pars[i], file_path))
                                else:

                                    # put the text in a multi_cell
                                    for simbol in dict_simboluri.keys():
                                        content[0] = content[0].replace(simbol, dict_simboluri[simbol])
                                    pars_text.append(content[0])
                                    # pdf.multi_cell(w=190, txt = content[0])
                                    pdf.write_html(text=f'<p class="text_obisnuit">{content[0]}</p>')

                                    # add a blank line between paragraphs
                                    pdf.ln()
                            elif ('<p class="text_obisnuit2">' in pars[i]):

                                # match the text_obisnuit2 class and take its text
                                content = re.findall('<p class="text_obisnuit2">(.*?)</p>', pars[i])
                                if not content:
                                    print("Could not find text in paragraph {}, file {}.".format(pars[i], file_path))
                                else:

                                    # switch to the bold font
                                    pdf.set_font('Kanit', size=12, style="B")

                                    # replace the symbols, then write the text (multi_cell originally, now write_html)
                                    for simbol in dict_simboluri.keys():
                                        content[0] = content[0].replace(simbol, dict_simboluri[simbol])
                                    pars_text.append(content[0])
                                    # pdf.multi_cell(w=190, txt = content[0])
                                    pdf.write_html(text=f'<p class="text_obisnuit2">{content[0]}</p>')

                                    # add a blank line between paragraphs
                                    pdf.ln()

                                    # reset the font
                                    pdf.set_font('Kanit', size=12)
                            else:
                                continue

                    # add the source link
                    pdf.ln()
                    pdf.ln()
                    pdf.set_font('Kanit', size=12, style="B")
                    pdf.cell(txt="Source:")
                    pdf.set_font('Kanit', size=12)
                    pdf.set_text_color(0, 102, 204) # blue
                    pdf.cell(w=40, txt="https://neculaifantanaru.com/{}".format(file_name), link="https://neculaifantanaru.com/{}".format(file_name))

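                    # build the output name from the path without its extension (safe when the path contains other dots)
                    den_fisier = os.path.splitext(file_path)[0] + '.pdf'
                    pdf.output(den_fisier)
                    # break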

# function that merges several PDF files into one
def merge_pdf_files(directory_path):
    merger = PdfFileMerger()
    for root, dirs, files in os.walk(directory_path):
        for file_name in files:
            if file_name.endswith(".pdf"):
                print("PDF: ", file_name)
                file_path = root + os.sep + file_name
                merger.append(file_path)
        merger.write(root + os.sep + "articles.pdf")
        merger.close()
        break

save_to_pdf("c:\\Folder5\\")
merge_pdf_files("c:\\Folder5\\")
RedShy commented 2 years ago

If you want both bold and italic, you need to add the corresponding font. Add pdf.add_font("Kanit", style="BI", fname="fonts/Kanit-BoldItalic.ttf") right after pdf.add_font("Kanit", style="I", fname="fonts/Kanit-Italic.ttf").

Also, calling pdf.set_font('Kanit', size=12, style="B") does not make the text bold here; you need to use an HTML tag instead, e.g. wrap the text in <b>...</b>.

You could change pdf.write_html(text=f'<p class="text_obisnuit2">{content[0]}</p>') to pdf.write_html(text=f'<p class="text_obisnuit2"><b>{content[0]}</b></p>').
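
A minimal sketch of that change, written as a small helper so both paragraph classes go through the same code path. The names CLASS_TO_TAG and paragraph_html are hypothetical (they are not part of fpdf2 or of the script above); the helper only builds the HTML string that would then be passed to pdf.write_html.

```python
# Hypothetical mapping from the site's CSS classes to an inline HTML tag.
# Only text_obisnuit2 is mapped, because that is the class that should be bold.
CLASS_TO_TAG = {"text_obisnuit2": "b"}


def paragraph_html(css_class, content):
    """Return a <p> element, wrapping the content in a tag when the class calls for it."""
    tag = CLASS_TO_TAG.get(css_class)
    inner = f"<{tag}>{content}</{tag}>" if tag else content
    return f'<p class="{css_class}">{inner}</p>'


print(paragraph_html("text_obisnuit2", "Leadership"))
print(paragraph_html("text_obisnuit", "Leadership"))
```

In the script above, the two write_html calls would then become pdf.write_html(text=paragraph_html("text_obisnuit2", content[0])) and the same call with "text_obisnuit", and the set_font calls with style="B" around them would no longer be needed.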

me-suzy commented 2 years ago

ok, works.

One more thing. I also have another kind of tag inside a paragraph: a <span class="text_obisnuit2"></span> inside a paragraph that starts with <p class="text_obisnuit">, as below:

Example:

<p class="text_obisnuit"><span class="text_obisnuit2">My name is James:</span> and I want to go home by Night.</p>

It must look like this in the PDF ("My name is James:" in bold and the rest of the words as normal text):

My name is James: and I want to go home by Night.

Please tell me where and how to change my code to make this work.

RedShy commented 2 years ago

Currently, as written in the documentation, fpdf2 doesn't support CSS, so in this case you may want to replace <span class="text_obisnuit2"></span> with <b>...</b>.

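For example, you could use file_content = re.sub(r'<span class="text_obisnuit2">(.*?)</span>', r'<b>\g<1></b>', file_content) to do the replacement across the entire HTML file. The raw strings keep \g<1> intact, and the non-greedy (.*?) stops at the first closing </span>, so two spans on the same line are converted separately.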

Please tell me where and how to change my code to make this work.

I would put that line before everything else, just after opening the file, because I view it as pre-processing the file before using it with fpdf2.
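
A minimal sketch of that pre-processing step, using only the standard library. The names SPAN_BOLD and spans_to_bold are hypothetical, and the sample string is the paragraph from the question. The pattern uses a non-greedy group so that each span is matched on its own; it assumes the spans are not nested inside each other.

```python
import re

# Matches <span class="text_obisnuit2">...</span>; DOTALL lets a span run across line breaks.
SPAN_BOLD = re.compile(r'<span class="text_obisnuit2">(.*?)</span>', re.DOTALL)


def spans_to_bold(html):
    """Replace every text_obisnuit2 span with a <b> element, keeping the inner text."""
    return SPAN_BOLD.sub(r"<b>\g<1></b>", html)


sample = (
    '<p class="text_obisnuit"><span class="text_obisnuit2">My name is James:</span>'
    " and I want to go home by Night.</p>"
)
print(spans_to_bold(sample))
# <p class="text_obisnuit"><b>My name is James:</b> and I want to go home by Night.</p>
```

In the script, file_content = spans_to_bold(file_content) would go right after the file is read, before the ARTICOL START/FINAL search.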

me-suzy commented 2 years ago

Brilliant. Thanks.

me-suzy commented 2 years ago

I made a short tutorial with my code, which you helped me finish. Thanks for your help.

Maybe someone needs complete code for the fpdf library.

https://neculaifantanaru.com/en/python-convert-multiple-html-pages-into-one-pdf-file-with-libraria-fpdf-and-fpdf2.html

Lucas-C commented 2 years ago

Thank you for sharing your tutorial, @me-suzy. And thank you very much, @RedShy, for assisting here.

I'm closing this issue now, as things seem resolved.