rdvojmoc / DinkToPdf

C# .NET Core wrapper for wkhtmltopdf library that uses Webkit engine to convert HTML pages to PDF.
MIT License

Charset configuration does not work #49

Open agpcardoso opened 6 years ago

agpcardoso commented 6 years ago

Regardless of the configuration, the charset does not change.

1 - Below is the HTML that I'm trying to convert

<html>
<head>
    <meta http-equiv=Content-Type content="text/html; charset=us-ascii">
</head>
<body>
    Todo mundo é com acento.
</body>
</html>

2 - Below is my code to do the conversion

Controller Code

        [HttpGet("PrintTests")]
        public object PrintTests()
        {
            PDFDocument x = new PDFDocument();
            x.teste();
            return null;
        }

x.teste() Function Code

    public void teste()
    {
        string _content = string.Empty;
        // The using block disposes (and closes) the reader, so no explicit Close() is needed.
        using (System.IO.StreamReader file = new System.IO.StreamReader(@"C:\Temp\teste.html"))
        {
            _content = file.ReadToEnd();
        }

        var globalSettings = new GlobalSettings
        {
            ColorMode = ColorMode.Color,
            Orientation = Orientation.Portrait,
            PaperSize = PaperKind.A4,
            Margins = new MarginSettings { Top = 10 },
            DocumentTitle = "PDF Report",
            Out = @"C:\Temp\teste_1.pdf",

        };

        ObjectSettings _obsettings = new ObjectSettings();
        _obsettings.HtmlContent = _content;
        _obsettings.WebSettings.DefaultEncoding = "us-ascii";
        _obsettings.WebSettings.LoadImages = true;
        _obsettings.IncludeInOutline = true;

        var pdf = new HtmlToPdfDocument()
        {
            GlobalSettings = globalSettings,
            Objects = { _obsettings }
        };

        var _converter = new BasicConverter(new PdfTools());
        _converter.Convert(pdf);
        _converter.Tools.Dispose();
    }

3 - Below is my converted PDF file, teste_1.pdf
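
One possible contributing factor, offered as an assumption rather than something confirmed in this thread: `us-ascii` cannot represent `é` at all, and `new StreamReader(path)` decodes as UTF-8 unless the file has a byte order mark. If `teste.html` was saved as ISO-8859-1 (the workaround further down reads its file that way), the accent is already lost before the HTML reaches the converter. A minimal sketch of that mismatch, where `EncodingMismatchDemo` and `Decode` are hypothetical names used only for illustration:

```csharp
using System.Text;

public static class EncodingMismatchDemo
{
    // Hypothetical helper: decodes raw file bytes using the named encoding.
    public static string Decode(byte[] bytes, string encodingName)
    {
        return Encoding.GetEncoding(encodingName).GetString(bytes);
    }

    // Simulates the bytes of a file that was saved as ISO-8859-1 (Latin-1).
    public static byte[] Latin1Bytes(string text)
    {
        return Encoding.GetEncoding("iso-8859-1").GetBytes(text);
    }
}
```

With `var bytes = EncodingMismatchDemo.Latin1Bytes("é");`, decoding via `Decode(bytes, "iso-8859-1")` gives back `é`, while `Decode(bytes, "utf-8")` does not, because the single byte 0xE9 is not a valid UTF-8 sequence. If this is the cause, reading the file with its real encoding and declaring a charset that can represent the accented characters (for example `utf-8` in both the `<meta>` tag and `WebSettings.DefaultEncoding`) would be the first thing to try.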

agpcardoso commented 6 years ago

As a workaround, I had to create this function, which applies HttpUtility.HtmlEncode only to the text content, ignoring all HTML and script code.

If anyone needs it, my code is below.

Usage

string _returnHtmlTreated = this.ApplyHtmlTreatments(@"\Letters\ModelsLetterShipment\", "Model1.htm");


        public string ApplyHtmlTreatments(string directoryModeloHtml, string nameFileModeloHtml)
        {
            string _conteudoArquivoPorLinha = string.Empty;
            StringBuilder _retornoConteudoTratado = new StringBuilder();
            StringBuilder _todoHtml = new StringBuilder();

            //Open the HTML file
            //-------------------
            using (System.IO.StreamReader file = new System.IO.StreamReader(directoryModeloHtml + nameFileModeloHtml, Encoding.GetEncoding("iso-8859-1")))
            {
                //Concatenate all the content, line by line, into the _todoHtml variable
                //-------------------------------------------------------------
                while ((_conteudoArquivoPorLinha = file.ReadLine()) != null)
                    _todoHtml.Append(_conteudoArquivoPorLinha + " ");

                //Add 10 spaces before and after each HTML tag
                //--------------------------------------------------
                _todoHtml.Replace("<", "          <")
                         .Replace(">", ">          ")
                         .Replace("&nbsp;", "          &nbsp;          ");

                //Split on each run of 10 spaces, putting each fragment into an array of fragments
                //------------------------------------------------------------------------------------
                var _trechosArray = _todoHtml.ToString().Split("          ");

                //Loop over the fragments, applying the encoding ONLY to
                //fragments that are NOT HTML
                //-------------------------------------------------------------------------------------
                foreach (var _trecho in _trechosArray)
                {
                    string _trechoTratado = string.Empty;

                    //IF _trecho is NOT an HTML tag, encode it; otherwise do NOT encode it
                    if ((Regex.Match(_trecho.Trim(), @"<.*?>", RegexOptions.IgnoreCase).Success == false) && _trecho.Trim() != "&nbsp;")
                        _trechoTratado = HttpUtility.HtmlEncode(_trecho.Trim()) + " ";
                    else
                    {

                        //if this part is an img tag I set the complete path including the string file:///
                        if (_trecho.Trim().IndexOf("<img") >= 0)
                            _trechoTratado = _trecho.Trim().Replace("src=\"", "src=\"file:///" + @directoryModeloHtml.Replace(@"\",@"/"));
                        else
                            _trechoTratado = _trecho.Trim() + " ";
                    }

                    _retornoConteudoTratado.Append(_trechoTratado);
                }

                file.Close();
            }

            return _retornoConteudoTratado.ToString();

        }
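
A possibly simpler variant of the same idea, as a sketch under stated assumptions rather than a tested replacement: because HTML markup itself is plain ASCII, replacing every non-ASCII character with a numeric character reference leaves the tags untouched without splitting the document into fragments. `HtmlAsciiEscaper` and `EscapeNonAscii` are hypothetical names, not part of DinkToPdf, and the sketch assumes the HTML has already been read into a string with the correct encoding.

```csharp
using System.Text;

public static class HtmlAsciiEscaper
{
    // Hypothetical helper: replaces every non-ASCII character with a numeric
    // character reference (for example "é" becomes "&#233;").
    // ASCII characters, including all tag syntax, are copied unchanged.
    public static string EscapeNonAscii(string html)
    {
        var sb = new StringBuilder(html.Length);
        for (int i = 0; i < html.Length; i++)
        {
            char c = html[i];
            if (c < 128)
            {
                sb.Append(c);
            }
            else if (char.IsHighSurrogate(c) && i + 1 < html.Length
                     && char.IsLowSurrogate(html[i + 1]))
            {
                // Characters outside the BMP are stored as a surrogate pair;
                // emit a single reference for the combined code point.
                sb.Append("&#").Append(char.ConvertToUtf32(c, html[i + 1])).Append(';');
                i++; // skip the low surrogate that was just consumed
            }
            else
            {
                sb.Append("&#").Append((int)c).Append(';');
            }
        }
        return sb.ToString();
    }
}
```

For example, `HtmlAsciiEscaper.EscapeNonAscii("<body>Todo mundo é com acento.</body>")` returns `<body>Todo mundo &#233; com acento.</body>`. One caveat: character references are not decoded inside `<script>` or `<style>` blocks, so non-ASCII text there would need separate handling.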

amaters-easy commented 5 years ago

This may be related, so I'm posting it here. We have a similar problem when the provided HTML is not valid XHTML. For instance, <meta ....> should be closed in XHTML. If I provide the HTML without the closing tag, the generated PDF is a plain-text representation of the HTML. When I do provide the closing tag, a proper PDF is generated.
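
To catch this kind of problem before converting, one option is to check whether the markup parses as well-formed XML. The sketch below is only an illustration: `MarkupCheck` and `IsWellFormedXml` are hypothetical names, and the check is stricter than HTML itself requires, since it also rejects unquoted attribute values and named entities such as `&nbsp;` that XML does not define.

```csharp
using System.Xml;
using System.Xml.Linq;

public static class MarkupCheck
{
    // Hypothetical helper: returns true when the markup parses as
    // well-formed XML, which requires every element to be closed.
    public static bool IsWellFormedXml(string html)
    {
        try
        {
            XDocument.Parse(html);
            return true;
        }
        catch (XmlException)
        {
            // Thrown, for example, when <meta ...> is never closed.
            return false;
        }
    }
}
```

With this helper, `<html><head><meta charset="utf-8"></head><body>x</body></html>` is reported as not well formed, while the same document with `<meta charset="utf-8" />` passes.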