jstedfast / HtmlKit

A cross-platform .NET framework for parsing HTML

HtmlKit manipulates the attribute value when they contain HTML special entities #17

Closed: mohammadreza-plutoflume closed this issue 1 year ago

mohammadreza-plutoflume commented 1 year ago

Describe the bug: HtmlKit modifies attribute values when they contain HTML entities. We expect a returned attribute value to be literally equal to the text in the input stream.

Platform (please complete the following information):

To Reproduce

Steps to reproduce the behavior:

  1. Create a new Console application
  2. Add the following HTML file in the project and mark it as Copy if newer:

    <!DOCTYPE html>
    
    <html lang="en" xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta charset="utf-8" />
        <title></title>
    </head>
    <body>
        <a href="https://google.com/?q=val-&laquo;-val" name="val-&laquo;">The First Link</a>
        <br />
        <a href="https://google.com/?q=val-&amp;-val" name="val-&amp;">The Second Link</a>
    </body>
    </html>
  3. Copy and paste the following code into the Program.cs file:

    using System;
    using System.IO;
    using HtmlKit;
    
    namespace HtmlKitTestProject
    {
        internal class Program
        {
            static void Main(string[] args)
            {
                using var stream = new FileStream("index.html", FileMode.Open, FileAccess.Read);
                using var reader = new StreamReader(stream);
    
                var tokenizer = new HtmlTokenizer(reader);
                HtmlToken token;
    
                while (tokenizer.ReadNextToken(out token))
                {
                    switch (token.Kind)
                    {
                        case HtmlTokenKind.Tag:
                            var tag = (HtmlTagToken)token;
    
                            if (tag.Id != HtmlTagId.A)
                                continue;
    
                            foreach (var attribute in tag.Attributes)
                            {
                                if (attribute.Value != null)
                                    Console.WriteLine(" {0}={1}", attribute.Name, attribute.Value);
                                else
                                    Console.WriteLine(" {0}", attribute.Name);
                            }
                            break;
                    }
                }
    
                Console.ReadLine();
            }
        }
    }
  4. Run the project and check the output: [screenshot]

Expected behavior: The HTML file contains attributes whose values include HTML entities: [screenshot] When an attribute value is returned, it should be literally equal to the text in the input stream, but as you can see, the entities are converted to their decoded form: [screenshot]
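
To make the difference between the raw and decoded values concrete, here is a minimal sketch. It uses `System.Net.WebUtility` from the .NET base class library as a stand-in decoder; that choice is an assumption for illustration only and is not HtmlKit's own code path. The variable names are hypothetical.

```csharp
using System;
using System.Net;

// Raw attribute text as it appears in index.html (the two name="..." values).
string rawLaquo = "val-&laquo;";
string rawAmp = "val-&amp;";

// Assumption: WebUtility.HtmlDecode stands in for whatever decoding the tokenizer performs.
string decodedLaquo = WebUtility.HtmlDecode(rawLaquo);
string decodedAmp = WebUtility.HtmlDecode(rawAmp);

Console.WriteLine("{0} -> {1}", rawLaquo, decodedLaquo);
Console.WriteLine("{0} -> {1}", rawAmp, decodedAmp);

// Re-encoding a decoded value yields markup-safe text again,
// though not necessarily the exact entity spelling used in the source file.
string reencoded = WebUtility.HtmlEncode(decodedAmp);
Console.WriteLine("{0} -> {1}", decodedAmp, reencoded);
```

If the decoded form is what gets returned, re-encoding it along these lines may be one way to get markup-safe text back, but the original spelling of each entity would still have to come from the input itself.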

jstedfast commented 1 year ago

I don't understand why this is a bug?