jamietre / CsQuery

CsQuery is a complete CSS selector engine, HTML parser, and jQuery port for C# and .NET 4.

HTML comment nodes are retrieved as part of the .Text() method #167

Open pete-ppc opened 10 years ago

pete-ppc commented 10 years ago

CsQuery version: 1.3.4
.NET Framework: 4.5

Test case (VB):

Dim div As CsQuery.CQ = New CsQuery.CQ("<div>This is not a comment<!-- , but this is a comment -->, nor is this a comment.</div>")
Dim html As String = div.Html()
Dim text As String = div.Text()

html returns:

"This is not a comment<!-- , but this is a comment -->, nor is this a comment."

text returns:

"This is not a comment , but this is a comment , nor is this a comment."

jQuery, by way of comparison, returns the text content without the comment content:

console.log($('<div>This is not a comment<!-- , but this is a comment -->, nor is this a comment.</div>').text());
"This is not a comment, nor is this a comment."

My workaround was to instantiate the CsQuery.CQ object using the CsQuery.HtmlParsingOptions.IgnoreComments parsing option.
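For readers who want to try the workaround, here is a minimal C# sketch of it. It assumes that `CQ.Create` has an overload taking an `HtmlParsingMode` followed by an `HtmlParsingOptions` value; that signature is not shown anywhere in this thread, so check it against the CsQuery version you are using.

```csharp
using System;
using CsQuery;

class IgnoreCommentsDemo
{
    static void Main()
    {
        const string html =
            "<div>This is not a comment<!-- , but this is a comment -->, nor is this a comment.</div>";

        // Assumption: CQ.Create(html, parsingMode, parsingOptions) exists in this form.
        // IgnoreComments is the parsing option named in the workaround above.
        CQ div = CQ.Create(html, HtmlParsingMode.Auto, HtmlParsingOptions.IgnoreComments);

        // With comments dropped at parse time, Text() has no comment nodes to include.
        Console.WriteLine(div.Text());
    }
}
```

The trade-off is that the comments never enter the DOM at all, so `Html()` will no longer return them either.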

Thank you for this much-needed library.

marcselman commented 9 years ago

This issue is still present. Should this be fixed?

pete-ppc commented 9 years ago

My inclination would be to fix it: the purpose of this library seems to be to replicate jQuery's functionality, and Text() behaves differently from jQuery's text() here.

tariqporter commented 9 years ago

Comments are still being read by Text(). Sometimes an element contains IE conditional comments, which then incorrectly end up in the returned text: <!--[if gte mso 9]>...

marcselman commented 9 years ago

I've made two extension methods to strip comments:

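using System.Collections.Generic;
using CsQuery;

// Extension methods must live in a static class; the class name is arbitrary.
public static class CommentStrippingExtensions
{
    // Removes comment nodes, in place, from every node in the selection.
    public static CQ StripComments(this CQ cq)
    {
        if (cq == null) return cq;

        foreach (var element in cq)
        {
            element.StripComments();
        }

        return cq;
    }

    // Recursively removes comment nodes from the subtree below the given node.
    public static IDomObject StripComments(this IDomObject node)
    {
        if (node == null || node.ChildNodes == null) return node;

        // Collect first and remove afterwards, so that ChildNodes is not
        // modified while it is being enumerated.
        List<IDomObject> commentNodes = new List<IDomObject>();
        foreach (var childNode in node.ChildNodes)
        {
            if (childNode.NodeType == NodeType.COMMENT_NODE)
            {
                commentNodes.Add(childNode);
            }
            else if (childNode.ChildNodes != null && childNode.ChildNodes.Count > 0)
            {
                childNode.StripComments();
            }
        }
        foreach (var commentNode in commentNodes)
        {
            node.ChildNodes.Remove(commentNode);
        }

        return node;
    }
}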
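
A short usage sketch may help show how these extension methods fit in. It assumes the two `StripComments` methods posted above are compiled into the same project and are in scope; the sample markup and the class name `StripCommentsDemo` are made up for illustration.

```csharp
using System;
using CsQuery;

class StripCommentsDemo
{
    static void Main()
    {
        // Hypothetical input containing an IE conditional comment.
        CQ dom = CQ.Create("<div>Visible<!--[if gte mso 9]>hidden<![endif]--> text</div>");

        // StripComments() removes the comment nodes in place and returns the
        // same CQ object, so the call can be chained with Text().
        string text = dom.StripComments().Text();

        Console.WriteLine(text);
    }
}
```

Unlike the `IgnoreComments` parsing option, this approach parses the comments first and removes them afterwards, so it can be applied selectively to part of a document.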