jamietre / CsQuery

CsQuery is a complete CSS selector engine, HTML parser, and jQuery port for C# and .NET 4.

Weird characters in loaded HTML #187

Open marcselman opened 9 years ago

marcselman commented 9 years ago

Hi,

I noticed some weird characters popping up in the HTML when using CQ.CreateFromUrl. Here is an example:

var c = CQ.CreateFromUrl("http://www.cswonen.nl/sint-willebrord-monseigneur-van-hooydonkstraat-NLH00452695006");
c.Document.Body.OuterHTML.Dump();

When you execute the above example (in LINQPad, for instance), you'll notice this in the output:

<img src="http://public���������������������������������������������������������������������������������������������������������������������������������������������.parariusoffice.nl/45/photos/export/2695006.1429611799-844.jpg" alt="Foto van">

I have no idea where the weird characters come from. I don't see them in the HTML source when loading it in the browser or in Sublime Text. If I load the page into a string in C# and then load that string into a CQ object, it works without problems.

Do you have any idea what this could be? Thanks.

rufanov commented 9 years ago

It's a bug. There were unwanted nulls after the first packet from the web server if its size was less than 4096.
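
To make the symptom concrete, here is a minimal sketch of how trailing nulls can end up in decoded text when a short read is treated as a full buffer. This is not CsQuery's actual code: the class and method names (ShortReadDemo, DecodeIgnoringCount, DecodeUsingCount) are hypothetical, and the buffer size of 4096 is only assumed to match the number mentioned above.

```csharp
using System;
using System.IO;
using System.Text;

public static class ShortReadDemo
{
    // Assumed buffer size, chosen to match the 4096 mentioned above.
    private const int BufferSize = 4096;

    // Buggy pattern: the return value of Read is discarded, so the whole
    // buffer is decoded even though only part of it was filled.
    public static string DecodeIgnoringCount(Stream stream)
    {
        var buffer = new byte[BufferSize];
        stream.Read(buffer, 0, buffer.Length);
        // Unfilled bytes are still 0 and decode to U+0000 characters.
        return Encoding.UTF8.GetString(buffer);
    }

    // Correct pattern: decode only the bytes that Read reported.
    public static string DecodeUsingCount(Stream stream)
    {
        var buffer = new byte[BufferSize];
        int bytesRead = stream.Read(buffer, 0, buffer.Length);
        return Encoding.UTF8.GetString(buffer, 0, bytesRead);
    }

    public static void Main()
    {
        // A payload shorter than the buffer stands in for a short first packet.
        byte[] payload = Encoding.UTF8.GetBytes("<img src=\"http://public");

        string buggy = DecodeIgnoringCount(new MemoryStream(payload));
        string correct = DecodeUsingCount(new MemoryStream(payload));

        Console.WriteLine(buggy.Length);                 // 4096, padded with U+0000
        Console.WriteLine(buggy.IndexOf('\0') >= 0);     // True
        Console.WriteLine(correct.Length == payload.Length); // True for this ASCII payload
    }
}
```

The padded string in the buggy variant would show up as a run of unprintable characters in the middle of the markup, which is consistent with the output in the report, though the real cause inside CsQuery may differ in detail.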

Here is a slightly messy test that illustrates this bug.

using CsQuery.HtmlParser;
using Microsoft.VisualStudio.TestTools.UnitTesting;
using NUnit.Framework;
using System;
using System.IO;
using System.Text;
using Assert = NUnit.Framework.Assert;

namespace CsQuery.Tests.Issues
{
    [TestFixture, TestClass]
    public class Issue187 : CsQueryTest
    {
        [Test, TestMethod]
        public void Issue187Test()
        {
            using (var mockStream = new Issue187MockStream())
            {
                var factory = new ElementFactory();
                var dom = factory.Parse(mockStream, Encoding.UTF8);

                Assert.AreEqual(Issue187MockStream.HTML, dom.FirstChild.OuterHTML);
            }
        }
    }
    public class Issue187MockStream : Stream
    {
        public const string HTML = @"<html><head></head><body><a href=""http://test.example.com"">Test</a></body></html>";

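        public override int Read(byte[] buffer, int offset, int count)
        {
            byte[] bytes = Encoding.UTF8.GetBytes(HTML);

            // Split the payload in two so that the first Read returns fewer bytes
            // than requested, like a short first packet from a web server.
            // Assumes count is at least as large as each half.
            int splitPosition = bytes.Length / 2;
            int length;

            if (Position == 0)
            {
                // First call: return only the first half.
                length = splitPosition;
                Array.Copy(bytes, 0, buffer, offset, length);
            }
            else if (Position == splitPosition)
            {
                // Second call: return the remainder.
                length = bytes.Length - splitPosition;
                Array.Copy(bytes, splitPosition, buffer, offset, length);
            }
            else
            {
                // End of stream.
                length = 0;
            }

            Position += length;
            return length;
        }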

        public override bool CanRead { get { return true; } }
        public override bool CanSeek { get { return false; } }
        public override bool CanWrite { get { return false; } }

        public override long Position { get; set; }
        public override void Flush() { return; }

        public override long Length { get { throw new NotImplementedException(); } }
        public override long Seek(long offset, SeekOrigin origin) { throw new NotImplementedException(); }
        public override void SetLength(long value) { throw new NotImplementedException(); }
        public override void Write(byte[] buffer, int offset, int count) { throw new NotImplementedException(); }
    }
}