sjdirect / abot

Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.
Apache License 2.0

Encoding not applied when getting content from response on Abot2 #212

Closed marcmarsinach closed 4 years ago

marcmarsinach commented 4 years ago

On Abot2, the content from the response is read before the charset encoding is resolved. As a result, the content has broken special characters on non-UTF-8 charsets, for instance when getting content with charset windows-1252. In the meantime, I'm overriding WebContentExtractor with the classic Abot implementation.

The trick in the classic Abot implementation was passing the encoding (e) to the StreamReader. This is not currently done in the Abot2 implementation:

using (StreamReader sr = new StreamReader(memoryStream, e))
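To illustrate why the encoding argument matters, here is a minimal standalone sketch. It is not the Abot or Abot2 source: the class name `EncodingDemo`, the sample bytes, and the hard-coded charset are assumptions made for the example. It decodes the same windows-1252 bytes twice, once with the default `StreamReader` (UTF-8) and once with the resolved encoding passed in.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical demo class, not part of Abot.
class EncodingDemo
{
    static void Main()
    {
        // On .NET Core / .NET 5+, windows-1252 is only available after
        // registering the code pages provider (System.Text.Encoding.CodePages).
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Assumption: the charset was already resolved elsewhere,
        // e.g. from the Content-Type header.
        Encoding e = Encoding.GetEncoding("windows-1252");

        // "café" encoded as windows-1252: 'é' is the single byte 0xE9.
        byte[] body = { 0x63, 0x61, 0x66, 0xE9 };

        string withoutEncoding;
        using (var memoryStream = new MemoryStream(body))
        using (var sr = new StreamReader(memoryStream)) // defaults to UTF-8
        {
            withoutEncoding = sr.ReadToEnd();
        }

        string withEncoding;
        using (var memoryStream = new MemoryStream(body))
        using (var sr = new StreamReader(memoryStream, e)) // explicit encoding
        {
            withEncoding = sr.ReadToEnd();
        }

        // 0xE9 on its own is not valid UTF-8, so the first read loses the 'é'.
        Console.WriteLine(withoutEncoding == "café"); // False
        Console.WriteLine(withEncoding == "café");    // True
    }
}
```

The point of the sketch is ordering: the encoding has to be known before the stream is read, because a string decoded with the wrong charset cannot be reliably repaired afterwards.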

sjdirect commented 4 years ago

Fixed in NuGet version 2.0.55