tmenier / Flurl

Fluent URL builder and testable HTTP client for .NET
https://flurl.dev
MIT License
4.11k stars, 380 forks

Method GetStringAsync returns gibberish content when used on specific website #744

Closed hahyes closed 1 year ago

hahyes commented 1 year ago

Hi there,

I encountered a problem with GetStringAsync when calling it on the response.

Code:

var response = "https://us.shein.com/"
    .WithHeaders(new
    {
        Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
        Accept_Encoding = "gzip, deflate, br",
        Accept_Language = "pl,en;q=0.9,en-GB;q=0.8,en-US;q=0.7",
        Content_Type = "text/html; charset=utf-8",
        Connection = "keep-alive",
        User_Agent = "PostmanRuntime/7.31.1",
        Host = "us.shein.com",
        Cookie = "language=pl; cookieId=CA6087AC_5122_CF94_7D73_43D7245E54B4; cate_channel_type=2; sessionID_shein=s%3ACBYxpeXqeAPdCtu6CSqsehMabmbEMEC_.tBxRnMkSttO3AtRwWqwe7kchJoWal6lNT%2FuLWz65umI; _cfuvid=VFk57x18UOPOyRMOmx6EtR_iOh.CVlCxJgHc4K2GQCk-1678961686322-0-604800000; WEB_UGID_INIT=1; country=PL; countryId=172; have_show=1; hideCoupon=1; hideCouponWithRequest=1; hideCouponId_time=4051709_1; OptanonAlertBoxClosed=2023-03-16T10:15:01.027Z; ssrAbt=SellOutShow_type%3DB%23%23CccGoodsdetail_undefined%23%23SellingPoint_type%3Dsellingpoint%23%23outlocalsize_%7B%22typel%22%3A%22A%22%2C%22range%22%3A%22detail%22%7D%23%23Mall_1_0; currency=PLN; default_currency_expire=1; bi_session_id=bi_1678970487176_84496; default_currency=PLN; addressCookie=%7B%22countryId%22%3A%22172%22%2C%22createdTime%22%3A1678971385011%2C%22isUserHandle%22%3A%220%22%2C%22siteUid%22%3A%22pl%22%7D; __cf_bm=i9KbcSoFlJC2nF.9KWTWTv7S3w_WyFQFFz30wq7pRoc-1678972049-0-ATJ8W5I7OjvMIosUMwbnk+lUDvLm0Zkt//fxJ3Q20/9CT1/mCqRXbH3v3VRuQlNjI5HIersn29wTNqqDxpt40Fvd90P2Zl7U+VhKdlhE59Wxmct3geiCFdJehuqq5pbVinuwK3vSkyR34BZ+P9f0Pdn5uu50z/sFFy7YloETSMkx; OptanonConsent=isIABGlobal=false&datestamp=Thu+Mar+16+2023+14%3A08%3A17+GMT%2B0100+(czas+%C5%9Brodkowoeuropejski+standardowy)&version=6.13.0&hosts=&consentId=9a938d3f-95a9-4865-8e62-8012a63f49c7&interactionCount=1&landingPath=NotLandingPage&AwaitingReconsent=false&groups=C0001%3A1%2CC0002%3A0%2CC0003%3A0%2CC0004%3A0&iType=2&geolocation=PL%3B14",
        Upgrade_Insecure_Requests = 1
    }).AllowAnyHttpStatus().GetAsync().Result;

var htmlContent = response.GetStringAsync().Result;

As you can see, the code is rather simple. The result comes back in an unknown encoding. I tried UTF8, UTF7, UTF32, UTF16, Unicode, ASCII... nothing works. I also tried to detect the encoding by calling GetBytesAsync and passing the bytes to a library called UTF.Unknown, but it couldn't detect any encoding either.

I tried the same thing with RestSharp and got the HTML source in the proper encoding without any problems.

.NET 7.0, Flurl.Http 3.2.4, VS2022, Windows 11

tmenier commented 1 year ago

Sorry, I don't believe there's any bug here, and this is likely a programming question that I don't know the answer to. I can confidently say Flurl doesn't simply receive bytes from the server and convert them to gibberish, and I can't say what exactly RestSharp is doing differently in the case of this web site. I would suggest asking on Stack Overflow, where you'll reach a larger audience.

hahyes commented 1 year ago

@tmenier The problem is, the bytes are unrecognizable too. The returned bytes cannot be understood by any converter or encoding. This is weird behaviour that I can't do anything about, as I simply call the library to return the bytes to me. I even opened these bytes in HxD and they can't be translated into proper text, so there is no way Flurl.Http returned proper data.

I can't call it a "programming question" when I have no influence over the result of simply calling .GetBytesAsync().

tranb3r commented 1 year ago

I think it's caused by the br encoding in this line: Accept_Encoding = "gzip, deflate, br". That header tells the server it may send a Brotli-compressed response, and it looks like nothing on your side is decompressing it. Can you remove it? If you cannot remove it, then you'll need to decompress the response yourself before using it: https://stackoverflow.com/questions/57038748/decompress-brotli-httpresponse-in-a-httpclient. I don't know if/why RestSharp is doing it automatically.
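For anyone who can't drop br from the header, here is a minimal sketch of the manual decompression route. It assumes you already have the raw body bytes (for example from GetBytesAsync) and the value of the Content-Encoding response header. ContentDecoder and its methods are made-up helper names for illustration, not part of Flurl or HttpClient, and CompressBrotli is only there so the round trip can be tried without a server. Only br and gzip are handled, and the text is assumed to be UTF-8.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper for illustration only; not a Flurl or HttpClient API.
public static class ContentDecoder
{
    // Decompresses a raw response body based on its Content-Encoding value
    // and returns it as text, assuming the decompressed bytes are UTF-8.
    public static string Decode(byte[] body, string contentEncoding)
    {
        using var input = new MemoryStream(body);
        Stream decoded;
        switch ((contentEncoding ?? "").Trim().ToLowerInvariant())
        {
            case "br":
                decoded = new BrotliStream(input, CompressionMode.Decompress);
                break;
            case "gzip":
                decoded = new GZipStream(input, CompressionMode.Decompress);
                break;
            case "":
            case "identity":
                // No compression applied; read the bytes as they are.
                decoded = input;
                break;
            default:
                throw new NotSupportedException(
                    "Unsupported Content-Encoding: " + contentEncoding);
        }

        using (decoded)
        using (var reader = new StreamReader(decoded, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    // Produces Brotli-compressed bytes so Decode can be tested locally.
    public static byte[] CompressBrotli(string text)
    {
        using var output = new MemoryStream();
        using (var brotli = new BrotliStream(output, CompressionMode.Compress))
        {
            byte[] bytes = Encoding.UTF8.GetBytes(text);
            brotli.Write(bytes, 0, bytes.Length);
        }
        // ToArray still works after the stream has been closed.
        return output.ToArray();
    }
}
```

Calling ContentDecoder.Decode(ContentDecoder.CompressBrotli(html), "br") should give back the original html string. Reading the same compressed bytes with a plain text encoding gives the kind of unreadable output described above.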

hahyes commented 1 year ago

@tranb3r I removed it and it works now? It is... weird? It's not like I got this header from nowhere. I literally copied it from the browser making the request. What's even weirder, Postman sent the same header and it worked.

So the browser and Postman are probably handling the decompression automatically, and RestSharp most likely does too.

Anyway thank you for answer!

@tmenier So it was a programming question after all. Well, I couldn't have predicted that the browser, Postman and RestSharp are doing something extra in the background. Still, a little embarrassing. Sorry for wasting your time, have a nice day!

tmenier commented 1 year ago

No worries. Glad someone was able to help, community at its finest!

mrtabaa commented 8 months ago

Here's my solution for gzip and deflate using AutomaticDecompression:

    using System;
    using System.Net;
    using System.Net.Http;

    string url = "https://my.example.com";

    // Create an instance of HttpClientHandler
    using HttpClientHandler handler = new HttpClientHandler();

    // Let the handler decompress gzip and deflate responses.
    // DecompressionMethods is a flags enum, so values can be combined with |.
    handler.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
    // On runtimes that support them, these can be used instead:
    // handler.AutomaticDecompression = DecompressionMethods.Brotli;
    // handler.AutomaticDecompression = DecompressionMethods.All;

    // Create an instance of HttpClient with the handler
    using HttpClient client = new HttpClient(handler);

    // Send a GET request to a URL that returns compressed content
    using HttpResponseMessage response = await client.GetAsync(url);

    // Check if the response is successful
    if (response.IsSuccessStatusCode)
    {
        // Read the response content as a string
        // The content will be automatically decompressed by the handler
        string content = await response.Content.ReadAsStringAsync();

        Console.WriteLine(content);
    }
    else
    {
        // Handle the error (_logger is assumed to be an ILogger defined elsewhere)
        _logger.LogError($"Error accessing the URL. Is the target server down? Status: {response.StatusCode}");
    }

green-new commented 4 months ago

Holy I'm definitely necro bumping this, I was just trying to find the solution to this SAME EXACT problem I had with Flurl for the past few hours now... Thank GOD google brought me here, please keep this up because this was such a PITA... Just removing the "br" line fixed it. Bless you all.