headzoo / surf

Stateful programmatic web browsing in Go.
MIT License

Add support for recognizing text/csv content-type during download #13

Closed (vkryukov closed 9 years ago)

vkryukov commented 9 years ago

Hello,

I'm using surf to log in to a website and download a CSV report. My problem is that the report is downloaded as an HTML file, not as plain text. After running

err := bow.Open(reportURL)
if err != nil {
    return err
}
f, err := os.Create(output)
if err != nil {
    return err
}
defer f.Close()
fmt.Println(bow.ResponseHeaders())
i, err := bow.Download(f)

I get a file that starts with <html><head></head><body>, has some symbols HTML-escaped, etc.

When I print the headers,

fmt.Println(bow.ResponseHeaders())

I get:

map[Date:[Wed, 18 Feb 2015 02:27:13 GMT] 
Content-Disposition:[attachment; filename=source.csv] 
Content-Type:[text/csv; charset=ISO-8859-1] 
X-Cnection:[close]]

It looks like the text/csv content type is not recognized. When I follow the reportURL link in a browser, the file downloads properly.

Any advice on the best way to download the file properly? Or maybe it's a bug/feature request...

vkryukov commented 9 years ago

Looking at the code, it looks like browser.httpRequest unconditionally calls goquery.NewDocumentFromResponse, which parses the body as HTML. I'm not sure what the best way to override this behavior is. Any ideas?

I could have built an http.Client myself; however, Browser.cookies is not exported, and neither is Browser.buildClient. Maybe we should export the latter for cases such as this?
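
For reference, the "build an http.Client myself" route mentioned above can be sketched with the standard library alone. This is only an illustration, not surf's code: newSessionClient and downloadRaw are hypothetical names, and the sketch assumes the login step is also performed through the same client so that the session cookie ends up in its jar.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/cookiejar"
)

// newSessionClient (hypothetical helper) returns an http.Client that keeps
// cookies between requests, so a session cookie set during login is sent
// again on later requests made with the same client.
func newSessionClient() (*http.Client, error) {
	jar, err := cookiejar.New(nil)
	if err != nil {
		return nil, err
	}
	return &http.Client{Jar: jar}, nil
}

// downloadRaw (hypothetical helper) fetches url with the given client and
// copies the response body to w byte for byte, without parsing it as HTML.
func downloadRaw(client *http.Client, url string, w io.Writer) (int64, error) {
	resp, err := client.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return 0, fmt.Errorf("unexpected status %s", resp.Status)
	}
	return io.Copy(w, resp.Body)
}
```

After logging in through the client returned by newSessionClient, calling downloadRaw(client, reportURL, f) with the file created by os.Create would write the CSV as the server sent it.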

vkryukov commented 9 years ago

Another idea I tried was to use DownloadAsset, but it uses plain http.Get and so cannot leverage cookies already set by the bow. That, BTW, will make it impossible to download assets that require authorization.

groob commented 9 years ago

I'm also having a problem getting a CSV file, but I can't even get as far as @victorkryukov.

func SavedQuery(q int) {
    // Takes a saved query ID from SIS and downloads exported CSV
    query := fmt.Sprintf("https://example.com/esweb.asp?WCI=Results&Query=%v", q)
    err := bow.Open(query)
    if err != nil {
        panic(err)
    }
    // Accessing the exported URL directly does not work.
    // I have to go to the saved query URL first and then click 'Export'.
    bow.DelRequestHeader("Referer")
    err = bow.Click("a:contains(' Export')")
    if err != nil || bow.StatusCode() != 200 {
        panic(err)
    }
    // Next click on "Comma-Delimited Text File"
    bow.DelRequestHeader("Referer")
    err = bow.Click("a:contains('Comma-Delimited Text File')")
    if err != nil || bow.StatusCode() != 200 {
        panic(err)
    }
    // Next click on the link to download CSV
    bow.DelRequestHeader("Referer")
    // f := bow.Links()[0]
    // bow.Download(f.URL)
    err = bow.Click("a:contains('.csv')")
    if err != nil || bow.StatusCode() != 200 {
        // fmt.Println(bow.Body())
        // fmt.Println(bow.StatusCode())
        // fmt.Println(bow.ResponseHeaders())
        // fmt.Println(bow)
        // panic(err)
    }
    fmt.Println(bow.Body())
}

This returns StatusCode 406. Here is the body of the response.

<h1>The resource cannot be displayed</h1>
The page you are looking for cannot be opened by your browser because it has a file name extension that your browser does not accept.
<hr/>
<p>Please try the following:</p>
<ul>
<li>Change the Multipurpose Internet Mail Extensions (MIME) or security settings of your browser to accept the file name extension of the requested page. Note that your browser might currently be configured in a highly secure mode that protects your computer. Please read the Help for your browser before changing any settings.</li>
</ul>
<h2>HTTP Error 406 - Client browser does not accept the MIME type of the requested page.<br/>Internet Information Services (IIS)</h2>
<hr/>
<p>Technical Information (for support personnel)</p>
<ul>
<li>Go to <a href="http://go.microsoft.com/fwlink/?linkid=8180">Microsoft Product Support Services</a> and perform a title search for the words <b>HTTP</b> and <b>406</b>.</li>
<li>Open <b>IIS Help</b>, which is accessible in IIS Manager (inetmgr),
 and search for topics titled <b>Setting Application Mappings</b>, <b>Securing Your Site with Web Site Permissions</b>, and <b>About Custom Error Messages</b>.</li>
</ul>
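
A 406 response generally means the server could not produce a representation matching what the request declared acceptable, typically through the Accept header. Whether that is the cause here is not confirmed anywhere in this thread, so the following is only a sketch of one thing to try: building a request that explicitly lists text/csv as acceptable. It uses net/http directly rather than surf; newCSVRequest and csvAccept are hypothetical names.

```go
package main

import "net/http"

// csvAccept (hypothetical constant) lists text/csv first and falls back to
// any other type with a lower preference.
const csvAccept = "text/csv, */*;q=0.8"

// newCSVRequest (hypothetical helper) builds a GET request whose Accept
// header tells the server that a CSV response is acceptable.
func newCSVRequest(url string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", csvAccept)
	return req, nil
}
```

The request can then be sent with client.Do(req) on whichever http.Client holds the session cookies.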

headzoo commented 9 years ago

Thanks for submitting a ticket. The Content-Disposition:[attachment; filename=source.csv] header instructs the browser to save the page as a file of the type specified by the Content-Type:[text/csv; charset=ISO-8859-1] header. @victorkryukov is right: the Download() method blindly assumes the current page is text/html. In fact, the method doesn't take the response headers into consideration at all.

I'll try to create a fix today.
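
As an illustration of what taking those two headers into consideration could look like, here is a minimal sketch using the standard library's mime.ParseMediaType. It is not the fix that went into surf; attachmentInfo is a hypothetical helper that takes the raw header values as strings.

```go
package main

import "mime"

// attachmentInfo (hypothetical helper) parses the values of the Content-Type
// and Content-Disposition headers. It returns the media type, the charset
// parameter if present, and the suggested filename when the disposition is
// "attachment".
func attachmentInfo(contentType, contentDisposition string) (mediaType, charset, filename string, err error) {
	var params map[string]string
	mediaType, params, err = mime.ParseMediaType(contentType)
	if err != nil {
		return "", "", "", err
	}
	charset = params["charset"]

	var disposition string
	disposition, params, err = mime.ParseMediaType(contentDisposition)
	if err != nil {
		return "", "", "", err
	}
	if disposition == "attachment" {
		filename = params["filename"]
	}
	return mediaType, charset, filename, nil
}
```

With the header values from the report above, this yields the media type text/csv, the charset ISO-8859-1, and the filename source.csv, which is enough to decide that the body should be saved verbatim rather than treated as HTML.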

headzoo commented 9 years ago

This should be fixed in the latest master. The Download() method now writes the raw response body instead of using the value of bow.state.Dom.Html().
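
The idea described above, keeping the raw body and writing it out unchanged, can be sketched roughly as follows. This is not surf's actual implementation: page, readPage, and roundTrip are hypothetical names, and the sketch assumes the whole body fits in memory.

```go
package main

import (
	"bytes"
	"io"
	"strings"
)

// page (hypothetical type) stands in for a browser's per-response state.
type page struct {
	raw []byte // response body exactly as received
}

// readPage buffers the whole body once, so the same bytes can later be fed
// to an HTML parser and also written out verbatim.
func readPage(body io.Reader) (*page, error) {
	raw, err := io.ReadAll(body)
	if err != nil {
		return nil, err
	}
	return &page{raw: raw}, nil
}

// Download writes the buffered bytes to w unchanged and reports how many
// bytes were written.
func (p *page) Download(w io.Writer) (int64, error) {
	return io.Copy(w, bytes.NewReader(p.raw))
}

// roundTrip reads s as a response body and downloads it again, returning
// what Download produced.
func roundTrip(s string) (string, error) {
	p, err := readPage(strings.NewReader(s))
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if _, err := p.Download(&sb); err != nil {
		return "", err
	}
	return sb.String(), nil
}
```

Because Download never touches a parsed document, characters such as & and < in a CSV body come out exactly as they went in.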

vkryukov commented 9 years ago

Hi @headzoo - I can confirm that my issue is fully resolved now. Many thanks!

headzoo commented 9 years ago

@victorkryukov - Thank you!