hrbrmstr / curlconverter

:curly_loop: :arrow_right: :heavy_minus_sign: Translate cURL command lines into parameters for use with httr or actual httr calls (R)
http://rud.is/b/2016/02/10/craft-httr-calls-cleverly-with-curlconverter/

question.. should curlconverter::straighten() fail if curlconverter isn't attached? thanks #15

Open ajdamico opened 7 years ago

ajdamico commented 7 years ago
browserGET <- "curl 'http://www.worldvaluessurvey.org/WVSDocumentationWV4.jsp' -H 'Host: www.worldvaluessurvey.org' -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:49.0) Gecko/20100101 Firefox/49.0' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Accept-Language: en-US,en;q=0.5' --compressed -H 'Connection: keep-alive' -H 'Upgrade-Insecure-Requests: 1'"

# fails
curlconverter::straighten( browserGET )

# works
library(curlconverter)
straighten( browserGET )
hrbrmstr commented 7 years ago

hrm. .onAttach() does not get called when you do that, and that's where V8 gets initialized. However, I agree that this should work, and it should be as simple as testing whether the pkg global has been initialized when that function is called.

Huge thanks for finding this edge case. I'll try to get a patch on github tonight.
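
A minimal sketch of the kind of check described above, under stated assumptions: `.state`, `init_engine()` and `get_engine()` are hypothetical names, and `init_engine()` is only a stand-in for whatever sets up the V8 context. This is not curlconverter's actual code. The idea is to initialize lazily on first use, so it no longer matters whether the package was attached:

```r
# Package-private environment that holds lazily created state
# (hypothetical name, for illustration only).
.state <- new.env(parent = emptyenv())

# Counter used only to show how often initialization runs.
init_count <- 0

# Stand-in for the expensive setup (e.g. creating a JS context).
init_engine <- function() {
  init_count <<- init_count + 1
  list(ready = TRUE)
}

# Return the shared engine, creating it the first time it is needed.
get_engine <- function() {
  if (is.null(.state$engine)) {
    .state$engine <- init_engine()
  }
  .state$engine
}

# Any exported function calls get_engine() instead of relying on
# .onAttach() having run.
first  <- get_engine()
second <- get_engine()
```

Another option is to move the setup into `.onLoad()`, which R runs when the namespace is loaded (including via `::`), whereas `.onAttach()` only runs on `library()`.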

ajdamico commented 7 years ago

hi, thanks. i guess i'll go with this workaround to eliminate the cran build note until you push the next version to cran :)

https://github.com/ajdamico/lodown/commit/512ed291b126f29f11010e67b3a9d1f1d76b2a7c

thank you for making this possible

# automatically load the world values survey
devtools::install_github("ajdamico/lodown")
library(lodown)
lodown( "wvs" , output_dir = "C:/My Directory/WVS" )
hrbrmstr commented 7 years ago

OH wait. I get the use-case you're doing now. You really don't need to use curlconverter in a pkg that way. If you do just straighten():

library(curlconverter)

browserGET <- "curl 'http://www.worldvaluessurvey.org/WVSDocumentationWV4.jsp' -H 'Host: www.worldvaluessurvey.org' -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:49.0) Gecko/20100101 Firefox/49.0' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Accept-Language: en-US,en;q=0.5' --compressed -H 'Connection: keep-alive' -H 'Upgrade-Insecure-Requests: 1'"

you get back a list:

str(straighten(browserGET))
## List of 1
##  $ :List of 5
##   ..$ url      : chr "http://www.worldvaluessurvey.org/WVSDocumentationWV4.jsp"
##   ..$ method   : chr "get"
##   ..$ headers  :List of 6
##   .. ..$ Host                     : chr "www.worldvaluessurvey.org"
##   .. ..$ User-Agent               : chr "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:49.0) Gecko/20100101 Firefox/49.0"
##   .. ..$ Accept                   : chr "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
##   .. ..$ Accept-Language          : chr "en-US,en;q=0.5"
##   .. ..$ Connection               : chr "keep-alive"
##   .. ..$ Upgrade-Insecure-Requests: chr "1"
##   ..$ url_parts:List of 9
##   .. ..$ scheme  : chr "http"
##   .. ..$ hostname: chr "www.worldvaluessurvey.org"
##   .. ..$ port    : NULL
##   .. ..$ path    : chr "WVSDocumentationWV4.jsp"
##   .. ..$ query   : NULL
##   .. ..$ params  : NULL
##   .. ..$ fragment: NULL
##   .. ..$ username: NULL
##   .. ..$ password: NULL
##   .. ..- attr(*, "class")= chr [1:2] "url" "list"
##   ..$ orig_curl: chr "curl 'http://www.worldvaluessurvey.org/WVSDocumentationWV4.jsp' -H 'Host: www.worldvaluessurvey.org' -H 'User-Agent: Mozilla/5."| __truncated__
##   ..- attr(*, "class")= chr [1:2] "cc_obj" "list"
##  - attr(*, "class")= chr [1:2] "cc_container" "list"

Which means you can either use dput() to capture that structure or saveRDS() to turn it into an R data file which you can have auto-loaded in your pkg.
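
A rough illustration of both options, using a hand-built list as a stand-in for the `straighten()` result so it runs without curlconverter or a network connection (the `parsed` object and the temp-file path are assumptions for the example):

```r
# Stand-in for a parsed request; the real object has more fields.
parsed <- list(
  url     = "http://www.worldvaluessurvey.org/WVSDocumentationWV4.jsp",
  method  = "get",
  headers = list(Host = "www.worldvaluessurvey.org")
)

# Option 1: print R source that recreates the object, ready to paste
# into a package file.
dput(parsed)

# Option 2: serialize to an .rds file and read it back later.
rds_path <- tempfile(fileext = ".rds")
saveRDS(parsed, rds_path)
restored <- readRDS(rds_path)
```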

You're probably going the next step and doing a make_req():

straighten(browserGET) %>%
  make_req() -> req

One thing that I've been struggling to make clearer is that immediately after make_req() is called, the contents (source code) of the function it creates are placed on the clipboard, i.e. if you cmd-v (mac) or ctrl-v (win) in the editor you'll get the source code for the function placed right where the cursor is. In this case:

httr::VERB(verb = "GET", url = "http://www.worldvaluessurvey.org/WVSDocumentationWV4.jsp", 
    httr::add_headers(Host = "www.worldvaluessurvey.org", 
        `User-Agent` = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:49.0) Gecko/20100101 Firefox/49.0", 
        Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", 
        `Accept-Language` = "en-US,en;q=0.5", 
        Connection = "keep-alive", 
        `Upgrade-Insecure-Requests` = "1"))

You could also get that by just typing req[[1]] (no parens) at the R console:

function () 
httr::VERB(verb = "GET", url = "http://www.worldvaluessurvey.org/WVSDocumentationWV4.jsp", 
    httr::add_headers(Host = "www.worldvaluessurvey.org", `User-Agent` = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:49.0) Gecko/20100101 Firefox/49.0", 
        Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", 
        `Accept-Language` = "en-US,en;q=0.5", Connection = "keep-alive", 
        `Upgrade-Insecure-Requests` = "1"))
<environment: 0x10675c0d8>

That adds some cruft, which is why I made it "auto copy to clipboard".
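
If the clipboard is not available, base R can also recover just the call without the `function ()` header and environment line. A small sketch, assuming `f` stands in for `req[[1]]` (the function is only defined here, never called, so httr does not need to be installed):

```r
# Stand-in for the function that make_req() returns.
f <- function() httr::VERB(verb = "GET", url = "http://example.com")

# body() extracts the call; deparse() turns it back into source text.
src <- deparse(body(f))
cat(src, sep = "\n")
```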

That particular curl translation can be simplified to the following (when I do this for my own projects I iteratively remove individual cookies and headers until I get the minimum viable httr verb call I can):

GET(url="http://www.worldvaluessurvey.org/WVSDocumentationWV4.jsp")
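
The iterative pruning mentioned above could be sketched roughly like this. `minimize_headers()` and `fake_check()` are hypothetical helpers, not part of curlconverter or httr; in real use the check would issue the request and inspect the response, whereas here it is faked so the example runs offline:

```r
# Greedily drop headers one at a time, keeping a removal only if the
# caller-supplied check says the request still works without it.
minimize_headers <- function(headers, still_works) {
  keep <- headers
  for (nm in names(headers)) {
    candidate <- keep[setdiff(names(keep), nm)]
    if (still_works(candidate)) keep <- candidate
  }
  keep
}

# Fake check for illustration: pretend the server only needs Host.
fake_check <- function(h) "Host" %in% names(h)

headers <- c(
  Host = "www.worldvaluessurvey.org",
  Connection = "keep-alive",
  `Upgrade-Insecure-Requests` = "1"
)

minimal <- minimize_headers(headers, fake_check)
```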

I'm still going to make straighten() work via :: calling but I wanted to make sure you knew ^^ since it's unlikely you really do need to use curlconverter within a pkg.

hrbrmstr commented 7 years ago

I think you're going to need to use a different target. A great deal of the content on that page is dynamically loaded at run-time and the center column (which has the citation and data files) that you want to target is also an iframe:

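[image: image] https://cloud.githubusercontent.com/assets/509878/21983556/ed7a6810-dbbf-11e6-81b9-c4e7f93dd2d3.png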

(apologies for the faint highlighting due to the dark theme but it should be visible).

The next problem is that more of the content is loaded via another call to a javascript file:

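[image: image] https://cloud.githubusercontent.com/assets/509878/21983640/483b000c-dbc0-11e6-8743-374daf6c6c50.png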

And, your final problem is that the js file in ^^ loads the actual content but:

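[image: image] https://cloud.githubusercontent.com/assets/509878/21983687/7e8e7468-dbc0-11e6-9e7a-f72b106ad789.png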

All of the hrefs are wrapped in a call to DocDownloadLicense() which dynamically builds the form you're probably familiar with:

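[image: image] https://cloud.githubusercontent.com/assets/509878/21983774/d58ab39e-dbc0-11e6-9c53-7fcd71a139ad.png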

Without something like RSelenium or seleniumPipes you're not going to be able to automate this, and you can't embed either in an R package since you need a back-end selenium grid, standalone selenium server or phantomjs running live to do the work.
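
One quick way to spot this situation before writing a scraper is to look at the static HTML for iframes and script includes. A base-R sketch on a made-up HTML string (`html` and `extract_attr()` are assumptions for illustration; a real HTML parser would be more robust than regular expressions):

```r
# Made-up page: the interesting content lives in an iframe and a script.
html <- paste0(
  '<html><body><div id="left">menu</div>',
  '<iframe src="inner.jsp" id="center"></iframe>',
  '<script src="load.js"></script></body></html>'
)

# Pull one attribute out of every opening tag of a given type.
extract_attr <- function(html, tag, attr) {
  tag_pattern <- sprintf("<%s\\b[^>]*>", tag)
  tags <- regmatches(html, gregexpr(tag_pattern, html, perl = TRUE))[[1]]
  attr_pattern <- sprintf('%s\\s*=\\s*"([^"]*)"', attr)
  hits <- regmatches(tags, regexpr(attr_pattern, tags, perl = TRUE))
  sub(attr_pattern, "\\1", hits, perl = TRUE)
}

iframe_src <- extract_attr(html, "iframe", "src")
script_src <- extract_attr(html, "script", "src")
```

If the data you want is not in the static HTML but these vectors are non-empty, the content is likely being pulled in by one of those resources.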

ajdamico commented 7 years ago

i think the current version of lodown works without issue?

(emailed reply to the preceding comment by "boB Rudis", sent Jan 16, 2017 12:56 PM)