tidyverse / rvest

Simple web scraping for R
https://rvest.tidyverse.org

submit_form fails when form button has no name #278

Closed nalimilan closed 3 years ago

nalimilan commented 4 years ago

I've hit a website where submit_form fails because the form has a single button with no name. As a result it sends cu=XX&mp=YY&NULL=OK, whereas sending just cu=XX&mp=YY works. Below is a reproducer, followed by an illustration of how the request succeeds when calling request_POST directly without the &NULL=OK part.

> library(rvest) 
> 
> pgsession <- html_session("https://annuaire.cnrs.fr/")
> 
> pgform <- html_form(pgsession)[[1]]
> filled_form <- set_values(pgform, cu="xxyr", mp="jkjlkjlkj")
> submit_form(pgsession, filled_form)
Submitting with 'NULL'
<session> https://annuaire.cnrs.fr/l3c/owa/labintel.controle
  Status: 404
  Type:   text/html; charset=iso-8859-1
  Size:   208
Warning message:
In request_POST(session, url = url, body = request$values, encode = request$encode,  :
  Not Found (HTTP 404).
> 
> # Adjusting request by hand
> x <- rvest:::submit_request(filled_form)
Submitting with 'NULL'
> x$values
$cu
[1] "xxyr"

$mp
[1] "jkjlkjlkj"

$`NULL`
[1] "OK"

> rvest:::request_POST(pgsession, url="https://annuaire.cnrs.fr/l3c/owa/labintel.controle",
+                      body=x$values)
<session> https://annuaire.cnrs.fr/l3c/owa/labintel.controle
  Status: 404
  Type:   text/html; charset=iso-8859-1
  Size:   208
Warning message:
In rvest:::request_POST(pgsession, url = "https://annuaire.cnrs.fr/l3c/owa/labintel.controle",  :
  Not Found (HTTP 404).
> 
> # Drop last value (`NULL`="OK")
> x$values <- x$values[1:2]
> rvest:::request_POST(pgsession, url="https://annuaire.cnrs.fr/l3c/owa/labintel.controle",
+                      body=x$values)
<session> https://annuaire.cnrs.fr/lc/erreur_ident.html
  Status: 200
  Type:   text/html; charset=ISO-8859-1
  Size:   3019

lifesabirch commented 4 years ago

An input[@type="submit"] is not required to have a name attribute: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/input/submit

Potentially change line 112 of https://github.com/tidyverse/rvest/blob/master/R/form.R to use the XPath below instead of checking for the node name "input": $x('//input[@type!="submit"]')

hadley commented 3 years ago

Could you please rework your reproducible example to use the reprex package? That makes it easier to see both the input and the output, formatted in such a way that I can easily re-run it in a local session.

lifesabirch commented 3 years ago

Hadley,

The assumption that all input tags have name attributes, as in <input name="namiverse">, is the root cause of this issue. However, when the type attribute is submit, i.e., <input type="submit">, there need not be a name attribute (see MDN).

Currently, on line 112 of https://github.com/tidyverse/rvest/blob/master/R/form.R, the check is that the tag is input, and the code then assumes there is a name attribute, whereas an input tag whose type attribute is submit may have no name attribute. This XPath excludes the input tags whose type attribute equals submit: '//input[@type!="submit"]'

Instead of stopifnot(inherits(input, "xml_node"), xml2::xml_name(input) == "input") do stopifnot(inherits(input, "xml_node"), is.na(xml2::xml_find_first(input, '//input[@type!="submit"]')))
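
For illustration only, here is a minimal sketch of the general idea of skipping unnamed controls when building a request body. It is not rvest's actual code or the eventual fix. The HTML snippet, the /login action, and the variable names are hypothetical, and the field names cu and mp are borrowed from the report above. It filters on whether a name attribute is present rather than on the type, since in XPath 1.0 the test @type!="submit" is also false for inputs that have no type attribute at all.

```r
library(xml2)

# Hypothetical minimal form mirroring the report: two named fields and
# a submit button without a name attribute.
html <- read_html('
  <form action="/login" method="post">
    <input type="text" name="cu" value="XX">
    <input type="password" name="mp" value="YY">
    <input type="submit" value="OK">
  </form>')

inputs <- xml_find_all(html, "//form//input")

# xml_attr() returns NA for a missing attribute, so unnamed controls
# can be dropped before the request body is assembled.
field_names <- xml_attr(inputs, "name")
named <- inputs[!is.na(field_names)]

# Named list of values, keyed by each control's name attribute.
body <- as.list(xml_attr(named, "value"))
names(body) <- xml_attr(named, "name")

str(body)
```

With this filtering, the unnamed submit button contributes nothing to the body, which corresponds to the hand-adjusted request in the original report.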

hadley commented 3 years ago

@lifesabirch a reprex is very useful because I can then turn that into a test to make sure this issue doesn't happen again. Alternatively, would you be interested in doing a PR?

lifesabirch commented 3 years ago

Hadley,

I'd be happy to do both. I'll learn and do reprex tomorrow night!

hadley commented 3 years ago

I think this should be resolved now, as part of a general revamping of the form code. Please let me know if not.

nalimilan commented 3 years ago

Thanks!