tidyverse / rvest

Simple web scraping for R
https://rvest.tidyverse.org

html_text2 deletes some spaces between words #372

Open mayeulk opened 1 year ago

mayeulk commented 1 year ago

In some cases, html_text2 deletes some standard spaces between words.

The reproducible example follows:

some_html <- '<p dir="ltr" style="text-align:left;"></p><span style="font-size:0.9375rem;">The sentence starts this way,</span><span style="font-size:0.9375rem;"> </span><span style="font-size:0.9375rem;">then</span><span style="font-size:0.9375rem;"> </span><span style="font-size:0.9375rem;">spaces</span><span style="font-size:0.9375rem;"> </span><span style="font-size:0.9375rem;">disappear</span>'
html_text(read_html(some_html)) # is correct
html_text2(read_html(some_html))  # not correct

The incorrect result is: "The sentence starts this way,thenspacesdisappear"

html_text() works correctly, but in most cases I need the extra features of html_text2() (new lines...).

I'm using rvest_1.0.3 and xml2_1.3.3 in R 4.2.2 (Kubuntu 23.04). Note: the original HTML string comes from a rich text area of a Moodle Database activity (see https://docs.moodle.org/402/en/Database_activity), exported from Moodle as a LibreOffice .ods file.

mayeulk commented 1 year ago

Interestingly, removing the first empty paragraph allows a correct conversion:

some_html <- '<p style="text-align:left;"></p><span style="font-size:0.9375rem;">The sentence starts this way,</span><span style="font-size:0.9375rem;"> </span><span style="font-size:0.9375rem;">then</span><span style="font-size:0.9375rem;"> </span><span style="font-size:0.9375rem;">spaces</span><span> </span><span style="font-size:0.9375rem;">disappear</span>'
html_text2(read_html(some_html)) # is not correct

some_html2 <- '<span style="font-size:0.9375rem;">The sentence starts this way,</span><span style="font-size:0.9375rem;"> </span><span style="font-size:0.9375rem;">then</span><span style="font-size:0.9375rem;"> </span><span style="font-size:0.9375rem;">spaces</span><span> </span><span style="font-size:0.9375rem;">disappear</span>'
html_text2(read_html(some_html2)) # is correct

hadley commented 1 year ago

Can you please provide a minimal reprex (reproducible example)? The goal of a reprex is to make it as easy as possible for me to recreate your problem so that I can fix it: please help me help you! If you've never heard of a reprex before, start by reading about the reprex package, including the advice further down the page. Please make sure your reprex is created with the reprex package as it gives nicely formatted output and avoids a number of common pitfalls.

mayeulk commented 1 year ago

Hi, trying my best. But reprex changes single quotes to double quotes and then adds a lot of escapes, which, in my view, might make the example harder, not easier, to use. Anyway, here is what I came up with:

library(rvest)
some_html <- "<p dir=\"ltr\" style=\"text-align:left;\"></p><span style=\"font-size:0.9375rem;\">The sentence starts this way,</span><span style=\"font-size:0.9375rem;\"> </span><span style=\"font-size:0.9375rem;\">then</span><span style=\"font-size:0.9375rem;\"> </span><span style=\"font-size:0.9375rem;\">spaces</span><span style=\"font-size:0.9375rem;\"> </span><span style=\"font-size:0.9375rem;\">disappear</span>"
html_text2(read_html(some_html))

Created on 2023-08-08 with reprex v2.0.2
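
On the escaping concern: a minimal sketch, assuming R 4.0.0 or later, of writing the same HTML as a raw string so that no backslash escapes are needed. The shortened HTML fragment is illustrative only, and whether reprex leaves a raw string untouched has not been checked here.

```r
# Three spellings of the same string; the HTML is a shortened,
# illustrative fragment rather than the full example from the issue.
single_quoted <- '<p dir="ltr"></p><span> </span>'
escaped       <- "<p dir=\"ltr\"></p><span> </span>"
raw_string    <- r"(<p dir="ltr"></p><span> </span>)"  # raw string, R >= 4.0.0

# All three produce identical character vectors.
print(identical(single_quoted, escaped))
print(identical(escaped, raw_string))
```

The escaped form is noisier to read, but it denotes exactly the same string, so the reprex remains valid either way.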

"library(rvest)\n\n         \n         html_text2(read_html(some_html))\n         "
#> [1] "library(rvest)\n\n         \n         html_text2(read_html(some_html))\n         "

Created on 2023-08-08 with reprex v2.0.2

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.2 Patched (2022-11-10 r83330)
#>  os       Ubuntu 23.04
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language fr:en_US
#>  collate  fr_FR.UTF-8
#>  ctype    fr_FR.UTF-8
#>  tz       Europe/Paris
#>  date     2023-08-08
#>  pandoc   3.1.2 @ /usr/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  cli           3.6.1   2023-03-23 [2] CRAN (R 4.2.2)
#>  digest        0.6.33  2023-07-07 [2] CRAN (R 4.2.2)
#>  evaluate      0.21    2023-05-05 [2] CRAN (R 4.2.2)
#>  fastmap       1.1.1   2023-02-24 [2] CRAN (R 4.2.2)
#>  fs            1.6.2   2023-04-25 [2] CRAN (R 4.2.2)
#>  glue          1.6.2   2022-02-24 [2] CRAN (R 4.2.2)
#>  htmltools     0.5.5   2023-03-23 [2] CRAN (R 4.2.2)
#>  knitr         1.43    2023-05-25 [2] CRAN (R 4.2.2)
#>  lifecycle     1.0.3   2022-10-07 [2] CRAN (R 4.2.2)
#>  reprex        2.0.2   2022-08-17 [2] CRAN (R 4.2.2)
#>  rlang         1.1.1   2023-04-28 [2] CRAN (R 4.2.2)
#>  rmarkdown     2.23    2023-07-01 [2] CRAN (R 4.2.2)
#>  rstudioapi    0.15.0  2023-07-07 [2] CRAN (R 4.2.2)
#>  sessioninfo   1.2.2   2021-12-06 [2] CRAN (R 4.2.2)
#>  withr         2.5.0   2022-03-03 [2] CRAN (R 4.2.2)
#>  xfun          0.39    2023-04-20 [2] CRAN (R 4.2.2)
#>  yaml          2.3.7   2023-01-23 [2] CRAN (R 4.2.2)
#>
#>  [1] /home/mk/R/x86_64-pc-linux-gnu-library/4.2
#>  [2] /usr/local/lib/R/site-library
#>  [3] /usr/lib/R/site-library
#>  [4] /usr/lib/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```

hadley commented 1 year ago

The attributes don't seem to be necessary to illustrate the problem, leading to this similar reprex:

library(rvest)
some_html <- "<p></p><span>The sentence starts this way,</span><span> </span><span>then</span><span> </span><span>spaces</span><span> </span><span>disappear</span>"
html_text2(read_html(some_html))
#> [1] "The sentence starts this way,thenspacesdisappear"

Created on 2023-08-09 with reprex v2.0.2

And we can make it much easier to see what's going on by adding some newlines:

library(rvest)
some_html <- "
  <p></p>
  <span>The sentence starts this way,</span>
  <span> </span>
  <span>then</span>
  <span> </span>
  <span>spaces</span>
  <span> </span>
  <span>disappear</span>
"
html_text2(read_html(some_html))
#> [1] "The sentence starts this way,thenspacesdisappear"

Created on 2023-08-09 with reprex v2.0.2

The key problem appears to be the early closing of the <p> tag. When I fix that, the problem goes away:

library(rvest)
some_html <- "
<p>
  <span>The sentence starts this way,</span>
  <span> </span>
  <span>then</span>
  <span> </span>
  <span>spaces</span>
  <span> </span>
  <span>disappear</span>
</p>
"
html_text2(read_html(some_html))
#> [1] "The sentence starts this way, then spaces disappear"

Created on 2023-08-09 with reprex v2.0.2

Looking at the code, it seems like the problem probably arises if you have inline elements following a block element. Fixing that will require some careful thought.

tinygreen commented 10 months ago

> Looking at the code, it seems like the problem probably arises if you have inline elements following a block element. Fixing that will require some careful thought.

The problem arises if there are inline and block elements mixed on the same level, regardless of which comes first. Then is_inline() returns false and the elements are parsed as if they were all block elements. This means that all elements are passed to collapse_whitespace() individually. Without the block element, is_inline() returns true and the contents are passed to html_text_inline(), which correctly collapses whitespace after pasting the inline elements together. However, as html_text_inline() ignores <br> tags inside inline elements (issue #351), that function should be changed too.
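
The difference between the two code paths described above can be imitated in base R without rvest. This is only a sketch: `collapse_ws()` is a hypothetical stand-in for rvest's internal whitespace collapsing (assumed here to squeeze runs of whitespace and trim the ends), and `pieces` simply lists the text of each sibling `<span>` from the reprex.

```r
# Hypothetical stand-in for whitespace collapsing: squeeze runs of
# whitespace into one space, then trim both ends.
collapse_ws <- function(x) trimws(gsub("[[:space:]]+", " ", x))

# Text content of the sibling <span> elements from the reprex.
pieces <- c("The sentence starts this way,", " ", "then", " ",
            "spaces", " ", "disappear")

# Block-style path: each piece is collapsed on its own, so the
# whitespace-only pieces become "" before anything is joined.
per_element <- paste0(vapply(pieces, collapse_ws, character(1)), collapse = "")

# Inline-style path: paste the pieces together first, then collapse once.
joined_first <- collapse_ws(paste0(pieces, collapse = ""))

print(per_element)
#> [1] "The sentence starts this way,thenspacesdisappear"
print(joined_first)
#> [1] "The sentence starts this way, then spaces disappear"
```

Under these assumptions, the first result matches the incorrect output reported in this issue and the second matches the expected text, which is consistent with the order of pasting and collapsing being what matters.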

An element could also contain several block elements with text nodes and inline elements in between. In that case, all non-block nodes between two block nodes should be passed together through collapse_whitespace() before being added to the text buffer.
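
A rough base-R sketch of that grouping idea follows. It is not rvest's implementation: the `text` and `is_block` vectors are hypothetical flattened children of one element, `collapse_ws()` is the same assumed stand-in for whitespace collapsing, and joining the resulting chunks with a newline is an assumption about how block boundaries would be rendered.

```r
# Hypothetical stand-in for whitespace collapsing.
collapse_ws <- function(x) trimws(gsub("[[:space:]]+", " ", x))

# Hypothetical children of one element: their text and whether each
# child is a block-level node.
text     <- c("Intro", "The sentence starts this way,", " ", "then",
              "Block two", "tail", " ", "text")
is_block <- c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)

# A new group starts at every block node and at the first node after a
# block node, so each run of non-block nodes shares one group id while
# every block node gets a group of its own.
starts_group <- is_block | c(TRUE, head(is_block, -1))
group <- cumsum(starts_group)

# Paste each group together first, then collapse whitespace once per group.
chunks <- vapply(
  split(text, group),
  function(x) collapse_ws(paste0(x, collapse = "")),
  character(1)
)

# Assumption: chunks are separated by a newline; empty chunks are dropped.
out <- paste(chunks[nzchar(chunks)], collapse = "\n")
print(out)
```

Because whitespace-only nodes are pasted into their neighbours before collapsing, the spaces between "way," and "then" and between "tail" and "text" are kept in this sketch.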