mayeulk opened this issue 1 year ago (status: Open)
Interestingly, removing the first empty paragraph allows a correct conversion:
some_html <- '<p style="text-align:left;"></p><span style="font-size:0.9375rem;">The sentence starts this way,</span><span style="font-size:0.9375rem;"> </span><span style="font-size:0.9375rem;">then</span><span style="font-size:0.9375rem;"> </span><span style="font-size:0.9375rem;">spaces</span><span> </span><span style="font-size:0.9375rem;">disappear</span>'
html_text2(read_html(some_html)) # is not correct
some_html2 <- '<span style="font-size:0.9375rem;">The sentence starts this way,</span><span style="font-size:0.9375rem;"> </span><span style="font-size:0.9375rem;">then</span><span style="font-size:0.9375rem;"> </span><span style="font-size:0.9375rem;">spaces</span><span> </span><span style="font-size:0.9375rem;">disappear</span>'
html_text2(read_html(some_html2)) # is correct
Can you please provide a minimal reprex (reproducible example)? The goal of a reprex is to make it as easy as possible for me to recreate your problem so that I can fix it: please help me help you! If you've never heard of a reprex before, start by reading about the reprex package, including the advice further down the page. Please make sure your reprex is created with the reprex package as it gives nicely formatted output and avoids a number of common pitfalls.
Hi, trying my best. But reprex changes single quotes to double quotes and then adds a lot of escapes, which, in my view, might make it harder, not easier, to use. Anyway, here is what I came up with:
library(rvest)
some_html <- "<p dir=\"ltr\" style=\"text-align:left;\"></p><span style=\"font-size:0.9375rem;\">The sentence starts this way,</span><span style=\"font-size:0.9375rem;\"> </span><span style=\"font-size:0.9375rem;\">then</span><span style=\"font-size:0.9375rem;\"> </span><span style=\"font-size:0.9375rem;\">spaces</span><span style=\"font-size:0.9375rem;\"> </span><span style=\"font-size:0.9375rem;\">disappear</span>"
html_text2(read_html(some_html))
Created on 2023-08-08 with reprex v2.0.2
"library(rvest)\n\n \n html_text2(read_html(some_html))\n "
#> [1] "library(rvest)\n\n \n html_text2(read_html(some_html))\n "
Created on 2023-08-08 with reprex v2.0.2
The attributes don't seem to be necessary to illustrate the problem, leading to this similar reprex:
library(rvest)
some_html <- "<p></p><span>The sentence starts this way,</span><span> </span><span>then</span><span> </span><span>spaces</span><span> </span><span>disappear</span>"
html_text2(read_html(some_html))
#> [1] "The sentence starts this way,thenspacesdisappear"
Created on 2023-08-09 with reprex v2.0.2
And we can make it much easier to see what's going on by adding some newlines:
library(rvest)
some_html <- "
<p></p>
<span>The sentence starts this way,</span>
<span> </span>
<span>then</span>
<span> </span>
<span>spaces</span>
<span> </span>
<span>disappear</span>
"
html_text2(read_html(some_html))
#> [1] "The sentence starts this way,thenspacesdisappear"
Created on 2023-08-09 with reprex v2.0.2
The key problem appears to be the early closing of the <p> tag. When I fix that the problem goes away:
library(rvest)
some_html <- "
<p>
<span>The sentence starts this way,</span>
<span> </span>
<span>then</span>
<span> </span>
<span>spaces</span>
<span> </span>
<span>disappear</span>
</p>
"
html_text2(read_html(some_html))
#> [1] "The sentence starts this way, then spaces disappear"
Created on 2023-08-09 with reprex v2.0.2
Looking at the code, it seems like the problem probably arises if you have inline elements following a block element. Fixing that will require some careful thought.
The problem arises if there are inline and block elements mixed on the same level, regardless of which comes first. Then is_inline() returns false and the elements are parsed as if they were all block elements. This means that all elements are passed to collapse_whitespace() individually. Without the block element, is_inline() returns true and the contents are passed to html_text_inline(), which correctly collapses whitespace after pasting the inline elements together. However, as html_text_inline() ignores <br> tags inside inline elements (issue #351), that function should be changed too.
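To make the difference between the two code paths concrete, here is a toy model in base R. It is only a sketch, not rvest's actual code: collapse_ws() is a hypothetical stand-in for collapse_whitespace(), and I am assuming it squeezes runs of whitespace and trims both ends.

```r
# Toy model in base R; this is NOT rvest's actual implementation.
# collapse_ws() is a hypothetical stand-in for collapse_whitespace():
# it squeezes runs of whitespace into one space and trims both ends.
collapse_ws <- function(x) {
  trimws(gsub("[[:space:]]+", " ", x))
}

# Text content of the sibling <span> elements from the reprex.
spans <- c("The sentence starts this way,", " ", "then", " ",
           "spaces", " ", "disappear")

# Block-style handling: collapse each element on its own, then join.
# Whitespace-only spans become "" so the separating spaces are lost.
per_element <- paste(collapse_ws(spans), collapse = "")

# Inline-style handling: join first, then collapse once.
joined_first <- collapse_ws(paste(spans, collapse = ""))

print(per_element)
print(joined_first)
```

Under those assumptions, the first result reproduces the run-together sentence from the reprex and the second keeps the spaces, which matches the behaviour described above.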
An element could also contain several block elements with text nodes and inline elements in between. In that case, all non-block nodes between two block nodes should be passed together through collapse_whitespace() before being added to the text buffer.
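A rough sketch of that grouping idea, again in base R and not based on rvest internals. The children data frame, its is_block flag, and collapse_ws() are all hypothetical names made up for illustration; real code would walk xml2 nodes instead.

```r
# Hypothetical sketch of the proposed grouping; not rvest code.
collapse_ws <- function(x) trimws(gsub("[[:space:]]+", " ", x))

# Flattened children of one element: each has its text and a block flag.
children <- data.frame(
  text = c("Intro", " ", "text", "First paragraph", "after", " ", "block"),
  is_block = c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Assign run ids: a new run starts whenever is_block changes, and every
# block node gets its own run so adjacent blocks are never merged.
n <- nrow(children)
changed <- c(TRUE, children$is_block[-1] != children$is_block[-n])
run_id <- cumsum(changed | children$is_block)

# Paste each run together first, then collapse whitespace once per run.
buffer <- vapply(
  split(children$text, run_id),
  function(txt) collapse_ws(paste(txt, collapse = "")),
  character(1)
)

# Drop runs that were whitespace only.
buffer <- unname(buffer[nzchar(buffer)])
print(buffer)
```

Here the three non-block nodes before the block are collapsed as one unit, so the space between "Intro" and "text" survives, and likewise for the nodes after the block. How the resulting pieces are joined (newlines, padding) is left out of the sketch.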
In some cases, html_text2 deletes some standard spaces between words.
The reproducible example follows:
The incorrect result is: "The sentence starts this way,thenspacesdisappear"
html_text() works correctly, but in most cases I do need the power of html_text2() (new lines...).