jhy / jsoup

jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.
https://jsoup.org
MIT License

Matching error with getElementsMatchingText() #2163

Closed nbxiglk0 closed 4 months ago

nbxiglk0 commented 4 months ago

Hi, when I try to get elements using a regex pattern, the matching result is not what I expect. For example, this is my test code:

import java.io.IOException;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class HtmlParse {

    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("http://127.0.0.1:8090/test.php?q=aa&w=test&e=aa&r=aaa&t=aa").get();
        String html = doc.html();
        System.out.println(html);
        // Regex as originally written: it expects literal '>' and '<' around the text.
        final String regex = ">.*test.*<";
        final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
        Elements elements = doc.getElementsMatchingText(pattern);
        if (!elements.isEmpty()) {
            elements.forEach(System.out::println);
        } else {
            System.out.println("Didn't find any element");
        }
    }
}

The HTML response is:

<html>
 <head></head>
 <body>
  <a href="?q=1&amp;w=2&amp;e=3&amp;r=4&amp;t=5"></a>
  <script>
var a = "aa";</script> aa
  <div>
   <textarea>test</textarea>
  </div><input style="color:aa" value="aaa"> <!--
        this is comment
        aa        -->
 </body>
</html>

I want to get the textarea element by matching ">.*test.*<", but I got nothing. Is there anything wrong with the getElementsMatchingText method?

jhy commented 4 months ago

Well, note that getElementsMatchingText runs the regex against the parsed [text()](https://jsoup.org/apidocs/org/jsoup/nodes/Element.html#text()) of elements, not the original HTML source.

In your parsed DOM tree, you have element nodes (e.g. textarea) which contain text nodes (e.g. test). The text being matched therefore contains no > or < characters, so the >.*test.*< regex has nothing to match.
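To illustrate that point without touching the network, here is a minimal sketch using only java.util.regex. The class name RegexAgainstText and the helper found are made up for this example, the string "test" stands in for what text() would yield for the textarea, and the helper assumes a find()-style search rather than a whole-string match.

```java
import java.util.regex.Pattern;

public class RegexAgainstText {
    // Hypothetical helper: true if the regex is found anywhere in the input
    // (assumes a find()-style search, not a whole-string match).
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String source = "<textarea>test</textarea>"; // raw HTML markup
        String text = "test"; // stand-in for the element's parsed text

        // The original regex needs literal '>' and '<', which only the markup has.
        System.out.println(found(">.*test.*<", source)); // true
        System.out.println(found(">.*test.*<", text));   // false

        // Without the angle brackets, the regex matches the parsed text.
        System.out.println(found(".*?test.*?", text));   // true
    }
}
```

The same regex succeeds against the markup and fails against the parsed text, which is the mismatch described above.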

Also note the difference between getElementsMatchingText and getElementsMatchingOwnText: the former uses text(), which includes the text nodes of the element and its descendants, whilst the latter uses [ownText()](https://jsoup.org/apidocs/org/jsoup/nodes/Element.html#ownText()), which includes only the element's directly owned text node(s).

I would suggest doing something like:

String regex = ".*?test.*?";
String selector = String.format("textarea:matchesWholeOwnText(%s)", regex);
Elements els = doc.select(selector);

Or if you prefer:

String regex = ".*?test.*?";
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
Elements els = doc.getElementsMatchingOwnText(pattern);
els.forEach(element -> {
    if (element.nameIs("textarea")) {
        System.out.println("matched");
    }
});

Both of those find the textarea matching the corrected regex.

Hope this helps!