Matching error with getElementsMatchingText()

jhy / jsoup

jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.

MIT License

import java.io.IOException;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class HtmlParse {
    public static void main(String[] args) throws IOException {
        // Fetch and parse the test page
        Document doc = Jsoup.connect("http://127.0.0.1:8090/test.php?q=aa&w=test&e=aa&r=aaa&t=aa").get();
        String html = doc.html();
        System.out.println(html);

        // Regex that is expected to match the element containing "test"
        final String regex = ">.*test.*<";
        final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
        Elements element = doc.getElementsMatchingText(pattern);
        if (!element.isEmpty()) {
            element.forEach(System.out::println);
        } else {
            System.out.println("Did not find any element");
        }
    }
}

<html>
<head></head>
<body>
<a href="?q=1&w=2&e=3&r=4&t=5"></a>
<script> var a = "aa";</script>
aa
<div>
<textarea>test</textarea>
</div>
<input style="color:aa" value="aaa">
</body>
</html>

Well, note that getElementsMatchingText runs the regex against the parsed [text()](https://jsoup.org/apidocs/org/jsoup/nodes/Element.html#text()) of elements, not the original HTML source.

In your parsed DOM tree, you have element nodes (e.g. textarea) which contain text nodes (e.g. test). The text itself contains no > or < characters, so the >.*test.*< regex has nothing to match.
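To make that concrete, here is a small sketch using only java.util.regex, no jsoup involved. It assumes the pattern is applied with find-style matching, and the class name RegexAgainstText and helper found are made up for illustration. It runs the original pattern against a raw HTML fragment and against the plain text that the parser would hand to the regex:

```java
import java.util.regex.Pattern;

public class RegexAgainstText {
    // The pattern from the question: it expects literal angle brackets around the word.
    static final Pattern ORIGINAL = Pattern.compile(">.*test.*<", Pattern.MULTILINE);

    // Hypothetical helper: true if the pattern occurs anywhere in the input.
    static boolean found(Pattern pattern, String input) {
        return pattern.matcher(input).find();
    }

    public static void main(String[] args) {
        String rawHtml = "<textarea>test</textarea>"; // what the HTML source looks like
        String parsedText = "test";                   // the text content of the textarea

        System.out.println(found(ORIGINAL, rawHtml));                   // true
        System.out.println(found(ORIGINAL, parsedText));                // false
        System.out.println(found(Pattern.compile("test"), parsedText)); // true
    }
}
```

The second call is the situation in the question: the regex is written for markup, but it only ever sees the text.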

Also note the difference between getElementsMatchingText and getElementsMatchingOwnText: the former uses text() which includes textnodes of the element and its descendants; whilst the latter uses [ownText()](https://jsoup.org/apidocs/org/jsoup/nodes/Element.html#ownText()) which includes only the element's directly owned textnode(s).
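A rough sketch of that difference, assuming jsoup is on the classpath. The HTML string is a trimmed-down stand-in for the page in the question, and the class and helper names (TextVsOwnText, tagNames, matchingText, matchingOwnText) are invented for this example:

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

public class TextVsOwnText {
    // Simplified stand-in for the page in the question (assumption for illustration).
    static final String HTML =
        "<html><head></head><body> aa <div><textarea>test</textarea></div></body></html>";

    // Collects the tag name of every matched element.
    static List<String> tagNames(Elements els) {
        return els.stream().map(el -> el.tagName()).collect(Collectors.toList());
    }

    // Matches against text(): the element's own text plus that of its descendants.
    static List<String> matchingText(Pattern p) {
        return tagNames(Jsoup.parse(HTML).getElementsMatchingText(p));
    }

    // Matches against ownText(): only text nodes that are direct children.
    static List<String> matchingOwnText(Pattern p) {
        return tagNames(Jsoup.parse(HTML).getElementsMatchingOwnText(p));
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("test");
        // Includes ancestors such as body and div, because their text() contains "test".
        System.out.println(matchingText(p));
        // Only the textarea directly owns the "test" text node.
        System.out.println(matchingOwnText(p));
    }
}
```

If you only want the innermost element, the own-text variant is usually the one to reach for; the text() variant also returns every ancestor of the match.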

I would suggest doing something like:

String regex = ".*?test.*?";
String selector = String.format("textarea:matchesWholeOwnText(%s)", regex);
Elements els = doc.select(selector);

Or if you prefer:

String regex = ".*?test.*?";
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
Elements els = doc.getElementsMatchingOwnText(pattern);
els.forEach(element -> {
    if (element.nameIs("textarea")) {
        System.out.println("matched");
    }
});

Both of those find the textarea matching the corrected regex.

Hope this helps!

Matching error with getElementsMatchingText() #2163