jhy / jsoup

jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.
https://jsoup.org
MIT License
10.95k stars 2.19k forks source link

Some CSS query find elements very slowly #1956

Closed 821938089 closed 1 year ago

821938089 commented 1 year ago

Example document : https://www.123duw.com/dudu-38/51311/ CSS query : dl > dt:nth-of-type(2) ~ dd > a In jsoup, It took about 4 minutes.

In Chrome, it took only 50ms to complete. image

jsoup version : 1.16.1

jhy commented 1 year ago

What platform are you on? I tried this and on my M1, it took 7.1 seconds (first run, no hotspot etc). Certainly it would be good to improve that, but I wonder if something else is impacting the speed?

821938089 commented 1 year ago

I am running on Android platform and my phone chip is Snapdragon 660.

[Snipped large images]

821938089 commented 1 year ago

This amount of operations looks like an exponential increase.

You can try how long this document takes: https://www.123duw.com/dudu-32/50731/

jhy commented 1 year ago

OK, surprised that it performs so differently on Android. I guess the combination of the high query count and the memory allocation in nth-of-type is causing the difference in.

Yes, I believe the ~ (preceding sibling) combinator is causing the query count to go exponential. Similar to repeatedly backtracking to try every variant. If we can rein that in, would expect a massive performance improvement.

jhy commented 1 year ago

Thanks for reporting this! I have optimized the selector by changing the query evaluation order of the any preceding sibling operator, and by memoizing the results. On my M1 this moves the execution time from ~ 7 seconds to 0.01 seconds.

I also improved the memory consumption (by removing ephemeral calls to children(), which allocates a new Elements list on each hit).

So I expect you will see a good improvement; if you could test with a snapshot version and report back that would be great!

Future work would be to improve in what order the And evaluator is executed, so simpler expressions (like a tag test) occur before slower expressions like nth-of-type.

821938089 commented 1 year ago

Yes, I see a huge improvement. The execution time has been reduced from 4 min 30 s to 125 ms.

Test Result ![IMG_20230529_165548](https://github.com/jhy/jsoup/assets/8674809/26d60023-a132-4126-9aaf-0adb9ddb0167)
jhy commented 1 year ago

I made the memoization a bit more generic, so that it applies to each of the Structural Evaluators. And made it more correct so that reused Evaluators have a ThreadLocal memo.

c57e68388f521bbb84cc4b281e6f79eb46d7a3ad