What steps will reproduce the problem?
1. Call ArticleExtractor.getInstance().getText() on the attached example data
(Stability.html).
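A minimal timing harness for the step above might look like the following sketch. The class name ExtractionTimer and the placeholder extractor are hypothetical and not part of boilerpipe; the commented-out call assumes the de.l3s.boilerpipe.extractors package of the 1.1.0/1.2.0 releases is on the classpath.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.Callable;

// Hypothetical timing harness; not part of boilerpipe.
final class ExtractionTimer {

    /** Runs the task once and returns the elapsed wall-clock time in milliseconds. */
    static long timeMillis(Callable<String> task) {
        long start = System.nanoTime();
        try {
            task.call();
        } catch (Exception e) {
            // Callable may throw checked exceptions (as the real extractor would).
            throw new IllegalStateException("extraction failed", e);
        }
        return (System.nanoTime() - start) / 1_000_000L;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: ExtractionTimer <path-to-Stability.html>");
            return;
        }
        final String html =
                new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);

        // Placeholder extractor so the sketch compiles without boilerpipe.
        // With boilerpipe on the classpath, swap the lambda body for (assumed package):
        //   de.l3s.boilerpipe.extractors.ArticleExtractor.getInstance().getText(html)
        Callable<String> extract = () -> html.trim();

        System.out.println("Extraction took " + timeMillis(extract) + " ms");
    }
}
```

A single cold run includes JIT warm-up, so repeating the measurement a few times gives a fairer comparison against other HTML files of similar size.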
What is the expected output? What do you see instead?
The extraction takes a very long time (1-3 minutes, depending on hardware and
JVM load), with heavy memory reallocation in StringBuilder during
Matcher.replaceAll calls. HTML of this size typically takes 2-3 seconds on the
same hardware.
What version of the product are you using? On what operating system?
1.1.0 & 1.2.0 on Ubuntu 12.04 with Oracle JVM
java version "1.7.0_67"
Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
Please provide any additional information below.
The attached patch fixes the slow extraction and improves the handling of
tokens that contain word, non-word, and transitional characters.
Note: I am not the author of the attached HTML file that triggers the
slowdown.
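For illustration only, here is one general way to tokenize without chained Matcher.replaceAll passes: a single scan that emits a token whenever the character class changes. This is not the attached patch and not boilerpipe's tokenizer. The class name SinglePassTokenizer and the three character classes (whitespace, letter-or-digit, everything else) are assumptions made for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the attached patch and not boilerpipe code.
final class SinglePassTokenizer {
    private static final int SPACE = 0; // whitespace, never emitted
    private static final int WORD = 1;  // letters and digits (assumed definition)
    private static final int OTHER = 2; // punctuation and other non-word characters

    private static int classify(char c) {
        if (Character.isWhitespace(c)) return SPACE;
        return Character.isLetterOrDigit(c) ? WORD : OTHER;
    }

    /**
     * Splits text into runs of word characters and runs of non-word characters,
     * dropping whitespace. Each character is visited once and no intermediate
     * strings are built. Works on UTF-16 chars, so surrogate pairs are not
     * treated specially.
     */
    static List<String> tokenize(CharSequence text) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        int prev = SPACE;
        // Iterate one position past the end so the final run is flushed.
        for (int i = 0; i <= text.length(); i++) {
            int cls = (i == text.length()) ? SPACE : classify(text.charAt(i));
            if (cls != prev) {
                if (prev != SPACE) {
                    tokens.add(text.subSequence(start, i).toString());
                }
                start = i;
                prev = cls;
            }
        }
        return tokens;
    }
}
```

For example, tokenize("foo-bar, baz") returns the five tokens foo, -, bar, the comma, and baz. The running time grows linearly with the input length, whereas each replaceAll pass copies the whole input into a new buffer.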
Original issue reported on code.google.com by johnpme...@gmail.com on 14 Oct 2014 at 7:22