janih / boilerpipe

Boilerplate Removal and Fulltext Extraction from HTML pages

Performance issues with UnicodeTokenizer #80

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Call ArticleExtractor.getInstance().getText() on the example data
(Stability.html).
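A minimal timing harness for this step might look like the sketch below. The class name ExtractionTimer and the TextExtractor adapter are hypothetical, introduced only so the harness compiles without boilerpipe on the classpath; the stand-in extractor in main just returns its input.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.TimeUnit;

public class ExtractionTimer {

    /** Hypothetical adapter; not part of boilerpipe. */
    public interface TextExtractor {
        String extract(String html) throws Exception;
    }

    /** Runs the extractor once and returns the elapsed wall-clock time in milliseconds. */
    public static long timeMillis(TextExtractor extractor, String html) throws Exception {
        long start = System.nanoTime();
        extractor.extract(html);
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java ExtractionTimer <path-to-html>");
            return;
        }
        // Read the sample page (for example, the attached Stability.html) as UTF-8.
        String html = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);

        // Stand-in extractor that returns the input unchanged.
        // Replace its body with the boilerpipe call to reproduce the report.
        TextExtractor extractor = new TextExtractor() {
            @Override
            public String extract(String input) {
                return input;
            }
        };

        System.out.println("extraction took " + timeMillis(extractor, html) + " ms");
    }
}
```

To reproduce the report, the body of extract would return ArticleExtractor.getInstance().getText(input), assuming the getText overload that accepts an HTML string is the one being called.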

What is the expected output? What do you see instead?
The extraction takes a very long time (1-3 minutes, depending on hardware and
JVM load), with heavy memory reallocation in StringBuilder during
Matcher.replaceAll calls. HTML of this size typically takes 2-3 s on the same
hardware.
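To check whether the time goes into tokenization rather than HTML parsing, a tokenizer built from chained Matcher.replaceAll calls can be timed in isolation. The sketch below is a stand-in, not boilerpipe's code: the class name, both patterns, and the U+2063 marker character are assumptions chosen only to mimic that general shape, and the synthetic input is not guaranteed to trigger the same slowdown as the attached page.

```java
import java.util.regex.Pattern;

public class RegexChainTokenizerSketch {

    // Assumed patterns: mark every word boundary, then collapse markers and spaces.
    private static final Pattern WORD_BOUNDARY = Pattern.compile("\\b");
    private static final Pattern SEPARATORS = Pattern.compile("[ \u2063]+");

    /** Tokenizes by running two full replaceAll passes, each building a new string. */
    public static String[] tokenize(CharSequence text) {
        // Pass 1: insert an invisible marker at each word boundary.
        String marked = WORD_BOUNDARY.matcher(text).replaceAll("\u2063");
        // Pass 2: turn each run of markers and spaces into a single space, then split.
        return SEPARATORS.matcher(marked).replaceAll(" ").trim().split(" +");
    }

    public static void main(String[] args) {
        // Build a large synthetic input of short space-separated words.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200_000; i++) {
            sb.append("word").append(i % 10).append(' ');
        }

        long start = System.nanoTime();
        int count = tokenize(sb).length;
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println(count + " tokens in " + ms + " ms");
    }
}
```

Feeding the text of the attached page into tokenize instead of the synthetic input, and comparing the result with the total extraction time, would show how much of the delay such a chain accounts for.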

What version of the product are you using? On what operating system?
1.1.0 & 1.2.0 on Ubuntu 12.04 with Oracle JVM
java version "1.7.0_67"
Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)

Please provide any additional information below.
The attached patch fixes the performance problem and improves the
handling of tokens containing word, non-word, and transitional characters.

Note: I am not the author of the attached HTML file that triggers the slow
extraction.

Original issue reported on code.google.com by johnpme...@gmail.com on 14 Oct 2014 at 7:22

Attachments: