hylong / cx-extractor

Automatically exported from code.google.com/p/cx-extractor

A time-consuming step #4

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
source = links.matcher(source).replaceAll("");

Example: http://news.itxinwen.com/2013/0802/515691.shtml

This step alone takes more than 90 seconds on that page.

Suggestion: could all tags simply be removed with source = source.replaceAll("<[^>]+>", ""); instead?
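
The suggestion above can be sketched as follows. This is a minimal illustration, not the project's code: the class name `TagStripper` and the method `stripTags` are hypothetical, and the only piece taken from the issue is the regular expression `<[^>]+>`. The pattern is compiled once, because `String.replaceAll` compiles its regex on every call.

```java
import java.util.regex.Pattern;

public class TagStripper {
    // Hypothetical helper: the regex comes from the suggestion above.
    // Compiled once and reused; String.replaceAll would recompile it on each call.
    private static final Pattern TAGS = Pattern.compile("<[^>]+>");

    /** Removes every tag-like span and keeps the text between tags. */
    public static String stripTags(String source) {
        return TAGS.matcher(source).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>Hello <a href=\"x.html\">world</a>!</p>";
        System.out.println(stripTags(html)); // prints "Hello world!"
    }
}
```

Because `[^>]+` cannot run past the next `>`, each match attempt is bounded by the nearest closing bracket. Whether this removes the 90-second delay on the sample page has not been measured here.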

Original issue reported on code.google.com by ywq1...@gmail.com on 2 Aug 2013 at 8:01

GoogleCodeExporter commented 9 years ago
private static Pattern links = Pattern.compile("<[^>]+>.*?</[aA]>");

This works better for cases such as <a>contents<a>.

The only drawback is that any hyperlinked text inside the article body will be deleted as well.
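
To make the drawback concrete, here is a small sketch that applies the pattern quoted above and, for comparison, the tag-only pattern `<[^>]+>` from the first comment to the same input. The class name `LinkRemovalDemo`, the method names, and the sample HTML are made up for illustration. Note that `<[^>]+>` at the start of the link pattern matches any tag, not only `<a ...>`, so on this input the match begins at `<p>` and the text before the link is removed too.

```java
import java.util.regex.Pattern;

public class LinkRemovalDemo {
    // Pattern quoted in the comment above.
    private static final Pattern LINKS = Pattern.compile("<[^>]+>.*?</[aA]>");
    // Tag-only pattern from the earlier suggestion, for comparison.
    private static final Pattern TAGS = Pattern.compile("<[^>]+>");

    /** Deletes each span running from a tag up to the next closing </a>. */
    public static String removeLinks(String source) {
        return LINKS.matcher(source).replaceAll("");
    }

    /** Deletes only the tags and keeps all text, including link text. */
    public static String stripTags(String source) {
        return TAGS.matcher(source).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>Read the <a href=\"r.html\">full report</a> today.</p>";
        System.out.println(removeLinks(html)); // prints " today.</p>"
        System.out.println(stripTags(html));   // prints "Read the full report today."
    }
}
```

Since `.` does not match line terminators by default in `java.util.regex`, the link pattern can only match within a single line unless `Pattern.DOTALL` is set.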

Original comment by ywq1...@gmail.com on 2 Aug 2013 at 9:57
