jaeksoft / opensearchserver

Open-source Enterprise Grade Search Engine Software
http://www.opensearchserver.com
Apache License 2.0
499 stars, 191 forks

content is extracted twice while using a regexp in the HTMLParser on HtmlSource field #1897

Open emmanuel-keller opened 6 years ago

emmanuel-keller commented 6 years ago

Add a regexp to the HTML Parser on the htmlSource field: (?s)(<article(?:.*?)?>(.*?)<\/article>)

The content is extracted twice.
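One possible explanation, offered only as a hypothesis, is that the pattern has two capturing groups: an outer one around the whole article element and an inner one around its body. If the parser stores every capturing group of a match (an assumption, not confirmed in this report), the article text would end up in the field twice. The sketch below uses Python's re module as a stand-in for the Java regex engine, assumes the quantifiers in the reported pattern are `.*?`, and uses a made-up HTML sample; the helper name all_groups is hypothetical.

```python
import re

# Pattern from the report; the `.*?` quantifiers are an assumption.
# Group 1 wraps the whole <article> element, group 2 wraps only its body.
REPORTED = re.compile(r"(?s)(<article(?:.*?)?>(.*?)<\/article>)")

# Hypothetical variant with a single capturing group (the body only).
SINGLE_GROUP = re.compile(r"(?s)<article(?:.*?)?>(.*?)<\/article>")

# Made-up sample document, not taken from the issue.
HTML = '<html><body><article class="post">Hello\nworld</article></body></html>'


def all_groups(pattern, text):
    """Collect every capturing group of every match, in order.

    This mimics a parser that stores each group as a field value,
    which is an assumption about the HTML Parser's behaviour.
    """
    return [group for match in pattern.finditer(text) for group in match.groups()]


reported_values = all_groups(REPORTED, HTML)
single_values = all_groups(SINGLE_GROUP, HTML)

# With the reported pattern the body text appears in two extracted values.
print(len(reported_values), len(single_values))
```

With this sample, the reported pattern yields two values that both contain the article text, while the single-group variant yields one. If the hypothesis holds, dropping the outer parentheses (or making the inner group non-capturing) would be a workaround until the parser is fixed.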