manticoresoftware / manticoresearch

Easy to use open source fast database for search | Good alternative to Elasticsearch now | Drop-in replacement for E in the ELK soon
https://manticoresearch.com
GNU General Public License v3.0
9.03k stars 506 forks source link

The html_remove_elements configuration option does not correctly handle self-closing HTML tags. #280

Open pupaxxo opened 4 years ago

pupaxxo commented 4 years ago

Manticore 2.8.2 on CentOS 7

~ $ searchd -v
Manticore 2.8.2 4e81114@190402 release
Copyright (c) 2001-2016, Andrew Aksyonoff
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
Copyright (c) 2017-2019, Manticore Software LTD (http://manticoresearch.com)

Describe the problem

The html_remove_elements configuration option does not correctly handle self-closing HTML tags.

Steps to reproduce:

With a configuration like this:

html_remove_elements = style, script, img, meta    

If we try to index this HTML:

<html>    
<head>    
<meta charset="utf-8">    
</head>    
<body>    
Tokens that should be correctly indexed    
</body>    
</html>    

manticore won't index any token. If the meta tag is closed like this <meta charset="utf-8" /> it will work as expected. We think that Manticore is removing all the HTML after the meta tag, not detecting that meta is a self-closing tag, so it doesn't need a tag.

adriannuta commented 4 years ago

There is no need to add meta or img on remove list, since they are stripped anyway (available for all void elements [area, base, br, col, embed, hr, img, input, link, meta, param, source, track, wbr]).

pupaxxo commented 4 years ago

Yes, they are not needed and they can be removed. I don't think this "fix" the issue. Just a simple not-closed style or script tag will empty all the tokens, indexing anything. It would be better (and more safe) if Manticore, when it cannot find the closing tag, doesn't empty all the page contents, maybe, it can only remove the opening tag. We would better handle non-closed script/style tags (indexing at least something) and in the case of self-closing tags it would work as expected. What do you think?

manticoresearch commented 4 years ago

html_remove_elements = style, script, img, meta

Tokens that should be correctly indexed

So the problem here appears is when:

The logic is that the code behind html_remove_elements doesn't check if the element is supposed to contain any content or not (like in case of meta), it just removes it.

What we could do to improve this behavior is: 1) show a warning that meta is listed in html_remove_elements while it normally shouldn't as it's stripped anyway. 2) silently skip removing content for void elements if they're listed in html_remove_elements.

We'll think about this.

Let's now think about this:

It would be better (and more safe) if Manticore, when it cannot find the closing tag, doesn't empty all the page contents

Speaking about this in general (not about void elements) it doesn't seem practical as it's difficult to understand when a closing tag (for a non-void element, e.g. <p>) is missing intentionally and when it's not. E.g.

    csvpipe_command = echo "1,'<div><p>abc</div>def <div>other content</div>'"
...
   html_remove_elements = p

Will leave the index empty, but it's fine as everything after the opening <p> is the <p> element which according the config should be removed and we cannot guess to where it makes sense to put the closing </p>

manticoresearch commented 4 years ago

We've decided to not allow void elements in html_remove_elements, i.e. they will be just skipped. The idea is that they're anyway stripped and given they can't have any content it's less error prone to just skip them if they're listed in html_remove_elements.

pupaxxo commented 4 years ago

Perfect. Can I ask you if it's possibile to update the RT configuration on an already populated index? We saw that the configuration is saved in the index files, not allowing us to update it. We need to fully recreate the index?

tomatolog commented 4 years ago

you could with ALTER RECONFIGURE

however RT saves already collected data into disk chunk then apply new setting, ie old data will be with old settings new data will be with new settings