Open pupaxxo opened 5 years ago
There is no need to add meta
or img
on remove list, since they are stripped anyway (available for all void elements [area, base, br, col, embed, hr, img, input, link, meta, param, source, track, wbr]).
Yes, they are not needed and they can be removed. I don't think this "fix" the issue. Just a simple not-closed style or script tag will empty all the tokens, indexing anything. It would be better (and more safe) if Manticore, when it cannot find the closing tag, doesn't empty all the page contents, maybe, it can only remove the opening tag. We would better handle non-closed script/style tags (indexing at least something) and in the case of self-closing tags it would work as expected. What do you think?
html_remove_elements = style, script, img, meta
Tokens that should be correctly indexed
So the problem here appears is when:
The logic is that the code behind html_remove_elements doesn't check if the element is supposed to contain any content or not (like in case of meta), it just removes it.
What we could do to improve this behavior is: 1) show a warning that meta is listed in html_remove_elements while it normally shouldn't as it's stripped anyway. 2) silently skip removing content for void elements if they're listed in html_remove_elements.
We'll think about this.
Let's now think about this:
It would be better (and more safe) if Manticore, when it cannot find the closing tag, doesn't empty all the page contents
Speaking about this in general (not about void elements) it doesn't seem practical as it's difficult to understand when a closing tag (for a non-void element, e.g. <p>
) is missing intentionally and when it's not. E.g.
csvpipe_command = echo "1,'<div><p>abc</div>def <div>other content</div>'"
...
html_remove_elements = p
Will leave the index empty, but it's fine as everything after the opening <p>
is the <p>
element which according the config should be removed and we cannot guess to where it makes sense to put the closing </p>
We've decided to not allow void elements in html_remove_elements, i.e. they will be just skipped. The idea is that they're anyway stripped and given they can't have any content it's less error prone to just skip them if they're listed in html_remove_elements.
Perfect. Can I ask you if it's possibile to update the RT configuration on an already populated index? We saw that the configuration is saved in the index files, not allowing us to update it. We need to fully recreate the index?
you could with ALTER RECONFIGURE
however RT saves already collected data into disk chunk then apply new setting, ie old data will be with old settings new data will be with new settings
Manticore 2.8.2 on CentOS 7
~ $ searchd -v
Manticore 2.8.2 4e81114@190402 release
Copyright (c) 2001-2016, Andrew Aksyonoff
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
Copyright (c) 2017-2019, Manticore Software LTD (http://manticoresearch.com)
Describe the problem
The
html_remove_elements
configuration option does not correctly handle self-closing HTML tags.Steps to reproduce:
With a configuration like this:
If we try to index this HTML:
manticore won't index any token. If the meta tag is closed like this
<meta charset="utf-8" />
it will work as expected. We think that Manticore is removing all the HTML after the meta tag, not detecting that meta is a self-closing tag, so it doesn't need a tag.