Closed: ezekg closed this 2 years ago
I also tried exclude_html_selector = '.noindex', but I still see code blocks with that class showing up in the index. I'll try to put together a reproducible example tomorrow morning.
Hmm, that's not great. I'll take a look and see if exclude_html_selector stopped working at some point, and make sure it works with pre tags. Thanks for the report :)
-James
It seems to happen when there is more than one .noindex element on a page. Only the first one is excluded.
Here's a reproducible test case:
stork.toml:
[input]
exclude_html_selector = '.noindex'
html_selector = 'main'
files = [
  { path = 'index.html', url = '/', title = 'Index' },
]
index.html:
<main>
  <p>DO_INDEX</p>
  <p class='noindex'>DO_NOT_INDEX</p>
  <p class='noindex'>DO_NOT_INDEX</p>
</main>
When you run stork test --config stork.toml, the word DO_NOT_INDEX will be indexed once.
Possibly related to https://github.com/kuchiki-rs/kuchiki/issues/81 and Stork's usage below?
Also https://github.com/kuchiki-rs/kuchiki/issues/85#issuecomment-781900129 is worth reading too.
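If the linked issues apply here, one plausible explanation is that elements are being detached while the tree is still being walked, so the walk loses its place after the first removal. This is an assumption on my part and is not confirmed anywhere in this thread. The sketch below does not use kuchiki or Stork code at all. It is a std-only toy model with hypothetical names (Doc, Node, exclude_while_walking, exclude_after_collecting) that reproduces the same symptom with the three paragraphs from the test case above.

```rust
// Toy model only: a flat sibling list standing in for the children of <main>.
// None of these names come from kuchiki or Stork; they are hypothetical.
struct Node {
    class: &'static str,
    text: &'static str,
    prev: Option<usize>,
    next: Option<usize>,
}

struct Doc {
    nodes: Vec<Node>,
    first: Option<usize>,
}

impl Doc {
    // Build a sibling list from (class, text) pairs.
    fn new(items: &[(&'static str, &'static str)]) -> Doc {
        let n = items.len();
        let nodes = items
            .iter()
            .enumerate()
            .map(|(i, &(class, text))| Node {
                class,
                text,
                prev: if i > 0 { Some(i - 1) } else { None },
                next: if i + 1 < n { Some(i + 1) } else { None },
            })
            .collect();
        Doc { nodes, first: if n > 0 { Some(0) } else { None } }
    }

    // Unlink a node from its siblings and clear its own links.
    fn detach(&mut self, id: usize) {
        let (prev, next) = (self.nodes[id].prev, self.nodes[id].next);
        match prev {
            Some(p) => self.nodes[p].next = next,
            None => self.first = next,
        }
        if let Some(n) = next {
            self.nodes[n].prev = prev;
        }
        self.nodes[id].prev = None;
        self.nodes[id].next = None;
    }

    // Join the text of every node still attached to the list.
    fn text(&self) -> String {
        let mut words = Vec::new();
        let mut cur = self.first;
        while let Some(id) = cur {
            words.push(self.nodes[id].text);
            cur = self.nodes[id].next;
        }
        words.join(" ")
    }
}

// Problematic pattern: detach during the walk. After the first detach the
// current node has no `next` link, so the walk ends early.
fn exclude_while_walking(doc: &mut Doc, class: &str) {
    let mut cur = doc.first;
    while let Some(id) = cur {
        if doc.nodes[id].class == class {
            doc.detach(id);
        }
        cur = doc.nodes[id].next;
    }
}

// Safer pattern: finish the walk first, then detach every collected match.
fn exclude_after_collecting(doc: &mut Doc, class: &str) {
    let mut matches = Vec::new();
    let mut cur = doc.first;
    while let Some(id) = cur {
        if doc.nodes[id].class == class {
            matches.push(id);
        }
        cur = doc.nodes[id].next;
    }
    for id in matches {
        doc.detach(id);
    }
}
```

With the three paragraphs from the test case, exclude_while_walking leaves "DO_INDEX DO_NOT_INDEX" behind, which matches the word being indexed once, while exclude_after_collecting leaves only "DO_INDEX". If the real cause is similar, collecting the matched nodes before detaching any of them would be one way to avoid it.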
Here's a failing test for stork-lib/src/index_v3/build/fill_intermediate_entries/word_list_generators/html_word_list_generator.rs:
#[test]
fn test_html_content_extraction_with_multiple_excluded_selectors() {
    run_html_parse_test(
        "This content should be indexed This content should also be indexed",
        Some(".yes"),
        Some(".no"),
        r#"
            <html>
                <head></head>
                <body>
                    <h1>This is a title</h1>
                    <main>
                        <section class="yes" id="first">
                            <p>This content should be indexed</p>
                            <p id="second">This content should also be indexed</p>
                            <p class="no">This content should not be indexed</p>
                            <p class="no">This content should also not be indexed</p>
                        </section>
                    </main>
                </body>
            </html>"#,
    )
}
failures:
---- index_v3::build::fill_intermediate_entries::word_list_generators::html_word_list_generator::tests::test_html_content_extraction_with_multiple_excluded_selectors stdout ----
thread 'index_v3::build::fill_intermediate_entries::word_list_generators::html_word_list_generator::tests::test_html_content_extraction_with_multiple_excluded_selectors' panicked at 'assertion failed: `(left == right)`
left: `"This content should be indexed This content should also be indexed"`,
right: `"This content should be indexed This content should also be indexed This content should also not be indexed"`: expected: This content should be indexed This content should also be indexed
computed: This content should be indexed This content should also be indexed This content should also not be indexed', stork-lib/src/index_v3/build/fill_intermediate_entries/word_list_generators/html_word_list_generator.rs:172:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Thanks for the fix, @jameslittle230! All .noindex tags are now ignored, and our search index is down to 25 MB uncompressed (607 KB compressed with Brotli). 🖖
That's great to hear!
My company's documentation has a lot of <pre> tags containing code snippets, and Stork seems to be indexing all of these. Is there a way for me to configure Stork to ignore <pre> and possibly even <code> tags? I tried setting exclude_html_selector = 'pre' and also tried exclude_html_selector = 'pre, code', but neither seems to have an effect. I'm guessing this is also why our search index is about 120 MB uncompressed, which I'd really like to lower. 😃