How to configure Stork to ignore <pre> tags?

ezekg commented 2 years ago

My company's documentation has a lot of <pre> tags containing code snippets, and Stork seems to be indexing all of these. Is there a way for me to configure Stork to ignore <pre> and possibly even <code> tags? I tried setting exclude_html_selector = 'pre' and also tried exclude_html_selector = 'pre, code', but neither seems to have an effect.

I'm guessing this is also why our search index is about 120MB uncompressed, which I'd really like to lower. 😃

ezekg commented 2 years ago

I also tried exclude_html_selector = '.noindex', but I still see code blocks with that class showing up in the index. I’ll try to put together a reproducible example tomorrow morning.

jameslittle230 commented 2 years ago

Hmm, that's not great. I'll take a look and see if exclude_html_selector stopped working at some point, and make sure it works with pre tags. Thanks for the report :)

-James

ezekg commented 2 years ago

It seems to happen when there is more than one .noindex element on a page. Only the first matching element is excluded.

Here's a reproducible test case:

stork.toml:

[input]
exclude_html_selector = '.noindex'
html_selector = 'main'
files = [
  { path = 'index.html', url = '/', title = 'Index' },
]

index.html:

<main>
  <p>DO_INDEX</p>
  <p class='noindex'>DO_NOT_INDEX</p>
  <p class='noindex'>DO_NOT_INDEX</p>
</main>

When you run stork test --config stork.toml, the word DO_NOT_INDEX is still indexed once, even though both paragraphs that contain it have the noindex class.

Possibly related to https://github.com/kuchiki-rs/kuchiki/issues/81 and Stork's usage below?

https://github.com/jameslittle230/stork/blob/4d301cd641a09ccd9e7964182b0513761dc3c20b/stork-lib/src/index_v3/build/fill_intermediate_entries/word_list_generators/html_word_list_generator.rs#L50-L54

https://github.com/kuchiki-rs/kuchiki/issues/85#issuecomment-781900129 is also worth reading.
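
To illustrate the kind of failure suspected here, below is a minimal, self-contained sketch. It is a toy model, not kuchiki's API and not Stork's code: the Tree and Node types and both function names are hypothetical, and it assumes that detaching a node clears its sibling link, so a walk that is still positioned on that node has nowhere to go next. Nothing in this thread confirms that this is the actual root cause or how the eventual fix was written.

```rust
// Toy model of a sibling list. NOT kuchiki's API; all names are hypothetical.
// Assumption: detaching a node clears its own `next` link.

struct Node {
    text: &'static str,
    excluded: bool,      // stands in for "matches exclude_html_selector"
    next: Option<usize>, // arena index of the next sibling
}

struct Tree {
    nodes: Vec<Node>,
    first: Option<usize>,
}

impl Tree {
    /// Build a flat sibling list from (text, excluded) pairs.
    fn new(items: &[(&'static str, bool)]) -> Self {
        let n = items.len();
        let nodes = items
            .iter()
            .enumerate()
            .map(|(i, &(text, excluded))| Node {
                text,
                excluded,
                next: if i + 1 < n { Some(i + 1) } else { None },
            })
            .collect();
        Tree { nodes, first: if n > 0 { Some(0) } else { None } }
    }

    /// Unlink `id` from the list and clear its own sibling link.
    fn detach(&mut self, id: usize) {
        let next = self.nodes[id].next;
        if self.first == Some(id) {
            self.first = next;
        }
        for node in self.nodes.iter_mut() {
            if node.next == Some(id) {
                node.next = next;
            }
        }
        self.nodes[id].next = None;
    }

    /// Join the text of every node still attached to the list.
    fn text(&self) -> String {
        let mut out = Vec::new();
        let mut cur = self.first;
        while let Some(id) = cur {
            out.push(self.nodes[id].text);
            cur = self.nodes[id].next;
        }
        out.join(" ")
    }
}

/// Detaches while walking: the walk ends right after the first detach,
/// because the detached node no longer has a `next` link to follow.
fn exclude_while_iterating(tree: &mut Tree) {
    let mut cur = tree.first;
    while let Some(id) = cur {
        if tree.nodes[id].excluded {
            tree.detach(id);
        }
        cur = tree.nodes[id].next;
    }
}

/// Collects every match first, then detaches, so the walk is never disturbed.
fn exclude_after_collecting(tree: &mut Tree) {
    let mut matches = Vec::new();
    let mut cur = tree.first;
    while let Some(id) = cur {
        if tree.nodes[id].excluded {
            matches.push(id);
        }
        cur = tree.nodes[id].next;
    }
    for id in matches {
        tree.detach(id);
    }
}
```

Built from the three paragraphs in the test case above, exclude_while_iterating leaves "DO_INDEX DO_NOT_INDEX" behind, which mirrors the "indexed once" symptom, while exclude_after_collecting leaves only "DO_INDEX". The collect-then-detach order is a general way to avoid mutating a structure while traversing it.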

ezekg commented 2 years ago

Here's a failing test for stork-lib/src/index_v3/build/fill_intermediate_entries/word_list_generators/html_word_list_generator.rs:

#[test]
fn test_html_content_extraction_with_multiple_excluded_selectors() {
    run_html_parse_test(
        "This content should be indexed This content should also be indexed",
        Some(".yes"),
        Some(".no"),
        r#"
    <html>
        <head></head>
        <body>
            <h1>This is a title</h1>
            <main>
                <section class="yes" id="first">
                    <p>This content should be indexed</p>
                    <p id="second">This content should also be indexed</p>
                    <p class="no">This content should not be indexed</p>
                    <p class="no">This content should also not be indexed</p>
                </section>
            </main>
        </body>
    </html>"#,
    )
}

failures:

---- index_v3::build::fill_intermediate_entries::word_list_generators::html_word_list_generator::tests::test_html_content_extraction_with_multiple_excluded_selectors stdout ----
thread 'index_v3::build::fill_intermediate_entries::word_list_generators::html_word_list_generator::tests::test_html_content_extraction_with_multiple_excluded_selectors' panicked at 'assertion failed: `(left == right)`
  left: `"This content should be indexed This content should also be indexed"`,
 right: `"This content should be indexed This content should also be indexed This content should also not be indexed"`: expected: This content should be indexed This content should also be indexed
computed: This content should be indexed This content should also be indexed This content should also not be indexed', stork-lib/src/index_v3/build/fill_intermediate_entries/word_list_generators/html_word_list_generator.rs:172:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

ezekg commented 2 years ago

Thanks for the fix, @jameslittle230! All .noindex tags are now ignored, and our search index is down to 25MB uncompressed (607 KB compressed with Brotli). 🖖

jameslittle230 commented 2 years ago

That's great to hear!

jameslittle230 / stork

How to configure Stork to ignore <pre> tags? #279