jameslittle230 / stork

🔎 Impossibly fast web search, made for static sites.
https://stork-search.net
Apache License 2.0

--output - trims resulting index #262

Open ArsenArsen opened 2 years ago

ArsenArsen commented 2 years ago
while :; do 2>/dev/null stork build --input debug.json --output - | wc --bytes; done

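Related to #261. With --output -, the index written to stdout is sometimes cut short, so the byte count printed by the loop above varies between runs. This makes it impossible to use stork as a filter. Notably, --output /dev/stdout does not have the same issue, implying this is an issue with how Rust opens stdout.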
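
To make the failure rate easier to read than an endless stream of byte counts, the loop can be bounded and its results tallied. The following is only a sketch: `tally_sizes` is a hypothetical helper name, not part of Stork, and it uses the portable `wc -c` in place of `wc --bytes`.

```shell
# Hypothetical helper (not part of Stork): run a command N times and
# tally how often each stdout byte count occurs.
tally_sizes() {
  runs=$1
  shift
  i=0
  while [ "$i" -lt "$runs" ]; do
    # stderr is discarded, as in the loop above; only stdout is measured.
    "$@" 2>/dev/null | wc -c | tr -d ' '
    i=$((i + 1))
  done | sort -n | uniq -c
}
```

It could be invoked as `tally_sizes 200 stork build --input debug.json --output -`. If every run wrote the full index, the tally would contain a single line; additional lines with smaller sizes would indicate truncated runs.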

jameslittle230 commented 2 years ago

I'm unable to reproduce this, unfortunately:

❯ while TRUE; cargo run -- build --input local-dev/test-configs/federalist.toml --output - 2> /dev/null | wc; end
    1512   46242 1125456
    1512   46242 1125456
    1512   46242 1125456
    1512   46242 1125456

However, I suspect that merging https://github.com/jameslittle230/stork/pull/272 will fix this issue. If you're able, could you pull that branch, build Stork locally, and retry your test?

ArsenArsen commented 2 years ago
[i] ~/stork 130 $ while :; do stork build --input - --output - 2>/dev/null <local-dev/test-configs/federalist.toml | wc --bytes; done | uniq
1124840
1125456
1125420
1125456
1124926
1125456
1124919
1125183
1125456
1125220
1125456

The above is unpatched. For some reason, the Federalist Papers example reproduces this issue much less often (I had to use uniq to collapse the runs of repeated correct results).

Patched:

[c] ~/stork$ while :; do ./target/debug/stork build --input - --output - 2>/dev/null <local-dev/test-configs/federalist.toml | wc --bytes; done
1125456
1125456
1125451
1125456
1125456
1124734
1125456

It would appear that flushing does not help (IIRC, I tried this myself after opening the issue anyway). For some reason, though, the STORK-262/fix-write-to-stdout branch is extremely slow.

Please try this on a glibc system (such as the Debian Docker container) too.

It's probably worth noting that I'm using rustc 1.58.1 and cargo 1.58.0

PS: Is there some realtime communication channel? It'd likely be more ergonomic to test these kinds of weird issues that way

jameslittle230 commented 2 years ago

Weird - thanks for checking. I'll be sure not to merge #272 if it makes things too slow.

When you were reproducing it with your own config, was it producing index files that were bigger or smaller than the Federalist Papers example?

I'll keep working on a repro and check back in.

There's no chat set up for the project - I haven't had a need to spin something like that up yet, and I don't yet have a good sense for how useful it would be over GitHub issues and discussions. Happy to consider it, though - any suggestions?

ArsenArsen commented 2 years ago

> Weird - thanks for checking. I'll be sure not to merge https://github.com/jameslittle230/stork/pull/272 if it makes things too slow.

Flush alone shouldn't; I think this was just system load. I can't reproduce it now. Even when reverting the BTreeMap changes, I only get a 13% increase in speed (builds per second).

> When you were reproducing it with your own config, was it producing index files that were bigger or smaller than the Federalist Papers example?

I was under the impression I included my results - my bad! Considerably smaller.

381545
378037
381545
381523

> I'll keep working on a repro and check back in.

This just gets weirder: I am now unable to reproduce it with the flush. This issue would seem to be fixed now? Well, at any rate, stdout not being flushed on exit seems like a Rust runtime bug too.

> There's no chat set up for the project - I haven't had a need to spin something like that up yet, and I don't yet have a good sense for how useful it would be over GitHub issues and discussions. Happy to consider it, though - any suggestions?

I don't really have any special suggestions here, just the usual (Matrix or Libera.Chat; Zulip is also a thing some swear by, but I haven't used it much). Whatever works for you works for me.

applejag commented 2 years ago

Hello, I'm experiencing the same issue while trying to integrate Stork into Emanote (https://github.com/EmaApps/emanote/pull/327).

When using --output -, the index becomes "corrupted"/unusable.

To repro:

  1. Clone https://github.com/jilleJr/notes
  2. Run this snippet to generate an ad-hoc config file:
    echo -e "[input]\nfiles = [" > stork.toml
    while read -r file
    do
    echo "  {path=\"$file\", url=\"$file\", title=\"$(basename "$file")\"}," >> stork.toml
    done < <(find content -name '*.md')
    echo "]" >> stork.toml
  3. Build the index:
    stork build -i stork.toml -o index-from-flag.st
    stork build -i stork.toml -o /dev/stdout > index-from-stdout.st
    stork build -i stork.toml -o - > index-from-dash.st
  4. Attempt a search:

    $ stork search -q foo -i index-from-flag.st
    (large json output)
    
    $ stork search -q foo -i index-from-stdout.st
    (large json output)
    
    $ stork search -q foo -i index-from-dash.st
    thread 'main' panicked at 'split_to out of bounds: 679254 <= 679213', /home/kalle/.cargo/registry/src/github.com-1ecc6299db9ec823/bytes-1.1.0/src/bytes.rs:402:9
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

(Using Stork v1.5.0 btw)
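
The panic message suggests the index from --output - is shorter than its own header expects. One way to check that directly would be to compare it byte-for-byte against the index written with -o to a file. This is a rough sketch: `check_truncation` is a hypothetical helper name, and it assumes two builds from the same config are otherwise byte-identical, which this thread does not confirm.

```shell
# Hypothetical helper (not part of Stork): report whether file $2 is a
# truncated copy of reference file $1.
check_truncation() {
  ref=$1
  other=$2
  ref_size=$(wc -c < "$ref" | tr -d ' ')
  other_size=$(wc -c < "$other" | tr -d ' ')
  if cmp -s "$ref" "$other"; then
    echo "identical ($ref_size bytes)"
  elif [ "$other_size" -lt "$ref_size" ] &&
       head -c "$other_size" "$ref" | cmp -s - "$other"; then
    # The shorter file matches the start of the reference exactly.
    echo "truncated: missing $((ref_size - other_size)) trailing bytes"
  else
    echo "different content ($ref_size vs $other_size bytes)"
  fi
}
```

With the files from step 3, this would be run as `check_truncation index-from-flag.st index-from-dash.st`. A "truncated" result would point at bytes lost at the end of the stdout stream, while "different content" would mean the assumption about identical builds does not hold or the damage is elsewhere.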

jameslittle230 commented 2 years ago

@jilleJr - thanks for the repro steps. I'll take a look later today and report back what I find.