jameslittle230 / stork

🔎 Impossibly fast web search, made for static sites.
https://stork-search.net
Apache License 2.0

how to use Stork to handle a list of 200-500 documents? (lunrjs style) #355

Open stargazer33 opened 1 year ago

stargazer33 commented 1 year ago

I have a collection of the following... let's say "data structures":

[
  {
    "id": "abcd123455",
    "title": "Some title",
    "body": "Contents of the blog post..."
  },
  {
    "id": "xyz986724",
    "title": "Another great title",
    "body": "another contents..."
  }
]

These "data structures" are in my database, so I can export them in any format (HTML, text, JSON, YML...) There are about 200-500 "data structures" per search index. They all have an unique ID (and this ID is not URL). The "body" is about one or two screens big. On the backend I have complete control, so can generate what is necessary, run the stork build command etc...

At the moment the search functionality on my site is implemented with the help of lunrjs (see lunrjs.com/guides/core_concepts.html). I am thinking about migrating to Stork. But... reading the Stork documentation, I get the impression that Stork is designed to index... let's say 5-10 big (HTML?) pages.

So, the question is: how do I use Stork to handle a list (collection? array?) of 200-500 documents? I mean, how do I use Stork in the "lunrjs scenario"? (The first idea that comes to mind is to generate the *.toml config file with one [[input.files]] entry per document/"data structure", and to put each document into a separate file (200-500 files!), roughly as sketched below. That is probably overkill; I don't think Stork was designed for this.)
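
Something along these lines is what I have in mind. The file names are invented, and I am assuming the document ID can simply go into the url field, since that string is just handed back with each search result:

[input]
base_directory = "exported-docs/"

[[input.files]]
path = "abcd123455.txt"
url = "abcd123455"
title = "Some title"
filetype = "PlainText"

[[input.files]]
path = "xyz986724.txt"
url = "xyz986724"
title = "Another great title"
filetype = "PlainText"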

karlwilcox commented 1 year ago

I use Stork to index the text content of about 11,000 web pages, and I use your first idea of doing some "pre-processing" to create a .toml config that contains everything I want indexed. I have a PHP script that scans the relevant HTML files and produces something like this:

[input]
frontmatter_handling = "Omit"
stemming = "None"
minimum_indexed_substring_length = 4
files = [
    { url = "/gallery/000001", title = "Petre (from Boutell's Heraldry)", contents = "this shield was used by boutell as the primary [snip....]", filetype = "PlainText" },
    { url = "/gallery/000002", title = "Boyd Garrison", contents = "shield device of [snip..]", filetype = "PlainText" },
    { url = "/gallery/000003", title = "Example of Varying Edge Types", contents = "this example demonstrates [snip..].", filetype = "PlainText" },
    # [snip 11,000 additional entries]
]

This has the advantage that I can also "pre-scan" the input to strip out any terms that I don't want included in the index.
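
For the JSON export you describe, a script in the same spirit could emit that config directly. Here is a rough Python sketch (my actual script is PHP; the file names documents.json and stork.toml, and the shape of the export, are just placeholders to adapt to your setup):

#!/usr/bin/env python3
# Sketch: build a Stork config from a JSON export of the documents
# described above. File names and the export shape are assumptions.
import json


def toml_string(value: str) -> str:
    # json.dumps yields a double-quoted, escaped string; the escapes it
    # emits (\", \\, \n, \t, \uXXXX) are also valid in TOML basic strings.
    return json.dumps(value)


with open("documents.json", encoding="utf-8") as f:
    documents = json.load(f)  # [{"id": ..., "title": ..., "body": ...}, ...]

lines = ["[input]", "files = ["]
for doc in documents:
    # The url field is whatever string should come back with a search
    # result, so the database ID can go here and the front end can map
    # it to a real link.
    lines.append(
        f"    {{ url = {toml_string(doc['id'])}, "
        f"title = {toml_string(doc['title'])}, "
        f"contents = {toml_string(doc['body'])}, "
        f'filetype = "PlainText" }},'
    )
lines.append("]")

with open("stork.toml", "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")

After that, stork build --input stork.toml --output index.st should work as usual.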