quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
MIT License

wasm RFC #541

Open petr-tik opened 5 years ago

petr-tik commented 5 years ago

Summary

Commit to wasm as one of the targets for tantivy.

Motivation

Makes tantivy available to server-side and web developers natively. Enables developers to use the same index format on the server and the client, and gives the client native bindings to read and query the index, e.g. client-side queries on small-enough index files.

I expect this to give us a competitive advantage over Lucene and help library adoption rates.

Reference-level explanation

Introduce cargo workspaces

Using Cloudflare's wirefilter as an example, we would move current src/ directory to server/ and create a new directory wasm/.

Provide methods to index on server

The server indexer has 2 entry-points: tantivy-cli and library.

Library

Add a method to IndexWriter that serializes the index to a file.

impl IndexWriter {
    // ...
    /// Serialize the index as of the last commit to a file under a given path.
    /// Returns a result with the Opstamp of the serialized index.
    pub fn serialize_for_wasm(&self, filename: &Path) -> Result<Opstamp> {
        todo!() // proposed API, not implemented yet
    }
}

Helps user build a tantivy index that is later serialized into a binary format that tantivy wasm understands.

tantivy-cli/indexer

Add a flag/question at the end to give users an option to serialize the index to wasm format.

Make the wasm library easy to compile and integrate

Add functions for

Enable integrations

Use wasm-pack to build tantivy-wasm. Ship the repo with a JS/HTML component that makes it easy to integrate tantivy wasm to web applications (backend and frontend).

Drawbacks

tantivy was originally conceived as a library for developing server-side indexers.

Making a serious commitment to wasm will affect feature development, programming style and devops infrastructure.

Incompatibility/lack of features

Although wasm support is being increasingly adopted by browser engines (Chromium and Firefox), the API surface is still limited and continues to change fast. For example, threading support is currently a work in progress in the wasm runtime.

Having a clear focus on 1 type of platform (servers) allows us to optimize our solution using SIMD, Rust intrinsics for different platforms and system-specific structure for lock-less programming provided by crossbeam.

Programming style

Changes in any relevant traits and structures cannot break the wasm build. Will require implementing some functionality twice with conditional compilation flags. This might introduce compromises in code, when Linux-specific features are sacrificed for the sake of wasm-compatibility.
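A minimal sketch of what "implementing some functionality twice with conditional compilation flags" could look like. The helper name `default_num_threads` is hypothetical and not part of tantivy; the sketch assumes the wasm build stays single-threaded while other targets ask the standard library for the available parallelism.

```rust
/// Hypothetical helper: how many worker threads to use.
/// On wasm32 this sketch assumes no threads are available.
#[cfg(target_arch = "wasm32")]
pub fn default_num_threads() -> usize {
    1
}

/// On every other target, ask the standard library and fall back to 1.
#[cfg(not(target_arch = "wasm32"))]
pub fn default_num_threads() -> usize {
    std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1)
}

/// Shared code calls the helper without knowing which variant was compiled.
pub fn describe_platform() -> String {
    format!("indexing with {} thread(s)", default_num_threads())
}
```

Both variants share one signature, so a change to that signature has to be made twice; that is the maintenance cost this section describes.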

Lose system-specific performance gains.

Adopting an immature platform with little industry backing. The probability of wasm being abandoned is much greater than that of Linux.

CI infra/build times

Including the wasm target and dependencies (wasm-bindgen, wasm-pack) in every CI build will increase CI time. Since these dependencies are yet to stabilize, we run the risk of being guinea pigs for the bugs in wasm-bindgen, which might break our builds and introduce potentially indefinite delays.

Alternatives

Start a separate project under the tantivy organization. Only guarantee wasm compatibility at every git tag/release checkpoint. This will keep the same rate of feature development.

Unresolved questions

How to test wasm? Headless browser or in pure Rust?

Future possibilities

Extend tantivy wasm to support the IndexWriter trait. Enable building a wasm application that indexes uploaded files in the tantivy format in the browser. Users will be able to build an index in browser memory and download it to run on their server.

This will further extend tantivy wasm demo and allow users to build an index and run queries against it on client-side.

petr-tik commented 5 years ago

Keen to hear your thoughts @jonfk

fulmicoton commented 5 years ago

A couple of remarks.

If I recall correctly, the smallest wasm tantivy build I managed to get so far was 900KB and that involved a lot of cheating & trimming. Some of this work will affect external crates, like the stemming crate.

I haven't followed the state of the standardization of "WASI", but if we get an Mmap there, it might be a very nice use case outside of browser development.

This could be a very nice way to distribute tantivy-cli on any platform, in a very safe way. Possibly it could also be a way to make tantivy usable from different languages without too much sweat.

The people from wasmer were kind enough to actually link tantivy as a possible application of their tech.

I have no idea how syscalls work, so it would be nice to dig a little into that.

fulmicoton commented 5 years ago

According to this doc Mmap is not likely to be part of the standardized WASI any time soon.

That's a bummer.

On tantivy side, that will mean some heavy changes on the way we do io if we want to get compatible eventually.

petr-tik commented 5 years ago

Thanks for your notes. Below are my thoughts on the 3 specific topics.

size

I have grepped for stemmer and found occurrences in the tokenizer and query parser modules. I suggest we stick to the MVP, which allows reading the index without building one. This should only affect tokenising user queries, if I understand this correctly?

mmap

I am going off the wasm spec, rather than the WASI.

According to the wasm design doc mmap support is in the future ideas section. I am guessing the rust-wasm team will wait for it to stabilise and hopefully provide a wrapper in the mmap crate.

Either way you make an important point. For the foreseeable future, mmap calls won't be supported by wasm. This means we will need to change our internals and keep them compatible for the same index format across different platforms. If and when mmap support is added, we will need to add conditional compilation flags inside the core library to allow mmap on wasm and server-side.
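One way to picture "changing our internals" is a small storage trait that the read path depends on instead of mmap. This is only a sketch under assumptions: `ReadBytes`, `InMemoryFile` and `read_footer` are hypothetical names, not tantivy APIs. An mmap-backed implementation would sit next to the in-memory one behind a `#[cfg(not(target_arch = "wasm32"))]` attribute.

```rust
use std::io;
use std::ops::Range;

/// Hypothetical trait: the minimum the reader needs from a storage backend.
pub trait ReadBytes {
    fn len(&self) -> usize;
    fn read_range(&self, range: Range<usize>) -> io::Result<Vec<u8>>;
}

/// Backend that works on every target, wasm included: the file lives in RAM.
pub struct InMemoryFile {
    data: Vec<u8>,
}

impl InMemoryFile {
    pub fn new(data: Vec<u8>) -> Self {
        Self { data }
    }
}

impl ReadBytes for InMemoryFile {
    fn len(&self) -> usize {
        self.data.len()
    }

    fn read_range(&self, range: Range<usize>) -> io::Result<Vec<u8>> {
        // `get` returns None for out-of-bounds or inverted ranges.
        self.data
            .get(range)
            .map(|slice| slice.to_vec())
            .ok_or_else(|| io::Error::new(io::ErrorKind::UnexpectedEof, "range out of bounds"))
    }
}

/// Code above the storage layer only sees the trait.
pub fn read_footer<R: ReadBytes>(file: &R, footer_len: usize) -> io::Result<Vec<u8>> {
    let end = file.len();
    let start = end.saturating_sub(footer_len);
    file.read_range(start..end)
}
```

Because the index format is read through the trait, the same bytes can be served from memory, a mapped file or a network fetch without changing the format itself.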

distribution as a wasi binary for other languages to use

I didn't think of it. My original suggestion was to build tantivy wasm for browsers. However, you are right to draw attention to the potential of wasmer and other wasm runtimes that can help us. They will provide the wrappers and integrations of wasm binaries with CPython, Java. This will save us time, while enabling more developers to embed tantivy across different applications.

Overall

I prefer to under-promise and over-deliver rather than the reverse. If wasm is hard to manage and makes it hard to work on server-side features, I would not accept this. In my opinion, promising wasm support now and ignoring or removing it later seems more damaging to tantivy than not promising it at all.

I am still curious to get more thoughts on pain points/costs/disadvantages there might be of adding wasm as a first-class target. It feels like a big commitment for development, toolchain inclusion and CI/ops provisioning.

maufl commented 4 years ago

This would also make it easy to use tantivy in WebExtensions for Firefox/Chrome or MailExtensions in Thunderbird.

BasixKOR commented 4 years ago

Is this RFC about having a .wasm binary for this crate, or having a WASM target-compatible library that serves as other library's dependency?

If the latter is the choice, there is a very simple guide in the WebAssembly Book that might help.

BasixKOR commented 4 years ago

I found that this crate can be built to WASM target with wasm-bindgen feature enabled and default features (mmap) disabled. (Probably intended)

So I think it could be used like this:

[dependencies]
tantivy = { version = "0.11.3", default-features = false, features = ["wasm-bindgen"] }

But I cannot be sure, since the current test suite relies on mmap, as do many of tantivy's APIs.

urbien commented 4 years ago

I did not look deeper, but could this mmap in wasi-libc help: https://github.com/WebAssembly/wasi-libc/pull/197

fulmicoton commented 4 years ago

@urbien This part is probably more relevant. https://github.com/WebAssembly/wasi-libc/blob/3e9892fc41dd83fa66f192f963f54c72fb8321c1/libc-bottom-half/mman/mman.c

So the WASI-libc does add support for mmap, but it will read and load the entire file in anonymous memory. That's kind of lame but that should be ok for ppl who want to try it out.
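The behaviour described here (the whole file loaded into anonymous memory) can be imitated in plain Rust. The sketch below is not the wasi-libc code; `EagerMap` is a hypothetical name, and it simply reads the file into a `Vec<u8>` and exposes it as a byte slice.

```rust
use std::fs;
use std::io;
use std::ops::Deref;
use std::path::Path;

/// Hypothetical stand-in for a memory map: the whole file copied into heap memory.
pub struct EagerMap {
    bytes: Vec<u8>,
}

impl EagerMap {
    /// Reads the entire file up front, so memory use equals the file size.
    pub fn open(path: &Path) -> io::Result<Self> {
        Ok(Self { bytes: fs::read(path)? })
    }
}

/// Deref to a byte slice so callers can index it like a real mapping.
impl Deref for EagerMap {
    type Target = [u8];

    fn deref(&self) -> &[u8] {
        &self.bytes
    }
}
```

This is fine for trying things out on small indexes, but it gives up the lazy paging that makes mmap attractive for large ones.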

ngbrown commented 3 years ago

I came here looking for a static web site search engine. Think Wikipedia size data with all the files pre-processed and search indexes laid down on disk for static serving. A WASM module in the browser would read the inverted index files as needed to execute the search at hand.

I'm not familiar with the layout of your index files, but I wouldn't want each individual file too big or too small, and a single search shouldn't need too many different files to complete, since each "seek" could be another network fetch.
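To make the "each seek could be another network fetch" concern concrete, here is a small sketch that maps a byte range of an index file to the fixed-size chunks that would have to be fetched. The 64 KiB `CHUNK_SIZE` and the function name `chunks_for` are assumptions for illustration; nothing here reflects tantivy's actual file layout.

```rust
use std::ops::Range;

/// Hypothetical chunk size; the thread does not settle on a number.
pub const CHUNK_SIZE: u64 = 64 * 1024;

/// Returns the indices of the fixed-size chunks that cover `bytes`.
/// Each index would map to one HTTP range request or one static file.
pub fn chunks_for(bytes: Range<u64>) -> Range<u64> {
    if bytes.start >= bytes.end {
        return 0..0; // an empty read touches no chunk
    }
    let first = bytes.start / CHUNK_SIZE;
    let last = (bytes.end - 1) / CHUNK_SIZE;
    first..last + 1
}
```

A read that straddles a chunk boundary, such as `chunks_for(65_535..65_537)`, costs two fetches even though it touches only two bytes. That is why the chunk size and the locality of the index layout both matter.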

A website archived to content-addressable storage that wants to include search would need everything pre-computed at build and upload time. Keeping the content unchangeable and free of backend infrastructure enables reliable sneakernet movement of these archives and usage across air-gapped networks.

fulmicoton commented 3 years ago

@ngbrown Do you have a specific use case?

ngbrown commented 3 years ago

One very specific example (though not the first use of static search I've thought of) is the need for an air-gapped Wikipedia, because some countries now block it. IPFS has a project that provides this (https://blog.ipfs.io/24-uncensorable-wikipedia/), along with several methods to transport the snapshot across the network block. One option is pinning on nodes that have connectivity to both partitions of the network; another is to copy an entire snapshot package to a USB drive (https://dweb-primer.ipfs.io/avenues-for-access/sneakernets).

If you check out their snapshot, you'll see it has no search. Dealing with Wikipedia without search is less than ideal. As far as I know, there's no working solution for static search on sites this big, because the indexes would be too large to download just for one user's needs. JavaScript and WASM are allowed as part of this snapshot, which is why I thought a full search engine like Tantivy could be leveraged.

fulmicoton commented 3 years ago

@ngbrown Searching static websites is common, but they are usually small enough that JavaScript libraries do OK.

I have compiled tantivy to WASM before, but the resulting file is at least around 1MB (more with stemming), so it is not really worth doing for a simple use case.

Wikipedia on a USB key is interesting. For different reasons, the current version of tantivy is not great for this use case, however... but we (at Quickwit) are currently working on a version that should be up for the task.

ngbrown commented 3 years ago

I've used the in-memory JavaScript search solutions for the smaller use cases (think blog or help documentation). So they do work, but I don't know of one that partitions the indexes and only loads segments on demand for the really big cases.

I really don't think a 1MB WASM file would be a big deal compared to the size of the indexes for a big site like Wikipedia. Microsoft is downloading multiples of that to get .NET running in WASM.

Do you have any more information on this alternative?

The good news is that this use case doesn't need backwards compatible files. They would just get re-built each time, for each version.

HKalbasi commented 3 years ago

I think this blocks https://github.com/matrix-org/seshat/issues/84 which is useful in the Element Matrix web client. Is it possible to remove the non-wasm-compatible parts (I see mmap in this issue) via some cfg?

phiresky commented 3 years ago

I've managed to make tantivy work fully in the browser with databases of arbitrary size (tested with a 14GB database and it works great).

The code and demo is here: https://github.com/tantivy-search/tantivy/pull/1067

It has a demo of Wikipedia full-text search that could be put on IPFS, as mentioned by @ngbrown

aguynamedben commented 3 years ago

@phiresky That's awesome. I'll check out your proof of concept, thanks for sharing it. With Tantivy's use of mmap for storing the index, what does memory usage look like to the end user on your proof of concept? (i.e. in Activity Monitor).

(warning: I'm not well-versed in mmap'd files, so there may be incorrect assumptions in the following question...)

Is there a way to tell the OS to limit resident memory usage (rss) with an mmap'd file? I want to use Tantivy via WASM for a large client-side index, but I'm concerned the end user will perceive a bunch of memory usage (i.e. in Activity Monitor) when, in practice, the OS should manage the mmap'd file's resident memory usage as it pleases.

phiresky commented 3 years ago

With Tantivy's use of mmap for storing the index, what does memory usage look like to the end user on your proof of concept?

In my POC I replace memory mapping with a manual "page cache", basically a replacement for what the OS does when memory mapping. With memory mapping, the OS chooses how much of the file to keep in memory based on how much RAM you have, and automatically evicts it when other stuff needs the space. We can't really make this choice automatically in the browser, since we can't know what other programs on the computer need.

So you can basically choose whatever memory usage you want. You actually have to in my POC, since otherwise it will cache everything forever. So it would probably need an LRU system with a fixed limit on memory usage. Depending on your needs you might want to cache e.g. the whole .fieldnorm files and fetch everything else on demand.
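A rough sketch of the "LRU system with a fixed limit on memory usage" mentioned above. This is not the code from the POC; `PageCache` and `get_or_fetch` are hypothetical names, pages are keyed by page number, and the `fetch` closure stands in for whatever actually loads the bytes (for example a network read).

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical page cache keyed by page number, capped at `capacity` pages.
pub struct PageCache {
    capacity: usize,
    pages: HashMap<u64, Vec<u8>>,
    /// Least recently used page id sits at the front.
    order: VecDeque<u64>,
}

impl PageCache {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be at least one page");
        Self { capacity, pages: HashMap::new(), order: VecDeque::new() }
    }

    /// Returns the page, calling `fetch` only on a cache miss.
    pub fn get_or_fetch(&mut self, page: u64, fetch: impl FnOnce(u64) -> Vec<u8>) -> &[u8] {
        if self.pages.contains_key(&page) {
            // Hit: drop the old position; the page is re-appended below.
            self.order.retain(|&p| p != page);
        } else {
            // Miss: evict the least recently used page if the cache is full.
            if self.pages.len() == self.capacity {
                if let Some(oldest) = self.order.pop_front() {
                    self.pages.remove(&oldest);
                }
            }
            self.pages.insert(page, fetch(page));
        }
        self.order.push_back(page);
        &self.pages[&page]
    }

    pub fn contains(&self, page: u64) -> bool {
        self.pages.contains_key(&page)
    }
}
```

The `retain` call makes a hit linear in the number of cached pages, which is acceptable for a sketch. Pinning selected files permanently, such as the .fieldnorm files mentioned above, would need a separate set of pages that is never evicted.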

In my tests, normal queries fetch around 1-10MB of data, which is then the same as the memory usage (except for internal data structures, but those shouldn't be very large). So it really shouldn't be much more than a normal website.

aguynamedben commented 1 year ago

Thanks for providing Tantivy and keeping this ticket open.

Even though I know it's not supported, I tried and seemed to get close, but can't get Tantivy working in WASM even though it compiles. I get a panic when actually calling Tantivy code in a WASM context. I know this isn't supported yet, so I'm not surprised it doesn't work, but if anybody has any quick pointers on hacks/workarounds the code is below.

My strategy was:

I think my WASM/Rust/JS is right. I think my Tantivy code is right. I told Tantivy I only care about in-memory index. But still can't get the combination of technologies working.

The update from 0.14 here makes it seem like the underlying data storage Tantivy uses is getting closer to some form of in-memory + WASM capability, but I'm not sure if it can work in any WASM environment yet (even if limited?).

Is the underlying issue with WASM the data storage aspect? For me, I want the Lucene-like capabilities, stemming, etc. but am okay with an in-memory index.


The panic, as close as I can get!

[image: screenshot of the panic]

Cargo.toml

[package]
# search client!
name = "sc"
version = "0.1.0"
edition = "2021"
default-run = "sc"

[lib]
crate-type = ["cdylib"]

[dependencies]
anyhow = "1.0"
fake = { version = "2.6.1", features=['derive'] }
num-format = "0.4.4"
rand = "0.8.5" # required by fake
tantivy = { version = "0.19.2", default-features = false }
uuid = { version = "1.3.3", features = ["v4"] } # "v4" enables Uuid::new_v4
wasm-bindgen = "0.2.86"
# needed for WASM
getrandom = { version = "0.2.2", features = ["js"] }
web-sys = { version = "0.3.63", features = ['console'] }

[target.'cfg(target_arch = "wasm32")'.dependencies]
console_error_panic_hook = "0.1.6"

src/lib.rs

use wasm_bindgen::prelude::*;

mod logging;
mod search_index;
mod test_data;

#[wasm_bindgen]
extern "C" {
    fn get_js_name() -> JsValue;
}

#[wasm_bindgen]
pub fn get_rust_name() -> String {
    "BenRs".to_string()
}

#[wasm_bindgen(start)]
pub fn start() {
    let js_name = get_js_name();
    let js_name_string = js_name.as_string().unwrap();
    println!("Hello {}, you wild alien from JavaScript!", js_name_string);

    println!("test search");
    let goofy_string = search_index::test_two();
    search_index::test();
    println!("test search: done, string is {}", goofy_string);
}

src/test_data/mod.rs

use fake::Fake;

// using `faker` module with locales
use fake::faker::name::raw::*;
use fake::locales::*;

pub fn fake_names(count: usize) -> Vec<String> {
    (0..count).map(|_| Name(EN).fake()).collect()
}

src/search_index/mod.rs

use std::time::Instant;
use num_format::{Locale, ToFormattedString};
use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::*;
use tantivy::{doc, Index, ReloadPolicy};
use uuid::Uuid;
use crate::test_data::fake_names;

const NUM_TEST_RECORDS: usize = 100_000;
// const NUM_TEST_RECORDS: usize = 1_000_000;
// const NUM_TEST_RECORDS: usize = 1;
const TEST_QUERY: &str = "standefer";
const DEBUG: bool = false;

#[derive(Debug)]
struct TestDoc {
    uuid: Uuid,
    name: String,
}

pub fn test_two() -> String {
    println!("Getting a string from within search_index module");
    "hi from search_index!".to_string()
}

pub fn test() -> tantivy::Result<()> {
    let mut now: Instant;

    // Define schema
    let mut schema_builder = Schema::builder();
    schema_builder.add_u64_field("uuid_hi", STORED);
    schema_builder.add_u64_field("uuid_lo", STORED);
    schema_builder.add_text_field("name", TEXT | STORED);
    let schema = schema_builder.build();

    // Create the index in RAM; nothing is written to a directory on disk
    let index = Index::create_in_ram(schema.clone());

    // Get fake data
    now = Instant::now();
    let mut names = fake_names(NUM_TEST_RECORDS);
    names.push("Ben Standefer".to_string());
    let docs: Vec<TestDoc> = names.into_iter().map(|name| {
        TestDoc {
            uuid: Uuid::new_v4(),
            name: name.to_string(),
        }
    }).collect();
    if DEBUG {
        for doc in &docs {
            println!("{:?}", doc);
        }
    }
    println!("Generating data took: {:?} ({} records)", now.elapsed(), NUM_TEST_RECORDS.to_formatted_string(&Locale::en));

    // Write
    now = Instant::now();
    let mut index_writer = index.writer(50_000_000)?;
    let uuid_hi_field = schema.get_field("uuid_hi").unwrap();
    let uuid_lo_field = schema.get_field("uuid_lo").unwrap();
    let name_field = schema.get_field("name").unwrap();
    for doc in &docs {
        let (uuid_hi, uuid_lo) = doc.uuid.as_u64_pair();
        index_writer.add_document(doc!(
            uuid_hi_field => uuid_hi,
            uuid_lo_field => uuid_lo,
            name_field => doc.name.to_string(),
        ))?;

    }
    index_writer.commit()?;
    println!("Indexing data took: {:?}", now.elapsed());

    // Search
    now = Instant::now();
    let reader = index
        .reader_builder()
        .reload_policy(ReloadPolicy::OnCommit)
        .try_into()?;
    let searcher = reader.searcher();
    let query_parser = QueryParser::for_index(&index, vec![name_field]);
    let query = query_parser.parse_query(TEST_QUERY)?;
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
    for (_score, doc_address) in top_docs {
        let retrieved_doc = searcher.doc(doc_address)?;
        println!("{}", schema.to_json(&retrieved_doc));
        println!("{}", Uuid::from_u64_pair(
            retrieved_doc.get_first(uuid_hi_field).unwrap().as_u64().unwrap(),
            retrieved_doc.get_first(uuid_lo_field).unwrap().as_u64().unwrap(),
        ));
    }
    println!("Search took: {:?}", now.elapsed());

    Ok(())
}

src/logging.rs

#[cfg(target_arch = "wasm32")]
#[macro_export]
macro_rules! println {
    ($($arg:tt)*) => (web_sys::console::log_1(&format!($($arg)*).into()))
}

#[cfg(not(target_arch = "wasm32"))]
#[macro_export]
macro_rules! println {
    ($($arg:tt)*) => (std::println!($($arg)*))
}

src/target_test.rs

pub fn test() {
    #[cfg(target_arch = "wasm32")]
    {
        println!("target test: wasm32");
    }

    #[cfg(not(target_arch = "wasm32"))]
    {
        println!("target test: NOT wasm32");
    }
}

src/main.rs

use anyhow::Result;

mod logging;
mod target_test;
mod search_index;
mod test_data;

fn main() -> Result<()> {
    target_test::test();

    println!("Let's do this!");
    search_index::test()?;

    Ok(())
}

index.html

<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
    <title>Hello from Rust!</title>
    <script type="module">
      import init, { greet } from './sc.js';

      async function run() {
        await init('./sc_bg.wasm');
        console.log(greet("World"));
      }

      run();
    </script>
</head>
<body></body>
</html>

index.js

<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
    <title>Hello from Rust!</title>
    <script type="module">
      import init, { greet } from './sc.js';

      async function run() {
        await init('./sc_bg.wasm');
        console.log(greet("World"));
      }

      run();
    </script>
</head>
<body></body>
</html>

package.json

{
  "scripts": {
    "build": "webpack",
    "serve": "webpack serve"
  },
  "devDependencies": {
    "@babel/core": "^7.22.1",
    "@babel/preset-env": "^7.22.4",
    "@wasm-tool/wasm-pack-plugin": "1.5.0",
    "babel-loader": "^9.1.2",
    "html-webpack-plugin": "^5.3.2",
    "source-map-loader": "^4.0.1",
    "text-encoding": "^0.7.0",
    "webpack": "^5.49.0",
    "webpack-cli": "^4.7.2",
    "webpack-dev-server": "^3.11.2"
  }
}

.babelrc

{
  "presets": ["@babel/preset-env"]
}

webpack.config.js

const path = require('path');
const HtmlWebpackPlugin = require('html-webpack-plugin');
const webpack = require('webpack');
const WasmPackPlugin = require("@wasm-tool/wasm-pack-plugin");

module.exports = {
  entry: './index.js',
  output: {
    path: path.resolve(__dirname, 'dist'),
    filename: 'index.js',
  },
  plugins: [
    new HtmlWebpackPlugin(),
    new WasmPackPlugin({
      crateDirectory: path.resolve(__dirname, ".")
    }),
    // Have this example work in Edge which doesn't ship `TextEncoder` or
    // `TextDecoder` at this time.
    new webpack.ProvidePlugin({
      TextDecoder: ['text-encoding', 'TextDecoder'],
      TextEncoder: ['text-encoding', 'TextEncoder']
    })
  ],
  module: {
    rules: [
      {
        test: /\.js$/,
        exclude: /node_modules/,
        use: {
          loader: 'babel-loader',
          options: {
            presets: ['@babel/preset-env'],
          },
        },
      },
      {
        test: /\.js$/,
        enforce: "pre",
        use: ["source-map-loader"],
      },
    ],
  },
  mode: 'development',
  experiments: {
    asyncWebAssembly: true
  }
};

bushuyev commented 1 year ago

The panic, as close as I can get!

Hi, it looks like the error comes from Instant, which does not support wasm; see https://internals.rust-lang.org/t/is-std-instant-on-webassembly-possible/18913
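Assuming the panic really does come from `Instant::now()` on the `wasm32-unknown-unknown` target, one possible workaround is to hide the clock behind a small type that is compiled differently per target. `Stopwatch` and `report` are hypothetical names for this sketch; on the wasm target it reports no timing at all rather than calling into the browser.

```rust
// Native (and WASI-like) targets: use the real monotonic clock.
#[cfg(not(all(target_arch = "wasm32", target_os = "unknown")))]
mod clock {
    use std::time::{Duration, Instant};

    pub struct Stopwatch(Instant);

    impl Stopwatch {
        pub fn start() -> Self {
            Stopwatch(Instant::now())
        }

        pub fn elapsed(&self) -> Option<Duration> {
            Some(self.0.elapsed())
        }
    }
}

// wasm32-unknown-unknown: never touch `Instant`, so nothing can panic.
#[cfg(all(target_arch = "wasm32", target_os = "unknown"))]
mod clock {
    use std::time::Duration;

    pub struct Stopwatch;

    impl Stopwatch {
        pub fn start() -> Self {
            Stopwatch
        }

        pub fn elapsed(&self) -> Option<Duration> {
            None
        }
    }
}

pub use clock::Stopwatch;

/// Formats a timing line in the style of the `println!` calls in search_index.
pub fn report(label: &str, watch: &Stopwatch) -> String {
    match watch.elapsed() {
        Some(d) => format!("{label} took: {d:?}"),
        None => format!("{label} took: (no clock on this target)"),
    }
}
```

In `search_index::test`, `now = Instant::now()` would become `let watch = Stopwatch::start();` and each timing line would go through `report`. A browser build that needs real timings would have to read a JavaScript clock instead, which this sketch deliberately leaves out.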

PSeitz commented 1 year ago

You may want to check out wasix https://wasmer.io/posts/announcing-wasix, they also got a version of tantivy working https://github.com/wasix-org/tantivy/tree/wasix

GeeWee commented 1 year ago

Whether or not Instant works seems to depend on your WASM environment. I have managed to get tantivy (and Instant) to work in a Fastly Compute@Edge environment, but with a lot of caveats and sharp edges.

ppodolsky commented 1 year ago

@aguynamedben You can also check out this Tantivy fork and the search server based on it: https://github.com/izihawa/summa

It is patched for WASM (and working there perfectly, at least for reads):

Also, I made some small patches for my use case: parallelized compression of .store, support for custom attributes on segments and on the entire index, and a random collector for subsampling.

Both the fork and the search server have been in production for a long time, and I'm keeping them in line with the Tantivy master branch.

ppodolsky commented 1 year ago

Here is the WASM package https://github.com/izihawa/summa/tree/master/summa-wasm, which works for me in all browsers, including Safari on iOS. I never prepared it for public use, so it lacks any documentation except for several indirectly related articles on the blog: 1 and 2

Anyway, you can look at how it is done. It uses a ThreadPool based on WebWorkers to parallelize the search load. Together with async code patches and careful usage (e.g. using hotcache and not using fieldnorms), it works very fast on multi-segment indices and multi-term queries, even if the index lives on the network.
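The fan-out and merge idea behind searching several segments in parallel can be sketched with native threads. This is not summa's WebWorker-based ThreadPool and not tantivy's API: `Hit`, `search_segment` and `search_all` are hypothetical, segments are plain in-memory lists of `(doc id, text)`, and the "score" is just the number of term occurrences.

```rust
use std::thread;

/// Hypothetical per-segment hit: (score, doc id). Not a tantivy type.
pub type Hit = (u32, u32);

/// Stand-in for searching one segment: keep docs whose text contains `term`.
fn search_segment(segment: &[(u32, &str)], term: &str) -> Vec<Hit> {
    segment
        .iter()
        .filter(|(_, text)| text.contains(term))
        .map(|(doc, text)| (text.matches(term).count() as u32, *doc))
        .collect()
}

/// Searches every segment on its own thread, then merges the top `k` hits.
pub fn search_all(segments: &[Vec<(u32, &str)>], term: &str, k: usize) -> Vec<Hit> {
    let mut hits: Vec<Hit> = thread::scope(|scope| {
        // Fan out: one scoped thread per segment.
        let handles: Vec<_> = segments
            .iter()
            .map(|seg| scope.spawn(move || search_segment(seg, term)))
            .collect();
        // Gather: wait for every thread and flatten the per-segment hits.
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    });
    // Highest score first; ties broken by ascending doc id.
    hits.sort_by(|a, b| b.0.cmp(&a.0).then(a.1.cmp(&b.1)));
    hits.truncate(k);
    hits
}
```

In a browser the per-segment work would be posted to workers instead of spawned threads, but the merge step stays the same.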