I've confirmed that using Datasette would require converting/exporting the hyperbee into SQLite. Depending on how easy Datasette is to use and the value-add, this exporting could still be worth it.
Doing this requires adding indexing etc. for each hyperbee instance.
We will not be using Datasette.
Let's focus on (1) and (2), and get @rangermauve to help with this later this month.
A search index can be created by putting the searched-for property as a key, and CID(s) as the value. Some questions for this setup:
Index keys should probably start with `index/` to namespace them.
Unrelated, but searching should be done server-side, not by having the client load the full index. The client should send the query ("all CIDs in this time range") and the server will return the result. Combined queries (like for lat and lon) should be possible for efficiency.
There are only two options for searching keys: `db.get` for when you know exactly what key you want (`index/city/toronto` maybe), or `db.createReadStream` for getting key-value pairs where the key bytes are within a certain range. Key encoding is important for the second option. Keys in the index must be encoded so that values that are adjacent semantically are also adjacent when converted into bytes and sorted into a list with other keys, i.e. we "preserve byte order".
Examples:
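(A hedged sketch, since the exact scheme isn't decided here: fixed-width big-endian integers preserve byte order, while decimal strings don't.)

```js
// Illustrative: integer keys encoded fixed-width big-endian sort in
// numeric order as raw bytes; decimal strings do not.
function encodeUint64(n) {
  const buf = Buffer.alloc(8)
  buf.writeBigUInt64BE(BigInt(n))
  return buf
}

// As strings, "10" sorts before "9" (wrong semantic order):
Buffer.from('10').compare(Buffer.from('9')) // -1
// As fixed-width big-endian bytes, 9 sorts before 10 (correct order):
encodeUint64(9).compare(encodeUint64(10))   // -1
```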
That should cover range searches. The remaining questions are how to normalize and set certain keys. For example, should `New York` become `new york` or `new_york`? Where will the author value come from? Etc.
Keys must be unique, so should a sorted array of matching CIDs be stored at each property then?
Yes. Generally I include the unique ID in the key as well as in the value. You could potentially just store it in the key, have something empty for the value, and then parse the CID from the key. This is also why using CBOR has been useful for serializing to and from the key.
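A rough sketch of that pattern (the library choice and helper names are assumptions, not from this thread):

```js
import { encode, decode } from 'cborg' // library choice is an assumption

// The key carries the indexed value AND the unique ID (CID), so keys stay
// unique and the CID can be parsed back out of the key; the stored value
// can then be empty.
function makeIndexKey(indexName, value, cid) {
  return encode([indexName, value, cid.toString()])
}

function parseIndexKey(keyBytes) {
  const [indexName, value, cid] = decode(keyBytes)
  return { indexName, value, cid }
}

// await db.put(makeIndexKey('city', 'toronto', cid), '')
```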
Should the index entries also be treated as attestations with signing and timestamping?
Agreed with your points.
Also, starting with `index/` is a good idea. I usually create a name for the index based on which fields it's indexing over, and lately I've been using just `i/` to save on key size. The `/` should ideally be the byte `0x00`, so that you can search for everything at that index by doing `{start: 0x00 i 0x00, end: 0x00 i 0xFF}`. This is a common technique used in the leveldb ecosystem.
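A hedged sketch of that technique against a Hyperbee `db` with binary key encoding (helper name is made up):

```js
// Keys look like: 0x00 'i' 0x00 <index name> 0x00 <rest of key>.
// 0x00 sorts before every other byte and 0xFF after, so a range scan
// between them covers everything under the prefix.
function indexPrefix(name) {
  return Buffer.concat([
    Buffer.from([0x00]), Buffer.from('i'),
    Buffer.from([0x00]), Buffer.from(name),
    Buffer.from([0x00]),
  ])
}

const prefix = indexPrefix('city')
const stream = db.createReadStream({
  gte: prefix, // 0x00 i 0x00 city 0x00
  lt: Buffer.concat([prefix.subarray(0, -1), Buffer.from([0xff])]), // 0x00 i 0x00 city 0xFF
})
for await (const { key, value } of stream) {
  // every entry in the "city" index passes through here
}
```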
> …but searching should be done server-side
Personally, I like doing searching client-side so that verifiability can be done as you stream data over the network. Also, you don't need to load the entire index; you just need to touch a few of the nodes in the B-tree to get to the index start and run an iterator from there.
Might be good to do some benchmarking to test assumptions there though.
One thing about server-side search is you now need to trust that it isn't omitting results and need to have some way of proving that it performed the search the way you expected (I guess this is optional depending on your threat model).
> Key encoding is important for the second option
Yeah, encoding keys is the tricky bit, and all the things you mentioned are the right things to consider. Generally, starting with ints and converting timestamps/dates to ints is a good place to start. I'm not yet up to date on what the best way to encode floats is.
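For floats, one widely used technique (just a known approach, not a decision from this thread) is to reinterpret the float64 bits as an unsigned 64-bit integer and flip bits so that unsigned byte comparison matches numeric comparison. A sketch:

```js
// Order-preserving float64 encoding: flip the sign bit for positives, or
// ALL bits for negatives (negative floats otherwise sort backwards).
function encodeFloat64(n) {
  const buf = Buffer.alloc(8)
  buf.writeDoubleBE(n)
  let bits = buf.readBigUInt64BE()
  bits = bits & 0x8000000000000000n
    ? ~bits & 0xffffffffffffffffn  // negative: invert everything
    : bits | 0x8000000000000000n   // positive: set the sign bit
  buf.writeBigUInt64BE(bits)
  return buf
}

// Timestamps can simply be converted to integers first, e.g. milliseconds:
// encodeFloat64(Date.parse('2023-01-01T00:00:00Z'))
```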
For spatial data, we should create a quadkey-based index instead of lat/lon, since it's easier to perform AABB (axis-aligned bounding box) queries on top of. link
For normalizing text values, that's something that should be defined at the application level via schemas. There are lots of ways to go about it, so deciding which you want for a given use case and normalizing any data before ingesting it into the DB would be important.
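Purely for illustration, one possible normalization rule (the actual rule would live in the application schema):

```js
// Illustrative only: lowercase and replace whitespace runs with underscores.
function normalizeText(value) {
  return value.trim().toLowerCase().replace(/\s+/g, '_')
}
normalizeText('New York') // 'new_york'
```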
Mauve recommends putting the CID into the actual index key, creating an entry for every CID with matching properties that way. The actual value of the index key can be empty, or the CID again.
We've agreed for now continuing on the client side makes sense, and using `/` is ok.
Regarding quadkeys, it will definitely add a lot of complexity that might not be necessary. It's not just a conversion from coordinates into quadkeys, but a longer process: latlon -> pixelXY -> tileXY -> quadkey, repeated for different "level of detail" values. I wonder if simply using two separate float indexes on lat and lon might work better for our more basic use cases.
Edit: Ben confirmed we are not doing this for now
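For reference, a sketch of that full pipeline (the standard Web Mercator / Bing Maps tile math, shown here just to illustrate the complexity; we are not adopting it for now):

```js
// latlon -> pixelXY -> tileXY -> quadkey, for one level of detail.
function latLonToQuadkey(lat, lon, level) {
  const size = 256 * 2 ** level
  const sinLat = Math.sin((lat * Math.PI) / 180)
  // latlon -> pixelXY (Web Mercator projection)
  const pixelX = ((lon + 180) / 360) * size
  const pixelY = (0.5 - Math.log((1 + sinLat) / (1 - sinLat)) / (4 * Math.PI)) * size
  // pixelXY -> tileXY (256px tiles)
  const tileX = Math.floor(pixelX / 256)
  const tileY = Math.floor(pixelY / 256)
  // tileXY -> quadkey (interleave the tile coordinate bits)
  let quadkey = ''
  for (let i = level; i > 0; i--) {
    let digit = 0
    const mask = 1 << (i - 1)
    if (tileX & mask) digit += 1
    if (tileY & mask) digit += 2
    quadkey += String(digit)
  }
  return quadkey
}
```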
To clarify: the index key format is `i/<index name>/<index value>/<matching cid>`. For example: `i/city/toronto/bafy...`
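A minimal sketch of writing and querying entries in this format, assuming a Hyperbee `db` (helper names are made up):

```js
async function addToIndex(db, name, value, cid) {
  // Value left empty: the CID is recoverable from the key itself.
  await db.put(`i/${name}/${value}/${cid}`, '')
}

async function queryIndex(db, name, value) {
  const prefix = `i/${name}/${value}/`
  const cids = []
  const stream = db.createReadStream({
    gte: Buffer.from(prefix),
    lt: Buffer.concat([Buffer.from(prefix), Buffer.from([0xff])]),
  })
  for await (const { key } of stream) {
    cids.push(key.toString().slice(prefix.length))
  }
  return cids
}

// await addToIndex(db, 'city', 'toronto', 'bafy...')
// await queryIndex(db, 'city', 'toronto') // => ['bafy...']
```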
Merged the `search` branch in 177d99cbf651b95a19f806545cf9e32ebf243218. This adds the indexing functions to the `demo2` branch, where work for actually creating the index will happen. This issue will be closed once search functionality actually exists in that branch.
One remaining question (low priority, not needed for the demo) is what fields we want to index on in the import script. When importing integrity pipeline ZIPs, what indexes do we want to create by default? Of course, they can also be added later.
What fields do y'all have so far? Usually indexing by date, and by "author" or "type" if there is heterogeneous data, is a good start.
Note: one remaining issue with search is that even though DAG-CBOR encodes integers and floats differently, there is no way to tell in JavaScript which is which. This either requires forking the decoder library, or just indexing everything as floats. The safe range of integers for float64 is -(2^53 - 1) to 2^53 - 1 (~9 quadrillion), which should be good enough for our purposes.
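That bound can be checked directly in JavaScript:

```js
Number.MAX_SAFE_INTEGER === 2 ** 53 - 1 // true
Number.isSafeInteger(2 ** 53 - 1)       // true
Number.isSafeInteger(2 ** 53)           // false: 2**53 === 2**53 + 1 as float64
```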
Edit: see starlinglab/uwazi-aa#4
Basic search functionality (just direct matching) exists in starlinglab/authenticated-attributes#46 and has for a while. Closing this issue, and new issues can be made in the future for specific search features or improvements.
This comes from MVK9.
The user may want to search across their subscribed hyperbees for any data published by a specific author, any data mentioning a certain keyword, all archives of a specific domain, etc.
This reminds me a bit of Datasette; maybe there is a way we can use that as an offline explorer for your databases. That would require adding custom uwazi-hyperbee support, or exporting the hyperbee databases into a format Datasette supports.