I've confirmed that using Datasette would require converting/exporting the hyperbee into SQLite. Depending on how easy Datasette is to use and the value-add, this exporting could still be worth it.
Doing this requires adding indexing etc. for each hyperbee instance.
We will not be using Datasette.
Let's focus on (1) and (2), and get @rangermauve to help with this later this month.
A search index can be created by putting the searched-for property as a key, and CID(s) as the value. Some questions for this setup:
Index keys should probably start with `index/` to namespace them.
Unrelated, but searching should be done server-side, not by having the client load the full index. The client should send the query ("all CIDs in this time range") and the server will return the result. Combined queries (like for lat and lon) should be possible for efficiency.
There are only two options for searching keys: `db.get` for when you know exactly what key you want (`index/city/toronto` maybe), or `db.createReadStream` for getting key-value pairs where the key bytes are within a certain range. Key encoding is important for the second option. Keys in the index must be encoded so that values that are adjacent semantically are also adjacent when converted into bytes and sorted into a list with other keys, i.e. we "preserve byte order".
Examples:
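(A hedged sketch, since the exact scheme isn't decided here: fixed-width big-endian integers preserve byte order, while decimal strings don't.)

```js
// Illustrative: integer keys encoded fixed-width big-endian sort in
// numeric order as raw bytes; decimal strings do not.
function encodeUint64(n) {
  const buf = Buffer.alloc(8)
  buf.writeBigUInt64BE(BigInt(n))
  return buf
}

// As strings, "10" sorts before "9" (wrong semantic order):
Buffer.from('10').compare(Buffer.from('9')) // -1
// As fixed-width big-endian bytes, 9 sorts before 10 (correct order):
encodeUint64(9).compare(encodeUint64(10))   // -1
```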
That should cover range searches. The remaining questions are how to normalize and set certain keys. For example, should `New York` become `new york` or `new_york`? Where will the author value come from? Etc.
Keys must be unique, so should a sorted array of matching CIDs be stored at each property then?
Yes. Generally I include the unique ID in the key as well as in the value. You could potentially just store it in the key, have something empty for the value, and then parse the CID from the key. This is also why using CBOR has been useful for serializing to and from the key.
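A rough sketch of that pattern (the library choice and helper names are assumptions, not from this thread):

```js
import { encode, decode } from 'cborg' // library choice is an assumption

// The key carries the indexed value AND the unique ID (CID), so keys stay
// unique and the CID can be parsed back out of the key; the stored value
// can then be empty.
function makeIndexKey(indexName, value, cid) {
  return encode([indexName, value, cid.toString()])
}

function parseIndexKey(keyBytes) {
  const [indexName, value, cid] = decode(keyBytes)
  return { indexName, value, cid }
}

// await db.put(makeIndexKey('city', 'toronto', cid), '')
```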
Should the index entries also be treated as attestations with signing and timestamping?
Agreed with your points.
Also, starting with `index/` is a good idea. I usually create a name for the index based on which fields it's indexing over, and lately I've been using just `i/` to save on key size. The `/` should ideally be the byte `0x00`, so that you can search for everything at that index by doing `{start: 0x00 i 0x00, end: 0x00 i 0xFF}`. This is a common technique used in the leveldb ecosystem.
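A hedged sketch of that technique against a Hyperbee `db` with binary key encoding (helper name is made up):

```js
// Keys look like: 0x00 'i' 0x00 <index name> 0x00 <rest of key>.
// 0x00 sorts before every other byte and 0xFF after, so a range scan
// between them covers everything under the prefix.
function indexPrefix(name) {
  return Buffer.concat([
    Buffer.from([0x00]), Buffer.from('i'),
    Buffer.from([0x00]), Buffer.from(name),
    Buffer.from([0x00]),
  ])
}

const prefix = indexPrefix('city')
const stream = db.createReadStream({
  gte: prefix, // 0x00 i 0x00 city 0x00
  lt: Buffer.concat([prefix.subarray(0, -1), Buffer.from([0xff])]), // 0x00 i 0x00 city 0xFF
})
for await (const { key, value } of stream) {
  // every entry in the "city" index passes through here
}
```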
> …but searching should be done server-side
Personally, I like doing searching client-side so that verifiability can be done as you stream data over the network. Also, you don't need to load the entire index; you just need to touch a few of the nodes in the B-tree to get to the index start and run an iterator from there.
Might be good to do some benchmarking to test assumptions there though.
One thing about server-side search is you now need to trust that it isn't omitting results and need to have some way of proving that it performed the search the way you expected (I guess this is optional depending on your threat model).
> Key encoding is important for the second option
Yeah, encoding keys is the tricky bit, and all the things you mentioned are the right things to consider. Generally, starting with ints and converting timestamps/dates to ints is a good place to start. I'm not yet up to date on what the best way to encode floats is.
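For floats, one widely used technique (just a known approach, not a decision from this thread) is to reinterpret the float64 bits as an unsigned 64-bit integer and flip bits so that unsigned byte comparison matches numeric comparison. A sketch:

```js
// Order-preserving float64 encoding: flip the sign bit for positives, or
// ALL bits for negatives (negative floats otherwise sort backwards).
function encodeFloat64(n) {
  const buf = Buffer.alloc(8)
  buf.writeDoubleBE(n)
  let bits = buf.readBigUInt64BE()
  bits = bits & 0x8000000000000000n
    ? ~bits & 0xffffffffffffffffn  // negative: invert everything
    : bits | 0x8000000000000000n   // positive: set the sign bit
  buf.writeBigUInt64BE(bits)
  return buf
}

// Timestamps can simply be converted to integers first, e.g. milliseconds:
// encodeFloat64(Date.parse('2023-01-01T00:00:00Z'))
```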
For spatial data, we should create a quadkey-based index instead of lat/lon, since it's easier to perform AABB (axis-aligned bounding box) queries on top of. link
For normalizing text values, that's something that should be defined at the application level via schemas. There are lots of ways to go about it, so deciding which you want for a given use case and normalizing any data before ingesting it into the DB would be important.
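Purely for illustration, one possible normalization rule (the actual rule would live in the application schema):

```js
// Illustrative only: lowercase and replace whitespace runs with underscores.
function normalizeText(value) {
  return value.trim().toLowerCase().replace(/\s+/g, '_')
}
normalizeText('New York') // 'new_york'
```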
Mauve recommends putting the CID into the actual index key, creating an entry for every CID with matching properties that way. The actual value of the index key can be empty, or the CID again.
We've agreed for now continuing on the client side makes sense, and using `/` is ok.
Regarding quadkeys, it will definitely add a lot of complexity that might not be necessary. It's not just a conversion from coordinates into quadkeys, but a longer process: latlon -> pixelXY -> tileXY -> quadkey, repeated for different "level of detail" values. I wonder if simply using two separate float indexes on lat and lon might work better for our more basic use cases.
Edit: Ben confirmed we are not doing this for now
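For reference, a sketch of that full pipeline (the standard Web Mercator / Bing Maps tile math, shown here just to illustrate the complexity; we are not adopting it for now):

```js
// latlon -> pixelXY -> tileXY -> quadkey, for one level of detail.
function latLonToQuadkey(lat, lon, level) {
  const size = 256 * 2 ** level
  const sinLat = Math.sin((lat * Math.PI) / 180)
  // latlon -> pixelXY (Web Mercator projection)
  const pixelX = ((lon + 180) / 360) * size
  const pixelY = (0.5 - Math.log((1 + sinLat) / (1 - sinLat)) / (4 * Math.PI)) * size
  // pixelXY -> tileXY (256px tiles)
  const tileX = Math.floor(pixelX / 256)
  const tileY = Math.floor(pixelY / 256)
  // tileXY -> quadkey (interleave the tile coordinate bits)
  let quadkey = ''
  for (let i = level; i > 0; i--) {
    let digit = 0
    const mask = 1 << (i - 1)
    if (tileX & mask) digit += 1
    if (tileY & mask) digit += 2
    quadkey += String(digit)
  }
  return quadkey
}
```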
To clarify: the index key format is `i/<index name>/<index value>/<matching cid>`. For example: `i/city/toronto/bafy...`
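A minimal sketch of writing and querying entries in this format, assuming a Hyperbee `db` (helper names are made up):

```js
async function addToIndex(db, name, value, cid) {
  // Value left empty: the CID is recoverable from the key itself.
  await db.put(`i/${name}/${value}/${cid}`, '')
}

async function queryIndex(db, name, value) {
  const prefix = `i/${name}/${value}/`
  const cids = []
  const stream = db.createReadStream({
    gte: Buffer.from(prefix),
    lt: Buffer.concat([Buffer.from(prefix), Buffer.from([0xff])]),
  })
  for await (const { key } of stream) {
    cids.push(key.toString().slice(prefix.length))
  }
  return cids
}

// await addToIndex(db, 'city', 'toronto', 'bafy...')
// await queryIndex(db, 'city', 'toronto') // => ['bafy...']
```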
Merged the `search` branch in 177d99cbf651b95a19f806545cf9e32ebf243218. This adds the indexing functions to the `demo2` branch, where work for actually creating the index will happen. This issue will be closed once search functionality actually exists in that branch.
One remaining question (low priority, not needed for the demo) is what fields we want to index on in the import script. When importing integrity pipeline ZIPs, what indexes do we want to create by default? Of course, they can also be added later.
What fields do y'all have so far? Usually indexing by date, and by "author" or "type" if there is heterogeneous data, is a good start.
Note: one remaining issue with search is that even though DAG-CBOR encodes integers and floats differently, there is no way to tell in JavaScript which is which. This either requires forking the decoder library, or just indexing everything as floats. The safe range of integers for float64 is -(2^53 - 1) to 2^53 - 1 (~9 quadrillion), which should be good enough for our purposes.
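That bound can be checked directly in JavaScript:

```js
Number.MAX_SAFE_INTEGER === 2 ** 53 - 1 // true
Number.isSafeInteger(2 ** 53 - 1)       // true
Number.isSafeInteger(2 ** 53)           // false: 2**53 === 2**53 + 1 as float64
```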
Edit: see starlinglab/uwazi-aa#4
Basic search functionality (just direct matching) exists in starlinglab/authenticated-attributes#46 and has for a while. Closing this issue, and new issues can be made in the future for specific search features or improvements.
This comes from MVK9.
The user may want to search across their subscribed hyperbees for any data published by a specific author, any data mentioning a certain keyword, all archives of a specific domain, etc.
This reminds me a bit of Datasette; maybe there is a way we can use that as an offline explorer for your databases. That would require adding custom uwazi-hyperbee support, or exporting the hyperbee databases into a format Datasette supports.