Currently we put every type of document into one vector DB:
- GitHub issues
- sections of Go documentation
- Gerrit CLs
- and so on.
Our Related Entities API (#22) may want to (a) let users ask for a subset of the possible types, and (b) classify results by type.
As for classifying the results: all the IDs are currently URLs, and it is easy to tell the type of doc from the form of the URL. I think we can continue that indefinitely, so we don't need separate namespaces or metadata to identify the type of doc.
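A rough sketch of classifying by URL form, to make the idea concrete. The `DocType` names, the `docType` function, and the exact host and path patterns are assumptions for illustration; the real ID shapes may differ.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// DocType is a hypothetical label for the kind of document an ID refers to.
type DocType string

const (
	TypeIssue   DocType = "issue"
	TypeGoDoc   DocType = "godoc"
	TypeCL      DocType = "cl"
	TypeUnknown DocType = "unknown"
)

// docType guesses the document type from the form of its URL ID.
// The host and path patterns are assumptions, not a confirmed ID scheme.
func docType(id string) DocType {
	u, err := url.Parse(id)
	if err != nil {
		return TypeUnknown
	}
	switch {
	case u.Host == "github.com" && strings.Contains(u.Path, "/issues/"):
		return TypeIssue
	case u.Host == "go-review.googlesource.com":
		return TypeCL
	case u.Host == "go.dev" && strings.HasPrefix(u.Path, "/doc/"):
		return TypeGoDoc
	}
	return TypeUnknown
}

func main() {
	// Example IDs only; the numbers and paths are made up.
	ids := []string{
		"https://github.com/golang/go/issues/1",
		"https://go-review.googlesource.com/c/go/+/1",
		"https://go.dev/doc/effective_go",
		"https://example.com/other",
	}
	for _, id := range ids {
		fmt.Printf("%-8s %s\n", docType(id), id)
	}
}
```

Adding a new type of doc would mean adding one more case, as long as its URLs stay distinguishable from the others.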
To support asking for a subset of types, we can just search for more documents and throw out the ones that don't match. That can be expensive, though, since we might have to do multiple searches with increasing limits until we get the docs we want. If we only let the user provide a threshold (max distance from the query) instead of a limit (number of documents), then a single call will do, because filtering by type can never leave us short of a requested count.
An alternative is to use a separate namespace for each type. Advantages are that the type of document would be more evident, and we could query different types concurrently. Disadvantages are that we'd have to rewrite everything, and we'd have to perform N queries instead of one and merge the results.
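For comparison, a minimal sketch of the namespace alternative: query each namespace concurrently, then merge by distance. `Result`, `namespaceSearch`, `searchAll`, and `fixed` are hypothetical names, and the per-namespace query is faked with in-memory data.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Result is a hypothetical search hit: a document ID and its distance
// from the query.
type Result struct {
	ID       string
	Distance float64
}

// namespaceSearch stands in for a query against one namespace that returns
// at most limit results. It is an assumption, not a real API.
type namespaceSearch func(limit int) []Result

// searchAll runs one query per namespace concurrently, merges the results,
// and returns the limit nearest overall.
func searchAll(namespaces map[string]namespaceSearch, limit int) []Result {
	var (
		mu     sync.Mutex
		wg     sync.WaitGroup
		merged []Result
	)
	for _, search := range namespaces {
		wg.Add(1)
		go func(search namespaceSearch) {
			defer wg.Done()
			rs := search(limit)
			mu.Lock()
			merged = append(merged, rs...)
			mu.Unlock()
		}(search)
	}
	wg.Wait()

	// Sort by distance, breaking ties by ID so the order is deterministic.
	sort.Slice(merged, func(i, j int) bool {
		if merged[i].Distance != merged[j].Distance {
			return merged[i].Distance < merged[j].Distance
		}
		return merged[i].ID < merged[j].ID
	})
	if len(merged) > limit {
		merged = merged[:limit]
	}
	return merged
}

// fixed returns a fake namespaceSearch backed by the given results.
func fixed(rs ...Result) namespaceSearch {
	return func(limit int) []Result {
		if limit > len(rs) {
			limit = len(rs)
		}
		return rs[:limit]
	}
}

func main() {
	// Made-up namespaces and data.
	namespaces := map[string]namespaceSearch{
		"issues": fixed(Result{"https://github.com/golang/go/issues/1", 0.20}),
		"docs":   fixed(Result{"https://go.dev/doc/a", 0.10}),
		"cls":    fixed(Result{"https://go-review.googlesource.com/c/go/+/1", 0.15}),
	}
	for _, r := range searchAll(namespaces, 2) {
		fmt.Println(r.Distance, r.ID)
	}
}
```

Restricting the search to a subset of types would then just mean passing a smaller map, at the cost of the N queries and the merge step.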