- Add support for pgvector's hnsw and generic support for Postgres indexes. See the explanation below; this is the meat of the PR.
- Install the tools required to collect flamegraphs when needed. Can revert this if requested.
This PR adds support for benchmarking pgvector's hnsw index access method with
the runbooks and the datasets supported by bigann benchmarks.
To do that, it adds a base Docker image that will help us test other Postgres
index access methods in the future. To make use of that image, install.py needed
some changes so that Postgres-based indexes can depend on a common Docker image
that already has Postgres installed. Note that install.py builds that base image
only if the algorithm name starts with "postgres-". If this PR is no longer a
draft, then I should've already documented this in the docs.
This PR also adds BaseStreamingANNPostgres, which can be used to easily add
support for other Postgres-based indexes in the future. One would simply need
to define a new Python wrapper that implements:
- `determine_index_op_class(self, metric)`
- `determine_query_op(self, metric)`
and that properly sets the following attributes in its `__init__` method
before calling `super().__init__`:
- `self.name`
- `self.pg_index_method`
- `self.guc_prefix`
Given that pgvector's hnsw is the first Postgres-based index that benefits from
this infra (via this PR), neurips23/streaming/postgres-pgvector-hnsw/ can serve
as an example of how to make use of Dockerfile.BasePostgres and
BaseStreamingANNPostgres to add support for more Postgres-based indexes.
Unlike the other algorithms under streaming, completing a runbook can take
several times longer than it does for other algorithms. This is not because
Postgres-based indexes are slow, but because SQL is the only interface to such
indexes. So, all those insert / delete / search operations first have to build
SQL queries, and, specifically for inserts, transferring the data to the
Postgres server adds significant overhead. Unless we make some huge changes in
this repo to redesign "insert" in a way that could benefit from Postgres's
server-side COPY functionality, we cannot do much about it.
Other than that, please feel free to drop comments if you see any inefficiencies
that I can quickly fix in my code. Note that I'm not a Python expert, hence
sincerely requesting this :)
And, to explain the build & query time params that have to be provided in
such a Postgres-based indexing algorithm's config.yaml file, let's take a look
at the following snippet from pgvector's hnsw's config.yaml file:
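For reference, a minimal sketch of that snippet's key pieces, reconstructed from the CREATE INDEX and SET statements shown later in this description (the connection counts here are placeholders, and surrounding config.yaml keys are omitted):

```yaml
# Reconstructed sketch, not the actual file: m, ef_construction, and ef_search
# come from the statements below; insert_conns / query_conns are placeholders.
args: |
  [{"insert_conns": 8, "m": 16, "ef_construction": 64}]
query-args: |
  [{"query_conns": 8, "ef_search": 50}]
```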
The presence of insert_conns & query_conns is enforced by BaseStreamingANNPostgres,
and any Postgres-based index implementation that we add to this repo in the future
must also provide values for them in its config.yaml file.
- `insert_conns`: Similar to insert_threads in other algorithm implementations,
this is used to determine parallelism for inserts. In short, it determines the
number of database connections used during insert steps.
- `query_conns`: Similar to T in other algorithm implementations, this is used
to determine parallelism for SELECT queries. In short, it determines the number
of database connections used during search steps.
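For intuition, a tiny hypothetical sketch of how a batch of rows could be fanned out across that many connections (round-robin chunking; the actual scheduling in BaseStreamingANNPostgres may differ):

```python
# Sketch only: split a batch of rows across n_conns database connections.
def split_among_conns(rows, n_conns):
    # Round-robin: connection i gets rows i, i + n_conns, i + 2*n_conns, ...
    return [rows[i::n_conns] for i in range(n_conns)]
```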
Other than those two params, any other parameters needed when building the index
or when performing an index scan (read as the "search" step) must be specified
via config.yaml too.
The parameters provided in "args" (except insert_conns) are passed directly into
the CREATE INDEX statement used to create the index in the setup step. For
example, for pgvector's hnsw, the above config results in the following CREATE
INDEX statement. Note the "WITH" clause in particular:
CREATE INDEX vec_col_idx ON test_tbl USING hnsw (vec_col vector_l2_ops) WITH (m = 16, ef_construction = 64);
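As a hedged illustration (a sketch, not the actual BaseStreamingANNPostgres code; the helper and its parameter names are hypothetical), the mapping from "args" to that statement could look like:

```python
# Sketch only: turn config.yaml "args" into a CREATE INDEX statement.
def build_create_index_sql(table, column, index_method, op_class, args):
    # insert_conns controls connection parallelism, not an index build parameter.
    with_items = ", ".join(f"{k} = {v}" for k, v in args.items() if k != "insert_conns")
    sql = f"CREATE INDEX {column}_idx ON {table} USING {index_method} ({column} {op_class})"
    if with_items:
        sql += f" WITH ({with_items})"
    return sql + ";"
```

Calling it with `("test_tbl", "vec_col", "hnsw", "vector_l2_ops", {"insert_conns": 8, "m": 16, "ef_construction": 64})` reproduces the statement above.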
The parameters provided in "query-args" (except query_conns) are used directly
to set the GUCs that determine the runtime parameters used during "search"
steps, via SET commands. Note that BaseStreamingANNPostgres qualifies all those
query-args with self.guc_prefix when creating the SET commands that need to be
run on all Postgres connections. For example, for pgvector's hnsw, the above
config results in executing the following SET statement for each Postgres
connection. Note that if pgvector's hnsw had more query-args, then we'd have
multiple SET statements:
SET hnsw.ef_search TO 50;
We prefixed "ef_search" with "hnsw" because PostgresPgvectorHnsw sets
self.guc_prefix to "hnsw".
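A hedged sketch (hypothetical helper, not the actual implementation) of how query-args could be turned into those per-connection SET statements:

```python
# Sketch only: qualify each query-arg with guc_prefix and emit a SET statement.
def build_set_sqls(guc_prefix, query_args):
    # query_conns controls connection parallelism, not a GUC.
    return [
        f"SET {guc_prefix}.{name} TO {value};"
        for name, value in query_args.items()
        if name != "query_conns"
    ]
```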
And while we're at it, let's take a closer look at how the Python wrapper should
look when adding support for a Postgres-based index in the future. From the
wrapper added for pgvector's hnsw:
from neurips23.streaming.base_postgres import BaseStreamingANNPostgres

class PostgresPgvectorHnsw(BaseStreamingANNPostgres):
    def __init__(self, metric, index_params):
        self.name = "PostgresPgvectorHnsw"
        self.pg_index_method = "hnsw"
        self.guc_prefix = "hnsw"

        super().__init__(metric, index_params)

    # Can add support for other metrics here.
    def determine_index_op_class(self, metric):
        if metric == 'euclidean':
            return "vector_l2_ops"
        else:
            raise Exception('Invalid metric')

    # Can add support for other metrics here.
    def determine_query_op(self, metric):
        if metric == 'euclidean':
            return "<->"
        else:
            raise Exception('Invalid metric')
- `self.name` probably sets the experiment name; it's mostly required by the
grandparent class BaseStreamingANN.
- `self.pg_index_method` is used to specify the index-access-method name in the
USING clause of the CREATE INDEX statement used when creating the index. See the
docs for CREATE INDEX mentioned earlier and the CREATE INDEX statement shared
above as an example of how we make use of it.
- `self.guc_prefix` is used to qualify the GUCs that need to be set to enforce
query-args, as described above.
- `determine_index_op_class(self, metric)` is used to map the given metric to
the relevant operator class that the index needs to use, which is passed to the
CREATE INDEX statement when building the index. See the docs for CREATE INDEX
mentioned earlier. For example, when the metric is "euclidean" for pgvector's
hnsw, this function returns "vector_l2_ops", as used in the above CREATE INDEX
statement.
- `determine_query_op(self, metric)` is used to map the given metric to the
comparison operator used to match the index during search. For example, when the
metric is "euclidean" for pgvector's hnsw, this function returns "<->", which is
used in the SELECT query executed in the search step, as in:
SELECT id FROM test_tbl ORDER BY vec_col <-> [input-query-vec] LIMIT [k]
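To make that concrete, a hypothetical sketch of assembling the search query from determine_query_op's result (the table and column names are assumptions, and the real code may build the query differently):

```python
# Sketch only: build the search query using the metric's comparison operator.
def build_search_sql(query_op, k):
    # %s is a driver-side placeholder for the input query vector.
    return f"SELECT id FROM test_tbl ORDER BY vec_col {query_op} %s LIMIT {k};"
```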
Closes https://github.com/harsha-simhadri/big-ann-benchmarks/issues/293.