icaros-usc / pyribs

A bare-bones Python library for quality diversity optimization.
https://pyribs.org
MIT License

Support unstructured archives [FEATURE REQUEST] #468

Closed · gresavage closed this issue 4 days ago

gresavage commented 1 month ago

Description

I am working on an algorithm that relies on an unstructured archive. This means the archive doesn't have a grid structure that is known ahead of time. Instead, the archive can grow without bound and uses a nearest-neighbor approach to add solutions. Unfortunately, the paper that describes this archive doesn't cite where this type of archive originated, so I can't provide a source. The paper I am referencing is here: https://dl.acm.org/doi/abs/10.1145/3583131.3590524

The archive uses a distance parameter $\lambda$ to determine whether there is already an entry in the "$\lambda$-ball" around a candidate in measure space: if there is, the candidate competes with that entry in the usual way; if not, it is added unconditionally. This is probably most similar to a CVT archive in which the number and locations of the centroids are updated every time a solution is added, and for a CVT archive, recalculating the centroids can be resource-intensive. The unstructured archive differs from CVT and grid archives in that the bounds and partitioning of measure space don't need to be known in advance, the number of cells is non-deterministic, and there is no need to recalculate centroids on every update.

The specific unstructured archive mentioned here uses Euclidean distance to determine the cells and whether objectives should be compared, but this could be generalized to other methods, such as a density-based approach (if the density of solutions in a region of measure space is below a threshold, add the solution unconditionally; otherwise, replace the lowest-objective solution in the region), an energy-based approach, etc.
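For concreteness, here is a minimal sketch of the Euclidean $\lambda$-ball rule as I understand it (plain NumPy; the function name and signature are just for illustration, not from the paper or from pyribs):

    import numpy as np

    def lambda_ball_add(measures, objectives, new_measure, new_objective, lambda_dist):
        """Sketch of the lambda-ball rule: compete with the nearest archive
        entry if it lies within lambda_dist in measure space, otherwise
        insert the candidate as a new 'cell'."""
        if measures:
            dists = np.linalg.norm(np.stack(measures) - new_measure, axis=1)
            nearest = int(np.argmin(dists))
            if dists[nearest] <= lambda_dist:
                # an entry occupies the lambda-ball -> compare objectives
                if new_objective > objectives[nearest]:
                    measures[nearest] = new_measure
                    objectives[nearest] = new_objective
                return
        # no entry within lambda_dist -> the archive grows by one cell
        measures.append(new_measure)
        objectives.append(new_objective)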

btjanaka commented 1 month ago

Hi @gresavage; thanks for reaching out! The archive you mentioned sounds like the one from Novelty Search (http://eplex.cs.ucf.edu/papers/lehman_ecj11.pdf). Indeed, pyribs does not currently support it, but we have wanted to for a very long time.

I'll put this on my TODO list, but it may be a while (months) before I can get to it. If you have time to spare, I'd also be willing to guide you through a PR!

gresavage commented 4 weeks ago

@btjanaka Thanks for finding the source!

I am happy to initiate a PR. I have implemented a solution, so it should be pretty quick and easy to contribute to the project... assuming it works as intended :grimacing:. The biggest hurdle I had to solve was filtering the batch-add of solutions when more than one solution in the incoming batch occupied the same "cell".
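To illustrate the kind of filtering I mean (a hypothetical sketch, not the actual code in my branch): when several candidates in a batch land within $\lambda$ of each other, only the best one per implied cell should survive before the normal add logic runs:

    import numpy as np

    def filter_batch(batch_measures, batch_objectives, lambda_dist):
        # greedily keep, per lambda-ball "cell", only the best candidate
        order = np.argsort(batch_objectives)[::-1]  # best objective first
        kept = []
        for i in order:
            if all(
                np.linalg.norm(batch_measures[i] - batch_measures[j]) > lambda_dist
                for j in kept
            ):
                kept.append(i)
        return np.sort(np.asarray(kept))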

I also had the idea of a "lazy" archive where the solution_dim and measure_dim are inferred from the first call to add. I can open a separate issue for the "lazy" archive if you're open to that idea.

btjanaka commented 4 weeks ago

Ah nice! Feel free to open a PR. I had some thoughts on such an archive that I can share there.

For the lazy archive idea, I think inferring the dimensions would complicate initialization. For instance, the whole call to ArchiveBase.__init__ would have to be done in a separate method. At the same time, I can see scenarios where you would need to do that. Is there a specific use case you had in mind? Feel free to open an issue and we can talk more there.
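To make the complication concrete, deferred initialization would look something like this sketch (hypothetical names, not actual pyribs code):

    import numpy as np

    class LazyArchive:
        def __init__(self, **kwargs):
            # real setup must wait until the first add reveals the dims
            self._init_kwargs = kwargs
            self._initialized = False

        def add(self, solution_batch, objective_batch, measures_batch):
            if not self._initialized:
                solution_dim = np.asarray(solution_batch).shape[1]
                measure_dim = np.asarray(measures_batch).shape[1]
                self._deferred_init(solution_dim, measure_dim, **self._init_kwargs)
                self._initialized = True
            ...  # normal add logic

        def _deferred_init(self, solution_dim, measure_dim, **kwargs):
            ...  # everything that would normally happen in __init__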

gresavage commented 4 weeks ago

@btjanaka Will do! FWIW, you'll see that the UnstructuredArchive I implemented is still somewhat structured in the sense that all solutions must have the same solution_dim and measure_dim. Since the Novelty Search paper uses NEAT, the topology changes and therefore so does the solution_dim, so supporting that setting would be more complicated.

I haven't thought yet about how to efficiently implement this degree of "free structure" while keeping to the np.ndarray-based data structure. My first thought is to use a masked array if we are strictly married to an np.ndarray for its compactness and speed.

btjanaka commented 3 weeks ago

One idea for variable solution sizes is to make the archive store solutions as objects. Currently, we assume the solutions, objectives, and measures will all be floats, but it's certainly possible to have other data types. It's actually a pretty simple implementation given the ArrayStore data structure we use inside the archive, but I haven't had time to figure out the best way to expose an API for this on the archive itself. For this implementation, I think it's best to just assume fixed-size continuous solutions for now; when the other dtypes come along, the archive will easily inherit those changes.

By the way, does your implementation use ArrayStore? One of the reasons I refactored the archives around ArrayStore was that I envisioned it could serve as the backbone of a novelty search archive. In particular, it would allow easily resizing the archive as it expands (similar to how a C++ vector is implemented), since novelty search's archive is unbounded in size.
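The resizing I have in mind is standard amortized doubling, roughly like this sketch (not the actual ArrayStore code):

    import numpy as np

    def grow_to_fit(arr, needed):
        # double the backing array when it fills up, like a C++ vector,
        # so that repeated appends are amortized O(1)
        capacity = arr.shape[0]
        if needed <= capacity:
            return arr
        new_capacity = max(needed, 2 * capacity, 1)
        new_arr = np.empty((new_capacity,) + arr.shape[1:], dtype=arr.dtype)
        new_arr[:capacity] = arr
        return new_arr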

gresavage commented 3 weeks ago

Interesting. I had briefly thought of using an object dtype, but I have only ever used NumPy with the typical int/float/bool/str dtypes, so I wasn't sure how well it was supported.

I didn't use the ArrayStore structure. I based my implementation on v0.6.4 because that was the version I was using in my project, and I assumed there wouldn't be significant breaking changes prior to a major version release.

I will investigate the ArrayStore structure.

btjanaka commented 3 weeks ago

Yup, numpy supports object arrays too. Basically each object is a differently sized array in this scenario.
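For example:

    import numpy as np

    # an object array holding differently sized solutions, e.g. NEAT genomes
    solutions = np.empty(3, dtype=object)
    solutions[0] = np.array([0.1, 0.2])
    solutions[1] = np.array([0.3, 0.4, 0.5])
    solutions[2] = np.array([0.6])
    print([s.shape for s in solutions])  # [(2,), (3,), (1,)]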

Regarding changes to the API, yes, that's a correct assumption, but mostly for the public API. It's a bit fluid because we're still <1.0.0, so we're still figuring out what works best. We tried to limit the API changes in 0.7.0 to renaming parameters that people don't really touch anyway, but the archive implementation has major changes to make it more flexible, in particular with ArrayStore. Definitely take a look; with some luck, it should make your life easier!

gresavage commented 2 weeks ago

@btjanaka I initiated PR #472

Sorry it took so long. Please take a look - I welcome your feedback. I am sure there are probably fancier ways of doing things.

btjanaka commented 2 weeks ago

No worries! Thank you so much for working on this. I will take a look shortly.

gresavage commented 2 weeks ago

@btjanaka I may have found an issue with _unstructured_archive.py, line 198

As I was using this in my own code, I found that it was possible for the np.logical_or to return an empty array, meaning that none of the top-k neighbors is in the archive or will be added.

I am trying to work out whether this is a flaw in my implementation or an actually possible and expected outcome. The simple solution would be to change this section to instead check the entire archive + batch using the previous results from argsort, like so:

        ...
        # find the nearest [k+1]-neighbors to determine the sparsity
        # use k+1 since self is included in the first k-neighbors
        sorted_dist_indices = np.argsort(distances, axis=1)
        top_k = sorted_dist_indices[:, : min(self._k_neighbors + 1, distances.shape[1])]

        # determine which entries can be newly inserted and resize
        # the store to accommodate them
        where_mask = np.zeros(distances.shape, dtype=np.bool_)
        rows = np.repeat(
            np.arange(batch_size, dtype=np.int32)[:, None], top_k.shape[1], axis=1
        )
        where_mask[rows, top_k] = True
        # mean distance to the k nearest neighbors (the self-distance of 0 is
        # included in the sum, so dividing by k averages over the k true
        # neighbors); distances holds squared distances, hence the sqrt
        sparsity = (
            np.sqrt(distances[where_mask]).reshape(batch_size, -1).sum(axis=1)
            / self._k_neighbors
        )
        new_entries = sparsity > self._sparsity_threshold
        new_entry_indices = np.argwhere(new_entries) + curr_occ

        # use the nearest index already in the archive or the closest one which
        # will be added
        indices = np.array(
            [
                tk[
                    np.logical_or(
                        tk < curr_occ,
                        np.isin(tk, new_entry_indices, assume_unique=True),
                    )
                ][0]
                for tk in sorted_dist_indices
            ],
            dtype=np.int32,
        )
        ...

Also, I discovered the existence of np.isin, which I believe is faster than what I had.
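For example (a toy illustration of np.isin, not code from the PR):

    import numpy as np

    top_k_row = np.array([4, 7, 2, 9])
    new_entry_indices = np.array([2, 9])
    # True where a candidate index is among the entries that will be added
    print(np.isin(top_k_row, new_entry_indices, assume_unique=True))
    # [False False  True  True]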