lab-cosmo / chemiscope

An interactive structure/property explorer for materials and molecules
http://chemiscope.org
BSD 3-Clause "New" or "Revised" License

chemiscope.explore #346

Open sofiia-chorna opened 3 weeks ago

ceriottm commented 3 weeks ago

Hey @sofiia-chorna @bananenpampe, I have many concerns about t-SNE; cf. https://distill.pub/2016/misread-tsne/

bananenpampe commented 3 weeks ago

I am aware of these shortcomings, but my counterargument would be that the user will still inspect the dimensionality reduction and decide for themselves whether they can deduce something from it. If the input were pure noise, nothing could be deduced from the clusters either.

Luthaf commented 2 weeks ago

So there are a couple of things we need to decide here:

1) what should be the default featurization scheme
2) how can the user customize the featurization scheme
3) how can the user give additional properties to display

For (1), my main requirement is to find something that is easy to install on all platforms we support (this includes Windows) and does not break. I'm a bit worried about using MACE for this, since it relies on torch and looks a bit unstable to me. Rascaline + SOAP has the advantage of not requiring a pre-trained model, but fails when trying to handle many atomic types simultaneously, and is currently a pain to install. DScribe could be an alternative option as well.

I'm not a big fan of t-SNE either because (a) it is stochastic and can produce different results over multiple runs; and (b) adding new data can completely change the clustering. For the default, I feel like something simple and stupid like PCA/KPCA would be better. This does not prevent us from giving an example of how to use the code with t-SNE for people who like this dimensionality reduction better, but PCA/KPCA strikes me as a much safer default.
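To make points (a) and (b) concrete, here is a minimal sketch of plain PCA written with NumPy only. The helper names `fit_pca` and `project` and the random stand-in descriptors are assumptions made for illustration; this is not chemiscope code. It shows that refitting on the same data gives the same map, and that with a fixed fit, new structures can be projected without moving the existing points.

```python
import numpy as np


def fit_pca(features, n_components=2):
    """Return the mean and top principal axes of `features` (n_samples, n_features)."""
    mean = features.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(features, mean, components):
    """Project `features` onto a previously fitted PCA basis."""
    return (features - mean) @ components.T


rng = np.random.default_rng(0)
features = rng.normal(size=(50, 8))     # stand-in for per-structure descriptors
new_features = rng.normal(size=(5, 8))  # structures added later

mean, components = fit_pca(features)
first_run = project(features, mean, components)

# (a) refitting on the same data reproduces the same map
mean_again, components_again = fit_pca(features)
second_run = project(features, mean_again, components_again)

# (b) new points are placed on the existing map; old coordinates are untouched
new_points = project(new_features, mean, components)
```

Refitting PCA on the enlarged dataset would still shift the axes somewhat, so the stability shown in (b) holds only as long as the original fit is reused.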


For (2) I would go with chemiscope.explore(frames, featurize=some_function), where some_function takes the frames and returns an ndarray. The function would handle both feature calculation and dimensionality reduction. I think this is what's currently called reducer.
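A rough sketch of what such a function could look like, under stated assumptions: `composition_pca` and `FakeFrame` are hypothetical names, the frames are assumed to expose `get_atomic_numbers()` as ASE Atoms objects do, and element counts stand in for a real descriptor such as SOAP. The function does both steps (featurization, then reduction) and returns one row of 2D coordinates per frame.

```python
import numpy as np


class FakeFrame:
    """Minimal stand-in for an ASE Atoms object (illustration only)."""

    def __init__(self, numbers):
        self._numbers = np.asarray(numbers)

    def get_atomic_numbers(self):
        return self._numbers


def composition_pca(frames, n_components=2):
    """Hypothetical featurizer: element counts per frame, reduced with PCA."""
    species = sorted({int(z) for f in frames for z in f.get_atomic_numbers()})
    # One row per frame, one column per chemical species
    counts = np.array(
        [[np.sum(f.get_atomic_numbers() == z) for z in species] for f in frames],
        dtype=float,
    )
    centered = counts - counts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


frames = [
    FakeFrame([1, 1, 8]),        # H2O
    FakeFrame([6, 1, 1, 1, 1]),  # CH4
    FakeFrame([6, 8, 8]),        # CO2
    FakeFrame([7, 1, 1, 1]),     # NH3
]
coordinates = composition_pca(frames)

# With the API proposed above, the function would be passed as:
# chemiscope.explore(frames, featurize=composition_pca)
```

Keeping the featurizer a plain callable means chemiscope itself does not need to depend on any particular descriptor or reduction library.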

For (3) I would use the same kind of API as the rest of the function, passing an explicit properties argument. Then users who also want to extract properties from ASE Atoms can use chemiscope.extract_properties.

ceriottm commented 1 week ago

Hi @sofiia-chorna I downloaded and tried this and I've a few "broad strokes" action items that I think are needed:

  1. the docstring of chemiscope.explore needs to be extended, a lot. The rationale must be explained, the API for the featurizer documented, and we need an example.
  2. the example is very heavy and not super-pedagogic. I'd start with a simple example using the base SOAP/PCA version to explain the workflow and the logic, and then go in steps of increasing complexity.
  3. I would avoid adding large, heavy files to the repo. Maybe you can upload them to Zenodo and fetch them in the example script.
  4. I am a bit concerned about the many dependencies we are pulling in. It works, which is great, but we should have a clear (and fast) action plan to drop sections if the dependency chain breaks.

sofiia-chorna commented 1 week ago

Hello @ceriottm, thanks a lot for the feedback! I will address your points.

Regarding the docstring of the function, we were experimenting a bit with IncrementalPCA as the default featurizer, so I put writing the function's description "on hold" ^^

sofiia-chorna commented 1 week ago

I added a small diagram, but I'm fine with deleting it if that makes more sense. The remaining datasets are relatively small (< 30 kB each); the datasets with visualisation (to showcase the work of the featurizers) are now fetched from Zenodo.

Currently, SOAP + PCA are used by default if no featurize argument is provided. The dependencies for this can be installed with pip install chemiscope[explore]; this command was also added to the script for tox -e docs.