Open sofiia-chorna opened 3 weeks ago
I am aware of these shortcomings, but my counterargument would be that the user will still inspect the dimensionality reduction and decide for themselves whether they can deduce something from it. If the input were purely noise, nothing could be deduced from the clusters either.
So there are a couple of things we need to decide here:
1) what should be the default featurization scheme
2) how can the user customize the featurization scheme
3) how can the user give additional properties to display
For (1), my main requirement is to find something that is easy to install on all platforms we support (this includes Windows) and does not break. I'm a bit worried about using MACE for this, since it relies on torch and looks a bit unstable to me. Rascaline + SOAP has the advantage of not requiring a pre-trained model, but it fails when trying to handle many atomic types simultaneously, and is currently a pain to install. DScribe could be an alternative option as well.
I'm not a big fan of t-SNE either, because (a) it is stochastic and can produce different results over multiple runs; and (b) adding new data can completely change the clustering. For the default, I feel like something simple and stupid like PCA/KPCA would be better. This does not prevent us from giving an example of how to use the code with t-SNE for people who like this dimensionality reduction better, but PCA/KPCA strikes me as a much safer default.
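To make points (a) and (b) concrete, here is a minimal sketch of a plain PCA written with NumPy only. The helper names `fit_pca` and `project` and the random features are made up for illustration and are not part of chemiscope. Fitting twice on the same data gives the same projection, and new points can be projected with the already-fitted mean and components without moving the existing ones (refitting on the enlarged dataset would of course still change the axes).

```python
import numpy as np


def fit_pca(features, n_components=2):
    """Return the mean and the leading principal axes of `features`."""
    mean = features.mean(axis=0)
    # Rows of `vt` are the principal axes, sorted by decreasing singular value
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(features, mean, components):
    """Project `features` onto previously fitted principal axes."""
    return (features - mean) @ components.T


# Made-up feature matrix: 50 structures, 8 features each
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 8))

# (a) no random initialization: two independent fits agree
mean, components = fit_pca(features)
first = project(features, mean, components)
second = project(features, *fit_pca(features))

# (b) new data is projected with the existing axes, so the
# coordinates of the original 50 points stay where they were
new_points = rng.normal(size=(5, 8))
new_projection = project(new_points, mean, components)
```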
For (2) I would go with `chemiscope.explore(frames, featurize=some_function)`, where `some_function` takes the frames and returns an ndarray. The function would handle both feature calculation and dimensionality reduction. I think this is what's currently called `reducer`.
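As a rough sketch of what such a function could look like under this proposal: the example below is hypothetical, and uses element counts as a deliberately naive stand-in for real descriptors such as SOAP. It only assumes that each frame exposes `get_atomic_numbers()`, as ASE `Atoms` objects do. The name `composition_pca_featurizer` is made up, and the final `chemiscope.explore` call is left commented out because that signature is still under discussion here.

```python
import numpy as np


def composition_pca_featurizer(frames, n_components=2):
    """Hypothetical featurizer: element counts per frame, reduced with PCA.

    Takes a list of frames and returns an ndarray with one row per frame.
    """
    # Feature calculation: count how many atoms of each species a frame has
    numbers = [np.asarray(frame.get_atomic_numbers()) for frame in frames]
    species = np.unique(np.concatenate(numbers))
    features = np.array(
        [[np.sum(n == s) for s in species] for n in numbers], dtype=float
    )

    # Dimensionality reduction: PCA through an SVD of the centered features.
    # If there are fewer than `n_components` species or frames, the output
    # has correspondingly fewer columns.
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Proposed usage (not executed here):
# chemiscope.explore(frames, featurize=composition_pca_featurizer)
```

Keeping both steps inside one callable means the caller only has to agree on the input (frames) and the output (an array with one row per structure).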
For (3) I would use the same kind of API as the rest of the function, passing an explicit `properties` argument. Then users who also want to extract properties from ASE Atoms can use `chemiscope.extract_properties`.
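A small sketch of what an explicit `properties` argument could carry, assuming the same dictionary layout chemiscope already uses elsewhere (a `target` plus a list of `values` per property). The helper `structure_property` and the energy values are invented for illustration, and the chemiscope calls are commented out since they depend on the API being discussed here.

```python
def structure_property(values, units=None, description=None):
    """Wrap per-structure values in a chemiscope-style property dictionary."""
    prop = {"target": "structure", "values": [float(v) for v in values]}
    # Optional metadata is only added when provided
    if units is not None:
        prop["units"] = units
    if description is not None:
        prop["description"] = description
    return prop


# Made-up energies, one per frame
energies = [-1.2, -0.7, -3.4]
properties = {
    "energy": structure_property(
        energies, units="eV", description="illustrative energies"
    ),
}

# Proposed usage (not executed here):
# properties.update(chemiscope.extract_properties(frames))
# chemiscope.explore(frames, featurize=some_function, properties=properties)
```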
Hi @sofiia-chorna, I downloaded and tried this, and I have a few "broad strokes" action items that I think are needed:
`chemiscope.explore` needs to be extended, a lot. The rationale must be explained, the API for the featurizer documented, and we need an example.
Hello @ceriottm, thanks a lot for the feedback! I will address your points.
Regarding the docstrings of the function, we were experimenting a bit with `IncrementalPCA` as the default featurizer, so I put writing the description of the function "on hold" ^^
I added a small diagram but it's fine for me to delete it, if it makes sense.
The remaining datasets are relatively small (< 30 kB each); the datasets with visualisation (to showcase the work of the featurizers) are now fetched from Zenodo.
And currently, SOAP + PCA is used by default if no `featurize` argument is provided. The dependencies for it can be installed with `pip install chemiscope[explore]`; this was also added to the script for `tox -e docs`.
Hey @sofiia-chorna @bananenpampe, I have many concerns about t-SNE, cf. https://distill.pub/2016/misread-tsne/