materialsproject / matbench

Matbench: Benchmarks for materials science property prediction
https://matbench.materialsproject.org
MIT License

Pymatviz figures need refinement before they can be shown on main website/MP #172

Open · ardunn opened 1 year ago

ardunn commented 1 year ago

@janosh thanks for the PR and the work on pymatviz! A lot of the pymatviz plots that are not yet in matbench (esp. the uncertainty ones) will come in very handy for matbench in the near future :) But for now the EDA ones seem like a good start

There are still some rough edges I'd like to iron out before we move it to be automatically shown on the main website.

I made some edits to your code so the information artifacts (the bz2s), as well as the plots, are only generated if the script is passed two arguments. I also did some renaming (mostly prepending pymatviz to everything) to keep the naming consistent, pinned a particular version of pymatviz to use, and copied a static set of HTML files into the static docs dir instead of keeping them directly there (because *.html files get purged by nuke_docs on every run, so the plots would otherwise need to be regenerated instead of just copied).

I was planning on showing the figures beneath the leaderboards for the "Per Task Leaderboards" by just stuffing them in iframes, but this doesn't look particularly good at the moment (see screenshots).

For this reason, all the code (incl. artifacts etc.) for actually generating and putting the plots in the docs automatically is currently in the pymatviz_eda branch, not main.

The Problems (screenshots)

[screenshot]

Colors are clashing with the dark background.

The width of the frame is greater than the width of the readable area, causing a nasty-looking horizontal scroll bar and cutting off the colorbar.


[three screenshots]

Colors are clashing with the dark background. Also, I don't get what these plots are actually saying. The y-axis is composition(!?), but if I hover over individual points along that y-axis, the hover labels rarely match the axis labels... I'm just not really sure what the y-axis represents here. Maybe we should change the targets in this case to be the target variable rather than composition? Then, like in your original PR #126, we could see the breakdown of refractive index by crystal system etc.


[screenshot]

This plot looks generally OK, but could we make the text bigger and have the title text not blend into the dark background?

The Solutions (I need help)

It seems like some of these things can be fixed by changing the pymatviz arguments, but I'm not sure of the best way to do that.

Some of the other stuff seems like it needs to be edited in the iframe... I'm not so great at frontend web design, so I was just formatting iframes based on the html filenames like so:

f'\n<iframe src="../../static/{pmv_eda_path}" class="is-fullwidth" height="700px" width="1000px" frameBorder="0"> </iframe>\n\n'

...but there surely has to be a better way...
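Maybe something like a responsive wrapper div instead of hard-coded pixel sizes, so the frame scales with the readable area? A minimal sketch reusing the same pmv_eda_path variable (the 62.5% padding-bottom is just an illustrative 16:10 aspect ratio), though I don't know if this is the right approach:

    html = f"""
    <div style="position: relative; width: 100%; padding-bottom: 62.5%;">
      <iframe src="../../static/{pmv_eda_path}" frameborder="0"
              style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;">
      </iframe>
    </div>
    """

The wrapper gives the iframe a height proportional to its width, which would avoid the horizontal scroll bar and keep the colorbar in view.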

Do you know how we could fix these things?

ardunn commented 1 year ago

Also tagging @mkhorton, as some of these issues might be important when these figures are saved as JSON and loaded into the MP website...

ardunn commented 1 year ago

@janosh Another cool plot might be a clustering based on structure similarity, with the size of the dots representing the value of the target or similar

ardunn commented 1 year ago

Looks like I was able to get a more reasonable plot for the violins:

[screenshot: corrected violin plot]

Seems like the column index for selecting target variables was set to 2 instead of 1.

But it seems like these plots only make sense for regression problems; they're not super informative for classification:

[screenshot: classification violin plot]

It also takes quite a while to load/render these plots, esp. for the larger datasets...

janosh commented 1 year ago

@ardunn Sorry for the silence here. I wanted to reply yesterday that the compositions along the y-axis must be coming from a wrong column index, but I forgot. Too many things going on. 😄 Good thing you already fixed it! 👍

Definitely still committed to addressing your points and getting these plots onto the website. What would be the best way to collaborate on a PR? I can open one and then you as maintainer can commit to it, but it sounds like you already have some local changes. Do you want to merge them first? Or just push the branch and I'll branch off of that?

ardunn commented 1 year ago

@janosh Yes please! Just open a PR and make sure the box that says "allow edits from maintainers" etc. is checked, and I think I should be able to push commits to it as well. Also, to reiterate, this should be done on the pymatviz_eda branch (see here), as that branch is up to date and already has all my local commits on it

Also for simplicity, let me summarize and itemize the issues I had with the original plots:

  1. Backgrounds and text clash with the matbench color palette. Some figures fare better (sunburst) or worse (periodic table) than others. I am not against changing the matbench website colors if need be, but tbh I'm a dark mode person and I think the website looks better with dark colors (and all the graphics/graphs currently in matbench are optimized for dark mode)
  2. Some graphs in iframes are wider than the readable area (idk how to fix this because I'm a frontend dummy lol)
  3. (optional) Some kind of clustering based on structure similarity? Or compositional similarity for composition problems; as of right now, composition problems only have the periodic table plot :/
  4. (less important) There are still some code issues with how target variables are assigned etc. (the df.columns[2] thing is really likely to break), but these are things I can address (see the sketch below)
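
For reference, something less brittle than a hard-coded column index might look like this (a minimal sketch, assuming the convention that the inputs column comes first and the target column last; the dataset name is just an example):

    from matminer.datasets import load_dataset

    df = load_dataset("matbench_expt_gap")  # convention: inputs first, target last
    input_col, target_col = df.columns[0], df.columns[-1]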
janosh commented 1 year ago
  1. Definitely a dark mode guy here too! 😄 Should be easy to get plots without bg color.
  2. Not an <iframe> fan, but I think there's no way around that with Plotly if we want interactivity. Will fiddle with some params and see what happens.
  3. Maybe @sgbaird can chime in if he has any recommendations regarding a composition clustering algorithm? Maybe element-movers distance or earth mover's distance? Or what about DensMAP? Whatever the procedure, it should be fairly fast so that clustering Matbench composition datasets with up to 1e4 samples (maybe more in the future) is feasible. @CompRhys suggested Sinkhorn distance (which I've never heard of before 😅).
ardunn commented 1 year ago
  • Definitely a dark mode guy here too! 😄 Should be easy to get plots without bg color.

Ok great! I think we should invert the title text as well.
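
Presumably something like the following on the pymatviz side would cover both (a minimal sketch; the exact colors are placeholders):

    import plotly.graph_objects as go

    fig = go.Figure(go.Bar(x=["a", "b"], y=[1, 2]))
    fig.update_layout(
        paper_bgcolor="rgba(0,0,0,0)",  # transparent outer background
        plot_bgcolor="rgba(0,0,0,0)",   # transparent plotting area
        font_color="white",             # title/axis/tick text legible on a dark bg
    )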

  • Not an <iframe> fan, but I think there's no way around that with Plotly if we want interactivity. Will fiddle with some params and see what happens.

This sounds good to me! I am generally not a fan of having links to outside services (i.e., plots hosted on Plotly), bc. links are always susceptible to link rot or random stuff happening. If possible, I'd like to have some static assets stored in the repo which can be loaded by the website quickly (~1 s of page loading to render them). I'm not particularly concerned with how exactly this is done (iframe, something else, etc.), so I'm definitely open to other options

  • Maybe @sgbaird can chime in if he has any recommendations regarding a composition clustering algorithm? Maybe element-movers distance or earth mover's distance? Or what about DensMAP? Whatever the procedure, it should be fairly fast so that clustering Matbench composition datasets with up to 1e4 samples (maybe more in the future) is feasible. @CompRhys suggested Sinkhorn distance (which I've never heard of before 😅).

I don't think the clustering itself needs to be fast, as these figures (at least as I have it written in pymatviz_eda currently) won't be regenerated often, merely copied. I have no problem with these figures taking several minutes or even hours to regenerate, as long as it can be done on a single laptop (i.e., a single person can regenerate them when a benchmark changes or is added).

What is more important is that the figures be informative!

sgbaird commented 1 year ago

Any objections to numba as a dependency (@ardunn for matbench and @janosh for pymatviz)? If there are, that's fine, especially for Matbench given the constraints @ardunn mentioned. @janosh if you're open to it, I'd be happy to make a PR to pymatviz for doing chemical clustering. If that's of interest but numba is a no-go, there are other options.

@ardunn I might explore making it interactive (i.e. you can change some of the clustering parameters), but to make such a figure static, the HTML file could get large (not exactly sure how large without trying). Any sense of an upper limit I should try to stay within if I go down that route? 100 MB? 10 MB? 5 MB?

I suggest looking at this figure for an example of what I have in mind. In terms of being informative, I think something missing (that is on my backlog) is showing the N most frequent anonymous formulas and the N most frequent elements present in each cluster. I think that would help a lot with immediate interpretation.

[screenshot: composition cluster plot example]

Another visualization worth mentioning is target vs. (a proxy for) chemical novelty; see this example.

[screenshot: target vs. chemical novelty example]

CompRhys commented 1 year ago

Any objections to numba as a dependency (@ardunn for matbench and @janosh for pymatviz)? If there are, that's fine, especially for Matbench given the constraints @ardunn mentioned. @janosh if you're open to it, I'd be happy to make a PR to pymatviz for doing chemical clustering. If that's of interest but numba is a no-go, there are other options.

How much slower would it be with the POT-based elemd vs numba? I think POT is preferable due to environment robustness, if possible?

sgbaird commented 1 year ago

@CompRhys I haven't compared to elemd. I think it's a ~100-200x speedup relative to the original ElMD calculations. For reference, a $10,000 \times 10,000$ distance matrix can be computed in ~10-20 s parallelized on 6 cores with chem_wasserstein (which is based on dist-matrix), IIRC. It could use some more thorough benchmarking, but the speedup is ~2 orders of magnitude relative to ElMD.

CompRhys commented 1 year ago

@CompRhys I haven't compared to elemd. I think it's a ~100-200x speedup relative to the original ElMD calculations. For reference, a $10,000 \times 10,000$ distance matrix can be computed in ~10-20 s parallelized on 6 cores with chem_wasserstein (which is based on dist-matrix), IIRC. It could use some more thorough benchmarking, but the speedup is ~2 orders of magnitude relative to ElMD.

Just tested it: with POT it was 3 minutes for a 1k x 1k matrix on a single core, which seems ~2 orders of magnitude slower. That surprises me a little, as POT seems to be a well-maintained and documented library (https://github.com/PythonOT/POT); maybe my code linking it to pymatgen is just glacial.
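
For context, the per-pair computation with POT looks roughly like this (a minimal sketch; the positions/weights are toy values, not real modified-Pettifor numbers):

    import numpy as np
    import ot  # POT: Python Optimal Transport

    x_a = np.array([10.0, 55.0])        # element positions of composition A on a 1D scale
    w_a = np.array([0.5, 0.5])          # fractional amounts, summing to 1
    x_b = np.array([12.0, 60.0, 80.0])  # element positions of composition B
    w_b = np.array([0.2, 0.3, 0.5])

    M = ot.dist(x_a.reshape(-1, 1), x_b.reshape(-1, 1), metric="euclidean")
    emd = ot.emd2(w_a, w_b, M)  # exact EMD via the network simplex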

janosh commented 1 year ago

@sgbaird Cool, thanks for the great suggestions! 👍

Any objections to numba as a dependency (@ardunn for matbench and @janosh for pymatviz)? If there are, that's fine, especially for Matbench given the constraints @ardunn mentioned. @janosh if you're open to it, I'd be happy to make a PR to pymatviz for doing chemical clustering. If that's of interest but numba is a no-go, there are other options.

I'm a bit reluctant about numba as it's been a source of trouble over at https://github.com/materialsproject/crystaltoolkit/pull/270. At the same time, a 2 OoM speedup sounds attractive and unattainable without compilation. So I would say let's all hop in on the pymatviz_eda branch and see what we can cook up. If we get it working, so much the better. 😄

sgbaird commented 1 year ago

@CompRhys I based it on scipy.stats.wasserstein_distance (i.e., it only supports scalar featurizers such as mod_petti, not elemental feature vectors such as mat2vec), and I ended up rewriting a lot of basic NumPy functions in plain Python code to make it GPU/Numba compatible (NumPy support under GPU/Numba is pretty limited at the moment). This had the unintended but welcome benefit of increasing the CPU version's speed as well. Maybe also worth noting that the parallelization happens at the distance matrix level rather than within the distance calculations. So, I don't think the issue is with POT or your pymatgen interfacing; I think it's due to my use of Cramér's approximation to a 1D Wasserstein distance and the low-level (much lower than I originally intended) optimizations. Also, I'm interested to hear your comments on the Sinkhorn distance.
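
Concretely, the scalar-featurizer version amounts to something like this (a minimal sketch with illustrative positions on the 1D elemental scale):

    from scipy.stats import wasserstein_distance

    # a composition = positions of its elements on a 1D scale + fractional weights
    pos_a, wts_a = [10.0, 55.0], [0.5, 0.5]             # binary compound A
    pos_b, wts_b = [12.0, 60.0, 80.0], [0.2, 0.3, 0.5]  # ternary compound B

    d = wasserstein_distance(pos_a, pos_b, u_weights=wts_a, v_weights=wts_b)

For 1D costs, this closed form agrees with the exact network-simplex EMD, which is what makes the scalar restriction worthwhile speed-wise.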

@janosh sure thing! What about JAX as a dependency? It would be easy enough to whip up a JAX implementation given the Numpy support there, though there are no guarantees on the speedup. Sounds good about jumping on the pymatviz_eda branch 👍

ardunn commented 1 year ago

Any objections to numba as a dependency (@ardunn for matbench and @janosh for pymatviz)? If there are, that's fine, especially for Matbench given the constraints @ardunn mentioned. @janosh if you're open to it, I'd be happy to make a PR to pymatviz for doing chemical clustering. If that's of interest but numba is a no-go, there are other options.

I'd prefer not to have it as a main dependency for matbench, as numba can be finicky and it is not really needed for the core matbench functionality. If we 100% absolutely truly need it, we can add it as a dependency for the docs only, or as a codependency for pymatviz only (though that would of course be @janosh's call). I'd like to keep the dependencies for matbench itself very minimal.

@ardunn I might explore making it interactive (i.e. you can change some of the clustering parameters), but to make such a figure static, the HTML file could get large (not exactly sure how large without trying). Any sense of an upper limit I should try to stay within if I go down that route? 100 MB? 10 MB? 5 MB?

Eh, I'd say something like 5 MB should be the max. Otherwise, maybe we can just use the "cdn" argument to the Plotly write_html method so that the extra JS/CSS is referenced from Plotly's servers. I think this is what @janosh did in the original commits, and it reduced most HTML file sizes from 10 MB+ to a few hundred KB.
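
That's the include_plotlyjs option, e.g. (a minimal sketch):

    import plotly.express as px

    fig = px.scatter(x=[1, 2, 3], y=[3, 1, 2])
    # "cdn" references the multi-MB plotly.js bundle from Plotly's CDN instead
    # of inlining it, so the saved file only contains the figure's own data
    fig.write_html("plot.html", include_plotlyjs="cdn")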

I suggest looking at this figure for an example of what I have in mind. In terms of being informative, I think something missing (that is on my backlog) is showing the N most frequent anonymous formulas and the N most frequent elements present in each cluster. I think that would help a lot with immediate interpretation.

Yes! That seems like a really good idea!

@CompRhys I based it on scipy.stats.wasserstein_distance (i.e., it only supports scalar featurizers such as mod_petti, not elemental feature vectors such as mat2vec), and I ended up rewriting a lot of basic NumPy functions in plain Python code to make it GPU/Numba compatible (NumPy support under GPU/Numba is pretty limited at the moment). This had the unintended but welcome benefit of increasing the CPU version's speed as well. Maybe also worth noting that the parallelization happens at the distance matrix level rather than within the distance calculations. So, I don't think the issue is with POT or your pymatgen interfacing; I think it's due to my use of Cramér's approximation to a 1D Wasserstein distance and the low-level (much lower than I originally intended) optimizations. Also, I'm interested to hear your comments on the Sinkhorn distance.

@janosh sure thing! What about JAX as a dependency? It would be easy enough to whip up a JAX implementation given the Numpy support there, though there are no guarantees on the speedup. Sounds good about jumping on the pymatviz_eda branch 👍

@CompRhys @sgbaird @janosh Perhaps before we go down the rabbit hole of optimizing this: how long does a typical clustering take without numba/JAX/pot/etc. for one of the large datasets (e.g., mp_e_form)? Minutes? Hours? Days? Weeks? The actual generation of the plots (and the heavy clustering) would only be run very infrequently, and the graphs themselves would just be static assets. If the time to run non-optimized is something like hours or minutes, then to me it seems our two options are:

  • (a) Run expensive clustering op infrequently, avoid extra optimization work, avoid potential dependency hell
  • (b) Run fast parallelized clustering op as infrequently or frequently as we want, do extra optimization work, risk potential dependency/installation hell and/or CI implosion when it can't install numba etc.

I'd choose (a)

sgbaird commented 1 year ago

Avoiding numba sounds OK to me 👍

Didn't know about the "cdn" stuff. I'll keep that in mind. Thanks!

@CompRhys @sgbaird @janosh Perhaps before we go down the rabbit hole of optimizing this: how long does a typical clustering take without numba/JAX/pot/etc. for one of the large datasets (e.g., mp_e_form)? Minutes? Hours? Days? Weeks? The actual generation of the plots (and the heavy clustering) would only be run very infrequently, and the graphs themselves would just be static assets. If the time to run non-optimized is something like hours or minutes, then to me it seems our two options are:

  • (a) Run expensive clustering op infrequently, avoid extra optimization work, avoid potential dependency hell
  • (b) Run fast parallelized clustering op as infrequently or frequently as we want, do extra optimization work, risk potential dependency/installation hell and/or CI implosion when it can't install numba etc.

... I'd choose (a)

Based on @CompRhys's estimated time of 3 min for $1000 \times 1000$, $130\text{k} \times 130\text{k}$ is $\approx 130^2$ times more computation, so ~845 hrs estimated using elemd on a single core, though it could be parallelized across multiple cores. That would assume compositional clustering visualizations for the structure-based datasets. chem_wasserstein would probably be ~10 min total CPU time; the max I've used it with is ~30-40k entries. Besides, a dense $(1.3 \times 10^5)^2 \approx 1.7 \times 10^{10}$-entry distance matrix is past what most consumer hardware could store at once.

I should probably mention that UMAP (and by extension DensMAP) also depends on Numba, and HDBSCAN depends on Cython (I'm guessing the latter isn't an issue). Part of our discussion here might be spilling a bit past Matbench's scope, since runtime is a much bigger consideration for pymatviz than it would be for matbench. I don't think it's easy to get around the numba dependency for umap-learn, and I'm not aware of alternative Python UMAP implementations that don't depend on numba. Personally, I think UMAP is preferable to PCA, t-SNE, etc. for this kind of exploratory clustering visualization. I'm also not sure t-SNE would scale very well to 130k compounds (going off memory). PCA would be fine in terms of scaling, but probably less appealing from a data visualization standpoint.

What about the option of uploading hard-coded 2D embeddings to FigShare (or even just the Matbench repo) for each of the Matbench datasets?
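
That workflow could be as simple as the sketch below (umap-learn shown; the file names are illustrative):

    import numpy as np
    import umap  # umap-learn (depends on numba)

    dists = np.load("expt_gap_elmd_dists.npy")  # precomputed NxN distance matrix

    # run the expensive embedding once...
    emb = umap.UMAP(metric="precomputed").fit_transform(dists)

    # ...and ship only the small 2D coordinates alongside the Matbench metadata
    np.save("expt_gap_umap_2d.npy", emb)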

ardunn commented 1 year ago

That option seems pretty attractive. We could just keep it with the other matbench metadata, and it would have about the same frequency of updates as that metadata (i.e., infrequent).

It might be worth first seeing what some of these clustering plots look like with regard to the target variables. For example, does a clustering of the expt_gap dataset where the points are scaled by target value actually reveal anything about the dataset, or is it just pretty? I'd be glad to look into that myself if you have some starter-ish code for doing so

sgbaird commented 1 year ago

That option seems pretty attractive. We could just keep it with the other matbench metadata, and it would have about the same frequency of updates as that metadata (i.e., infrequent).

Sounds good! I think we should go with that.

It might be worth first seeing what some of these clustering plots look like with regard to the target variables. For example, does a clustering of the expt_gap dataset where the points are scaled by target value actually reveal anything about the dataset, or is it just pretty? I'd be glad to look into that myself if you have some starter-ish code for doing so

See elmd_densmap_cluster_colab.ipynb (Colab link). You could also have a look at bare_bones.py. With the larger datasets (thousands of compositions), the size attribute might get overwhelmed by the sheer number of points. An option would be to allow toggling between coloring by cluster and coloring by target value (for coloring by target values, see e.g. the second plot from this section; note that that particular figure is static).

I think the target values expressed in the manifold could provide a visual interpretation of the regression task's difficulty (i.e. the complexity of the response surface). Normally I think this is revealed in more quantitative measures like comparing the ratio between a dummy score and a baseline model score (which is already a nice part of Matbench); maybe this could reveal some other trends like "the model tends to struggle with these types of compositions", or "this indicates the model is highly structure-dependent because the response surface in composition space exhibits virtually no trend". Despite these comments, I think your comment "or is it just pretty?" is well-justified. Suppose the conclusions based on those visualizations don't corroborate tried-and-tested quantitative measures like the ratio between dummy score and a baseline model or something like $\frac{\mathrm{dummy} - \mathrm{compositionBaseline}}{\mathrm{dummy} - \mathrm{structureBaseline}}$. In that case, it's worth wondering if it's presenting something meaningful/useful.

Having invested in composition-based models, I recognize I have some bias: for example, a tendency to ask "what's the best way to visualize just the compositional information?" rather than "how useful is it really to visualize composition-only information?"

cc also @SurgeArrester who might have some suggestions

SurgeArrester commented 1 year ago

When testing ElMD I've also been running into dependency issues with numba, and I wish it had been an avoidable design choice. I avoid the issue in my own work by forcing the Python version to 3.7, but I don't think that's a viable option for a larger repo.

For the speed differences, I haven't done a deep dive into why pot is slower, as there are a lot of OT solvers to look at. Although I would expect the pot implementation of the network simplex to be comparable to ElMD, I have typically found it runs much slower, presumably because of the numba speedup in ElMD. The Sinkhorn distance is a linear-time algorithm, which should theoretically run much faster than the network simplex (which is O(n^2 log(n)) in the worst case). However, it is an iterative algorithm, and when I have implemented it I found that, whilst the core for-loop is a very simple operation, it may take several hundred iterations to find the solution. By comparison, the network simplex often generates the optimal solution as part of its initialisation and then never even enters the main optimisation loop, or it will typically find the optimal solution in a few iterations of the (much longer) optimisation loop. The Sinkhorn algorithm should run a lot faster on a GPU, but the time overhead of writing to VRAM may in fact slow down the whole operation compared to simply using the CPU cache where the object is already loaded.
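
For concreteness, the core Sinkhorn iteration is just a pair of alternating scaling updates; a minimal NumPy sketch (with a fixed iteration count rather than a proper convergence check):

    import numpy as np

    def sinkhorn(a, b, M, reg=0.1, n_iter=500):
        """Entropy-regularised OT cost between histograms a, b with cost matrix M."""
        K = np.exp(-M / reg)             # Gibbs kernel
        u = np.ones_like(a)
        for _ in range(n_iter):          # the simple loop that may need many passes
            v = b / (K.T @ u)            # alternating scaling updates
            u = a / (K @ v)
        P = u[:, None] * K * v[None, :]  # transport plan
        return np.sum(P * M)             # approximate OT cost

    a = np.array([0.5, 0.5])
    b = np.array([0.2, 0.3, 0.5])
    M = np.abs(np.subtract.outer([10.0, 55.0], [12.0, 60.0, 80.0]))
    cost = sinkhorn(a, b, M)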

The simplex algorithm has O(2^n) running time for the worst-case counterexample, but the average running time is empirically shown to be less than O(n^3) (often less than O(n^2)) for randomly initialised problems. I haven't found a direct reference for empirical running times of the network simplex, but I believe a similar phenomenon occurs for composition-matching problems (Vanderbei, Linear Programming, '20)

[screenshot]

If it is just the distance that is required, without the transportation plan, and a monotonic 1D elemental scale is acceptable (e.g. Mendeleev numbers but not elemental embeddings), then the method discussed in this issue should be the fastest. I can't find a reference for how I derived this implementation, so I'm not 100% sure whether it aligns with the method given in the literature: https://www.imagedatascience.com/transport/OTCrashCourse.pdf slide 45; https://arxiv.org/pdf/1804.01947.pdf section IV A.
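
For reference, the standard closed-form 1D result (which I believe is what those references describe) is:

$$W_1(a, b) = \int_{\mathbb{R}} \left| F_a(t) - F_b(t) \right| \, dt$$

where $F_a$ and $F_b$ are the CDFs of the two compositions on the 1D elemental scale, so computing the distance only requires a sort rather than a full OT solve.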

For linear compositional clustering, I used to create these by generating a full distance matrix and projecting it to 1D PCA embeddings, but I've found that simply computing each composition's distance to hydrogen and sorting on that value is a much faster method that gives a comparable ordering. A list of ElMD objects can be sorted this way using the built-in sorted()/.sort() methods.
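
i.e. something like (a minimal sketch using ElMD's elmd() method; the formulas are arbitrary examples):

    from ElMD import ElMD  # pip install ElMD

    formulas = ["NaCl", "CaTiO3", "Fe2O3", "LiFePO4"]
    # 1D ordering: sort compositions by their distance to a reference (hydrogen)
    ordered = sorted(formulas, key=lambda f: ElMD(f).elmd("H"))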

SurgeArrester commented 1 year ago

I've added this method to ElMD==0.5.3; it definitely seems to give a fair speedup and agrees with the network simplex to the 4th decimal place. You can't access the transport plan with this approach, but I don't think that's needed for most operations

[screenshot]

SurgeArrester commented 1 year ago

I've removed the numba dependency in ElMDpy, which is available via pip install ElMDpy. This should work on all versions of Python and uses the fast method by default

janosh commented 1 year ago

Wow, super cool @SurgeArrester! I'm learning a lot just from reading this thread. 😄

I just tried ElMD and, as long as you don't use the new metric="fast", it runs fine under py3.10 with 2 simple tricks:

  1. Install it with pip install ElMD --no-deps so that it doesn't try to downgrade numpy
  2. In python3.10/site-packages/ElMD/ElMD.py, replace
    from numba import njit

    with

    try:
        from numba import njit
    except ImportError:
        def njit(*args, **kwargs):
            # handle bare @njit usage (called directly on the function)
            if len(args) == 1 and callable(args[0]) and not kwargs:
                return args[0]
            # handle parameterized @njit(...) usage with a no-op decorator
            def decorator(func):
                return func
            return decorator

    so that every JIT function just becomes its uncompiled equivalent.

Maybe worth adding a note in the readme for py3.8+ users.

Also, GH comments support KaTeX math using $x$ delimiters: $\mathcal O(n^2 \log n)$.

Anyway, I haven't started working on this yet, but it's high on my list.

SurgeArrester commented 1 year ago

Ahh, excellent suggestion, that's a much cleaner solution, many thanks! Pushed to the latest version

mkhorton commented 1 year ago

Another option is https://github.com/ptooley/numbasub if you wanted to use additional numba features. It looks unmaintained, but I imagine it still works; unfortunately it's not on PyPI. (Edit: given the license, one could probably also just copy the nonumba.py file into a project.)

sgbaird commented 1 year ago

@SurgeArrester nice! I ran a comparison and it looks like there's a ~2x speedup for ElMDpy (30 min) relative to ElMD (60 min).

@ardunn back to the topic of computing it once and loading it, here's a notebook that does the clustering for matbench_expt_gap.

sgbaird commented 1 year ago

@faris-k mentioned https://projector.tensorflow.org/ to me. For compositions, a heatmap image of the periodic table could be used. A 2D image of the crystal structure could also be used for structures.