materialsvirtuallab / m3gnet

Materials graph network with 3-body interactions featuring a DFT surrogate crystal relaxer and a state-of-the-art property predictor.
BSD 3-Clause "New" or "Revised" License

Full matterverse.ai data available? #23

Closed: sgbaird closed this issue 2 years ago

sgbaird commented 2 years ago

i.e. the ~30e6 materials from https://matterverse.ai/

shyuep commented 2 years ago

Yes. There is an undocumented, hacky API based on flamyngo. E.g., https://matterverse.ai/M3GNet_data_websearch/doc/mv-31324330/json gives you the JSON doc for mv-31324330. You should be able to get the entire dataset quite easily that way. We are building a proper API based on OPTIMADE, but that will take a bit of time. Perhaps you can let us know your interest and we can send you the entire JSON dump?
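A minimal sketch of fetching one document through the URL pattern above, using only the Python standard library. The API is undocumented, so the response schema is not assumed here, and the helper names `doc_url` and `fetch_doc` are hypothetical; only the URL pattern and the ID `mv-31324330` come from this thread.

```python
import json
from urllib.request import urlopen

# URL pattern taken from the example in this thread; the endpoint is
# undocumented, so it may change without notice.
BASE_URL = "https://matterverse.ai/M3GNet_data_websearch/doc"


def doc_url(mv_id: str) -> str:
    """Build the JSON-document URL for a matterverse ID such as 'mv-31324330'."""
    return f"{BASE_URL}/{mv_id}/json"


def fetch_doc(mv_id: str, timeout: float = 30.0):
    """Download and parse the JSON document for one ID (hypothetical helper).

    The structure of the returned object is not documented, so it is
    returned as parsed without any assumptions about its keys.
    """
    with urlopen(doc_url(mv_id), timeout=timeout) as resp:
        return json.load(resp)
```

Calling `fetch_doc("mv-31324330")` should return the parsed document for that entry if the endpoint is still live. Looping this over millions of IDs would be slow and would load the server, so for bulk use the full dump offered above is likely the better route.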

sgbaird commented 2 years ago

@shyuep thank you! I don't have an immediate intended use. Eventually, I'd be interested in combining the matterverse dataset with datasets from other generative models and doing a large-scale property-based screening similar to https://dx.doi.org/10.1126/sciadv.abn4117 and several other papers.

Also, where would you suggest uploading ongoing outputs from crystal generative models? NOMAD comes to mind as one that likely wouldn't require database expertise. FigShare, Zenodo, etc. are easy data dumps but without an API. Creating a search interface on a custom website is nice, but limited if there's no API or way to access the data (in full if needed) programmatically.

shyuep commented 2 years ago

I see. We already have an API and will make it OPTIMADE compliant in the near future. Feel free to use it and let us know if you need the whole set in a file. The problem here is of course that when we deal with MILLIONS or even BILLIONS of entries, any data dump is slow. E.g., a CIF file is easily 1-10 kB (assuming you just want the structure and not the history and other stuff). One million structures is already at least 1 GB.
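As a back-of-the-envelope check on those numbers, here is a small sketch that scales the 1-10 kB per-structure estimate to a given number of entries. The function name `dump_size_gb` is hypothetical, and decimal units (1 kB = 1000 bytes, 1 GB = 1e9 bytes) are an assumption.

```python
def dump_size_gb(n_structures: int, kb_per_structure: float) -> float:
    """Estimate the uncompressed dump size in GB (decimal units assumed)."""
    total_bytes = n_structures * kb_per_structure * 1000
    return total_bytes / 1e9


# One million structures at the low and high end of the CIF size estimate.
low = dump_size_gb(1_000_000, 1)
high = dump_size_gb(1_000_000, 10)
print(f"1e6 structures: {low:.0f}-{high:.0f} GB")

# The ~30e6 matterverse entries mentioned at the top of this thread.
low_all = dump_size_gb(30_000_000, 1)
high_all = dump_size_gb(30_000_000, 10)
print(f"30e6 structures: {low_all:.0f}-{high_all:.0f} GB")
```

Under these assumptions, one million structures come to 1-10 GB and the ~30e6 entries to 30-300 GB before compression, which is why a single-file dump gets unwieldy at this scale.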

When generating hypothetical structures, the space is orders of magnitude larger than what you get in an experiment-based system like Materials Project. The usual API support becomes very different in scale.

shyuep commented 2 years ago

I should also add that any work that merely shows the ability to generate structures below the convex hull is not stringent enough in my opinion (and m3gnet falls within that definition too). Without any ML or any other model, I can easily propose a gazillion DFT-Ehull < 0 compounds due to known errors in DFT. E.g., any peroxides/persulfides or higher superoxides or supersulfides are artificially stabilized. So DFT-Ehull < 0 is merely an initial proof of concept that an ML algo may be useful. It is by no means proof that an ML algo is actually useful in generating new structures. The value of the M3GNet IAP is not merely in relaxation and structure prediction, but in the MD simulations that can be done.

A structure prediction algorithm that generates REAL structures needs experimental proof. I can count on maybe my ten fingers how many works have actually demonstrated an ML/DFT -> real experimental synthesis of a material. There are very few works that show such stuff.