jacksund / simmate

The Simulated Materials Ecosystem (Simmate) is a toolbox and framework for computational materials research.
https://simmate.org
BSD 3-Clause "New" or "Revised" License

add postgres prebuilds #445

Open jacksund opened 1 year ago

jacksund commented 1 year ago

Describe the desired feature

Just like the prebuilds we already have for sqlite3, we can offer the same for Postgres using dump files: https://www.postgresql.org/docs/8.1/backup.html
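As a rough illustration of what producing and loading such a dump file could involve, the sketch below builds `pg_dump` and `pg_restore` command lines. The helper names, the database name, and the file name are hypothetical and not part of Simmate; only the `pg_dump`/`pg_restore` flags are standard PostgreSQL options.

```python
from pathlib import Path


def build_dump_command(dbname: str, dump_file: Path) -> list[str]:
    """Return a pg_dump command that writes a compressed custom-format archive."""
    # --format=custom produces an archive that pg_restore can load selectively
    return ["pg_dump", "--format=custom", f"--file={dump_file}", dbname]


def build_restore_command(dbname: str, dump_file: Path, jobs: int = 4) -> list[str]:
    """Return a pg_restore command that loads the archive into an existing database."""
    return [
        "pg_restore",
        "--no-owner",        # do not try to restore the original role ownership
        f"--jobs={jobs}",    # parallel restore, supported for custom-format archives
        f"--dbname={dbname}",
        str(dump_file),
    ]


# Hypothetical usage (requires the PostgreSQL client tools on PATH):
#   import subprocess
#   subprocess.run(build_dump_command("simmate", Path("simmate.dump")), check=True)
#   subprocess.run(build_restore_command("simmate", Path("simmate.dump")), check=True)
```

Whether a restore like this beats loading the CSV archives would still need to be measured.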

scott-materials commented 1 year ago

I've used this in the past for exporting & importing databases, and I run an automated dump every night for backup.

Anything you're planning to implement here?

jacksund commented 1 year ago

The plan is to...

If you're interested in sharing the lab's calculation results with others, we can eventually add some Warren Lab data to the prebuilds too.

jacksund commented 1 year ago

I'm not sure if this will be faster or slower than the load-remote-archives command, so there's a chance I scrap this feature too

scott-materials commented 1 year ago

In my experience, it takes 4-12 hours to grab all the data for Postgres when loading all archives. I presume the bottleneck is that the CDN is rate limited.

For reference, I believe our database with all matproj, jarvis, cod, + oqmd data is 4 GB. I'm not sure how that compares to the original data stored at the CDN.

jacksund commented 1 year ago

> I presume the bottleneck is that the CDN is rate limited.

Downloading from the CDN is actually really quick and only takes a few minutes with UNC's crazy internet speeds. The slow part is taking that CSV data and saving it to your Postgres database. Right now, the bottleneck is recalculating the MatProj hull energies for all systems, so I need to cache these. Once cached, I bet the load-remote-archives command will only take ~1-2 hrs.
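To illustrate the caching idea in general terms, here is a minimal in-memory sketch that computes a value once per chemical system and reuses it for every entry in that system. This is not Simmate's implementation: `hull_reference` is a stand-in with a placeholder return value, and `chemical_system` is a hypothetical helper. A persistent cache (for example, a database column) would follow the same pattern.

```python
from functools import lru_cache


def chemical_system(elements: list[str]) -> str:
    """Build an order-independent key such as 'Cl-Na' from a list of elements."""
    return "-".join(sorted(set(elements)))


@lru_cache(maxsize=None)
def hull_reference(system: str) -> float:
    """Stand-in for an expensive per-system hull calculation (placeholder value)."""
    # Assumption: the real calculation depends only on the chemical system,
    # so one result can be shared by every entry in that system.
    return float(len(system.split("-")))


# Three entries, but only two distinct chemical systems
entries = [["Na", "Cl"], ["Cl", "Na"], ["Fe", "O"]]
values = [hull_reference(chemical_system(e)) for e in entries]

# cache_info() reports how many calls were computed versus served from the cache
info = hull_reference.cache_info()
```

Here the second entry hits the cache because both orderings of Na and Cl map to the same key.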

> Not sure how it compares to the original data that is stored at the cdn.

You can look at the files in ~/simmate/sqlite-prebuilds/ to see what's stored in the CDN. These are really just CSV files compressed into a ZIP, and I think they're normally ~1-2 GB.
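For a quick size comparison, a small script along these lines could list what each ZIP contains together with compressed and uncompressed sizes. It is a sketch using only the standard library; the function name is made up here, and the `*.zip` glob is an assumption about how the files in that folder are named.

```python
import zipfile
from pathlib import Path


def summarize_archive(archive) -> list[tuple[str, int, int]]:
    """Return (member name, compressed bytes, uncompressed bytes) for each file.

    `archive` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        return [
            (info.filename, info.compress_size, info.file_size)
            for info in zf.infolist()
        ]


def summarize_directory(folder: Path) -> dict[str, list[tuple[str, int, int]]]:
    """Summarize every ZIP file found directly inside `folder`."""
    return {path.name: summarize_archive(path) for path in sorted(folder.glob("*.zip"))}


# Hypothetical usage against the folder mentioned above:
#   folder = Path.home() / "simmate" / "sqlite-prebuilds"
#   for name, members in summarize_directory(folder).items():
#       total = sum(size for _, _, size in members)
#       print(f"{name}: {total / 1e9:.2f} GB uncompressed")
```

Summing the uncompressed sizes gives a number that can be set against the ~4 GB Postgres database mentioned earlier.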