tiffany352 / rink-rs

Unit conversion tool and library written in rust
https://rinkcalc.app/about
GNU General Public License v3.0

Scrape some sort of physical properties database #2

Open whitequark opened 7 years ago

whitequark commented 7 years ago

I tried to do this today, expecting something with densities:

> 3.7 billion l * water -> ton
Conformance error: 7256921/200000, approx. 36.28460 giganewton (force) != 45359237/50000, approx. 907.1847 kilogram (mass)
Suggestions: divide left side by acceleration, multiply right side by acceleration

Then I tried to see what various substance names map to, and it's kind of a mess...

> water
Definition: water = gram force / cm^3 = 9806.65 pascal / meter (kg / m^2 s^2)
> mercury
Definition: mercury = 200.59 g / mol = 0.20059 kilogram / mole (molar_mass; kg / mol)
> milk
Definition: milk = 242 g / uscup = approx. 1022.874 kilogram / meter^3 (density; kg / m^3)
> oil
Definition: oil = 7.5 oz / uscup = approx. 898.6982 kilogram / meter^3 (density; kg / m^3)
> gasoline
Definition: gasoline_HHV = 125000 btu / usgallon = approx. 34.83953 gigapascal (pressure; kg / m s^2)
> air
Definition: air = 78.08 % nitrogen 2 + 20.95 % oxygen 2 + 9340 ppm argon + 400 ppm (carbon + oxygen 2) + 18.18 ppm neon + 5.24 ppm helium + 1.7 ppm (carbon + 4 hydrogen) + 1.14 ppm krypton + 0.55 ppm hydrogen 2 = approx. 0.02896790 kilogram / mole (molar_mass; kg / mol)

whitequark commented 7 years ago

I'm actually not even sure at all what the heck the water unit refers to; mmH2O? If yes why doesn't mercury do the same...

whitequark commented 7 years ago

I guess what I want is basically the totality of [insert chemical database here], accessible via f(formula | substance_name) where f = density, molar_weight, ...

tiffany352 commented 7 years ago

Water seems to be the weight per volume of water, and it is used to define mmH2O by multiplying it with length. Confusingly, "mercury" and "Hg" actually refer to different things.
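The conformance error in the original report follows from that definition: because `water` was a specific weight (force per volume), multiplying it by a volume yields a force, not a mass. Below is a minimal sketch of the arithmetic in plain `f64`; the function names are illustrative and are not part of Rink.

```rust
/// Standard gravity in m/s^2 (exact by definition).
const STANDARD_GRAVITY: f64 = 9.80665;

/// The old `water` unit: one gram-force per cm^3, in N/m^3 (equivalently Pa/m).
fn water_specific_weight() -> f64 {
    let gram_force_n = 1e-3 * STANDARD_GRAVITY; // weight of one gram, in newtons
    let cubic_centimeter_m3 = 1e-6;
    gram_force_n / cubic_centimeter_m3
}

/// Pressure under a water column of the given height, in pascals.
/// This is how a unit like mmH2O falls out of the definition.
fn water_column_pressure(height_m: f64) -> f64 {
    water_specific_weight() * height_m
}

fn main() {
    // 3.7 billion liters is 3.7e6 m^3; volume times specific weight is a force.
    let volume_m3 = 3.7e9 * 1e-3;
    let force_n = volume_m3 * water_specific_weight();
    println!("force: {:.4} GN", force_n / 1e9);
    println!("mmH2O: {:.5} Pa", water_column_pressure(1e-3));
}
```

The force comes out at about 36.28 GN, the left-hand side of the error message, which is why Rink suggested dividing by an acceleration.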

I would like to source data from a chemical database, but I haven't really been able to find one which either lets me download the data or has an API.

However, whether or not I have such a database I should still be able to resolve this problem by introducing an explicit notion of substances with multiple properties.

whitequark commented 7 years ago

"I would like to source data from a chemical database, but I haven't really been able to find one which either lets me download the data or has an API."

Summoning @bofh453

tiffany352 commented 7 years ago

I've added substances in a branch. It looks like this:

> density of test
1200 kilogram / meter^3 (density)
> mass of ml test
1.2 gram (mass)

(test being a made up substance)

Next I have to update the definitions file to use them.
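To illustrate the idea of a substance carrying several named properties, here is a rough sketch. It is not Rink's actual implementation: the `Substance` type, its methods, and the use of bare `f64` values in SI units are all assumptions made for the example.

```rust
use std::collections::HashMap;

/// A substance is a name plus a set of named properties.
/// Values are plain f64 in SI units here; a real implementation
/// would carry units alongside each value.
struct Substance {
    name: String,
    properties: HashMap<String, f64>,
}

impl Substance {
    fn new(name: &str) -> Self {
        Substance { name: name.to_string(), properties: HashMap::new() }
    }

    /// Builder-style helper to attach a property.
    fn with(mut self, property: &str, value: f64) -> Self {
        self.properties.insert(property.to_string(), value);
        self
    }

    /// Models a query like `density of test`.
    fn property(&self, property: &str) -> Option<f64> {
        self.properties.get(property).copied()
    }

    /// Models `mass of ml test`: mass (kg) = density (kg/m^3) * volume (m^3).
    fn mass_of(&self, volume_m3: f64) -> Option<f64> {
        self.property("density").map(|density| density * volume_m3)
    }
}

fn main() {
    // The made-up substance from the example above, 1200 kg/m^3.
    let test = Substance::new("test").with("density", 1200.0);
    let milliliter_m3 = 1e-6;
    if let Some(mass_kg) = test.mass_of(milliliter_m3) {
        println!("mass of ml {}: {:.1} g", test.name, mass_kg * 1000.0);
    }
    // A property that was never defined is simply absent.
    assert!(test.property("molar_mass").is_none());
}
```

Returning `Option` keeps a missing property (say, a molar mass for milk) distinct from a zero value, which matters once substances have uneven coverage.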

bofh453 commented 7 years ago

This turns out to be shockingly hard. Like, in general (especially for "engineering" properties such as the bulk/elastic/shear/Young's moduli), the only solution is scraping papers and patents (this often requires OCRing first, especially for the patents. Okay, less so now, but you still sometimes need to double-check Google Patents OCR'd the thing correctly).

That being said, there's a bunch of stopgaps. That's the good thing. The bad thing is most require at minimum registration. The two big ones to start with are http://www.chemnetbase.com/ (ChemNetBase) and http://www.chemspider.com/ (RSC ChemSpider). Both have fairly comprehensive web APIs for bulk-fetching data; I believe the latter's is still open to anyone without a login, but I'm not sure.

Edit: NIST has already nicely scraped the CRC handbook into a DB for you. Available here: https://www.nist.gov/pml/productsservices/physical-reference-data

For simple stuff, such as atoms and simple compounds, just scrape all of the CRC Handbook of Chemistry and Physics into a text file. Someone has probably already converted it into something structured, though you can already get something almost easily parsable just by grabbing the 2014 copy of it from libgen and running pdftotext -raw CRCHandbook.pdf CRCHandbook.txt.

This handbook, btw, is the source of most of the periodic table data you've seen anywhere, though it may have hopped through 3-15 reprints to get there. Turns out both getting and aggregating experimental data is hard.

Other useful things:

bofh453 commented 7 years ago

Oh, one more thing: NIST has a ton of spectral data easily available: http://webbook.nist.gov/chemistry/name-ser.html

No official API, but seeing as I can basically programmatically fetch things by hand using extremely trivial curl POST requests, and there are no ratelimits, well, yeah.

tiffany352 commented 7 years ago

Not quite as convenient as I was hoping for, but I'll definitely have a go at obtaining the data from these sources.

tiffany352 commented 7 years ago

I've pushed support for substances to master. The original issue should be resolved now, but I will leave this open for the second part about sourcing data.

> 3.7 billion l * water -> ton
water: volume = 3700000 meter^3; mass = approx. 4078551.8 shortton
> water
water: density = 1 gram -> 1000 millimeter^3; fusion_heat = 8352666/25000, approx. 334.1066 joule -> 1 gram; pressure_column = 98.0665 pascal -> 10 millimeter; pressure_column_0C = approx. 98.05375 pascal -> 10 millimeter; pressure_column_100C = approx. 93.98497 pascal -> 10 millimeter; pressure_column_10C = approx. 98.04002 pascal -> 10 millimeter; pressure_column_15C = approx. 97.98118 pascal -> 10 millimeter; pressure_column_18C = approx. 97.93116 pascal -> 10 millimeter; pressure_column_20C = approx. 97.89292 pascal -> 10 millimeter; pressure_column_25C = approx. 97.77916 pascal -> 10 millimeter; pressure_column_50C = approx. 96.89656 pascal -> 10 millimeter; pressure_column_5C = approx. 98.06551 pascal -> 10 millimeter; specific_heat = 4.1868 kilogray -> 1 kelvin; vaporization_heat = 1.16 kilojoule -> 1 gram
> mercury
mercury: density = 13.5951 gram -> 1000 millimeter^3; molar_mass = 200.59 gram -> 1 mole; pressure_column = approx. 1.333223 kilopascal -> 10 millimeter; pressure_column_10C = approx. 1.330840 kilopascal -> 10 millimeter; pressure_column_20C = approx. 1.328428 kilopascal -> 10 millimeter; pressure_column_23C = approx. 1.327683 kilopascal -> 10 millimeter; pressure_column_30C = approx. 1.326025 kilopascal -> 10 millimeter; pressure_column_40C = approx. 1.323632 kilopascal -> 10 millimeter; pressure_column_60F = approx. 1.329526 kilopascal -> 10 millimeter; specific_heat = 140 gray -> 1 kelvin
> milk
milk: density = 242 gram -> 473176473/2000, approx. 236588.2 millimeter^3
> oil
oil: specific_energy = 41.868 gigajoule -> 45359237/50000, approx. 907.1847 kilogram (Ton oil equivalent.  A conventional value for the energy released by burning one metric ton of oil. [18,E2] Note that energy per mass of petroleum products is fairly constant. Variations in volumetric energy density result from variations in the density (kg/m^3) of different fuels. This definition is given by the IEA/OECD.)
> gasoline
gasoline: energy_density_HHV = approx. 131.8819 megajoule -> 473176473/125, approx. 3785411.7 millimeter^3; energy_density_LHV = approx. 121.3314 megajoule -> 473176473/125, approx. 3785411.7 millimeter^3; specific_heat = 2.22 kilogray -> 1 kelvin
> air
air: Average molecular weight of air. molar_mass = approx. 28.96790 gram -> 1 mole
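As a sanity check on the figures above, the water and milk results can be reproduced by hand. This is a small sketch in plain `f64`, with hypothetical helper names; the two exact fractions are taken from the output itself.

```rust
/// Kilograms per US short ton, using the exact fraction from the output above.
const SHORT_TON_KG: f64 = 45359237.0 / 50000.0;
/// Cubic millimeters per US cup, from the fraction in the milk output.
const US_CUP_MM3: f64 = 473176473.0 / 2000.0;

/// Convert a density given as `grams` per `volume_mm3` into kg/m^3.
fn density_kg_per_m3(grams: f64, volume_mm3: f64) -> f64 {
    (grams * 1e-3) / (volume_mm3 * 1e-9)
}

/// Mass in short tons of `liters` of a substance with the given density.
fn short_tons(liters: f64, density: f64) -> f64 {
    let volume_m3 = liters * 1e-3;
    volume_m3 * density / SHORT_TON_KG
}

fn main() {
    // water: density = 1 gram -> 1000 millimeter^3
    let water = density_kg_per_m3(1.0, 1000.0);
    // milk: density = 242 gram -> one US cup
    let milk = density_kg_per_m3(242.0, US_CUP_MM3);
    println!("water: {:.1} short tons", short_tons(3.7e9, water));
    println!("milk: {:.3} kg/m^3", milk);
}
```

Both values agree with the output: roughly 4078551.8 short tons for the water query, and about 1022.874 kg/m^3 for milk, the same density the old definition reported.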

tiffany352 commented 7 years ago

@bofh453 I'm having some difficulty with these sources. The NIST data only seems to have a small subset of what the CRC handbook offers - it doesn't seem to have any properties other than stuff like molar mass and ionization energy of the elements. I already have molar masses for all the elements, but the data isn't cited. CHEMnetBase wants me to log in with a subscribing organization to view data, and ChemSpider seems to only have predicted properties for the queries I've tried so far - are these predicted properties accurate?

Unless I'm missing something, I may have to obtain a PDF of the CRC handbook and get the data out like you said.

For reference, here's some of the properties I'm interested in (don't necessarily need or want all of them at the same time):

As for which substances I want properties for, I'd like to get all of the elements (possibly for more than one isotope, e.g. uranium-238 and uranium-235) as well as a number of common materials like stone, wood, glass, steel, oil, and gasoline.

Does the CRC handbook even have this data? You did say engineering properties are difficult to come by, and that's pretty much exactly what I'm looking for... I'm not sure where to start with OCRing patents, but that sounds like quite a lot of manual work to extract that for 118 elements. Should I give up on getting this data for elements in general and focus on the materials I mentioned? That way the data set is small enough that I can enter it by hand.

whitequark commented 7 years ago

https://twitter.com/MatWeb/status/783691423107481600

JasperWallace commented 3 years ago

Dwarf Fortress has been slowly building a list of material properties with help from the players on the forums. I'm not sure about the license on that collection, though:

https://dwarffortresswiki.org/index.php/DF2014:Material_definition_token