softwareunderground / repro-zoo

Open & executable reproductions of figures and other results from papers in earth science & engineering.
Apache License 2.0

Integration of sparse and continuous data sets [...] for core mineralogy interpretation #36

Open kwinkunks opened 10 months ago


Integration of sparse and continuous data sets using machine learning for core mineralogy interpretation

Mayur Nawal, Bharath Shekar, and Priyank Jaiswal

https://doi.org/10.1190/tle42060421.1

In earth science, integrating noninvasive continuous data streams with discrete invasive measurements remains an open challenge. We address such a problem — that of predicting whole-core mineralogy using discrete measurements with the help of machine learning. Our targets are sparsely sampled mineralogy from X-ray diffraction, and features are continually sampled elemental oxides from X-ray fluorescence. Both data sets are acquired on a core cut from a Mississippian-age mixed siliciclastic-carbonate formation in the U.S. midcontinent. The novelty lies in predicting multiple classes of output targets from input features in a small multidimensional data setting. Our workflow has three salient aspects. First, it shows how single-output models are more effective in relating selective target-feature subsets than using a multi-output model for simultaneously relating the entire target-feature set. Specifically, we adopt a competitive ensemble strategy comprising three classes of regression algorithms — elastic net (linear regression), XGBoost (tree-based), and feedforward neural networks (nonlinear regression). Second, it shows that feature selection and engineering, when done using statistical relationships within the data set and domain knowledge, can significantly improve target predictability. Third, it incorporates k-fold cross-validation and grid-search-based parameter tuning to predict targets within 4%–6% accuracy using 40% training data. Results open doors to generating a wealth of information in energy, environmental, and climate sciences where remotely sensed data are inexpensive and abundant but physical sampling may be limited due to analytic, logistic, or economic issues.
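As a possible starting point for a reproduction, here is a minimal sketch of the workflow the abstract describes: one single-output model per mineral target, a competition between three model families, and k-fold cross-validated grid search on a 40% training split. Everything specific in it is an assumption rather than the authors' setup. The data are synthetic stand-ins for the XRF oxides and XRD mineralogy, the parameter grids are arbitrary, mean absolute error is just one reasonable selection metric, and "competitive ensemble" is read here as "keep the best cross-validated model per target". To stay within scikit-learn, `GradientBoostingRegressor` stands in for XGBoost and `MLPRegressor` for the feedforward network.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the paper's core measurements):
# X mimics continuously sampled oxide features, Y mimics sparse mineral targets.
n_samples, n_oxides, n_minerals = 60, 6, 3
X = rng.uniform(0.0, 1.0, size=(n_samples, n_oxides))
W = rng.uniform(0.0, 1.0, size=(n_oxides, n_minerals))
Y = X @ W + 0.05 * rng.normal(size=(n_samples, n_minerals))

# 40% of the samples for training, as stated in the abstract.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, train_size=0.4, random_state=0
)


def candidates():
    """Return fresh (estimator, parameter grid) pairs; grids are illustrative."""
    return {
        "elastic_net": (
            make_pipeline(StandardScaler(), ElasticNet(max_iter=10000)),
            {"elasticnet__alpha": [0.001, 0.01, 0.1],
             "elasticnet__l1_ratio": [0.2, 0.8]},
        ),
        # Stand-in for XGBoost (tree-based boosting).
        "boosted_trees": (
            GradientBoostingRegressor(random_state=0),
            {"n_estimators": [50, 100], "max_depth": [2, 3]},
        ),
        # Stand-in for the feedforward neural network.
        "mlp": (
            make_pipeline(StandardScaler(),
                          MLPRegressor(max_iter=2000, random_state=0)),
            {"mlpregressor__hidden_layer_sizes": [(8,), (16,)]},
        ),
    }


cv = KFold(n_splits=4, shuffle=True, random_state=0)
winners = {}

# One single-output model per target: each mineral gets its own competition.
for j in range(n_minerals):
    best_name, best_search = None, None
    for name, (model, grid) in candidates().items():
        search = GridSearchCV(model, grid, cv=cv,
                              scoring="neg_mean_absolute_error")
        search.fit(X_train, Y_train[:, j])
        if best_search is None or search.best_score_ > best_search.best_score_:
            best_name, best_search = name, search
    winners[j] = (best_name, best_search)

# Evaluate the winning model for each target on the held-out 60%.
Y_pred = np.column_stack(
    [winners[j][1].predict(X_test) for j in range(n_minerals)]
)
mae = np.abs(Y_pred - Y_test).mean(axis=0)

for j in range(n_minerals):
    print(f"target {j}: winner={winners[j][0]}, test MAE={mae[j]:.3f}")
```

A real reproduction would swap the synthetic arrays for the paper's XRF and XRD tables and add the feature selection and engineering step, which this sketch leaves out.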