sarahshi / mineralML

mineralML: Machine Learning for Probabilistic Mineral Classification
https://mineralML.readthedocs.io/
GNU General Public License v3.0

Perhaps don't make Mineral a default column, or return an explanation #3

Closed PennyWieser closed 4 months ago

PennyWieser commented 4 months ago

I am processing some data off the SEM where I don't want to guess the phases initially.

If I don't have a Mineral column, it returns a KeyError. I think it should be possible for people not to enter a mineral, and have it filled in as 'unknown' or something similar. It could then print a warning, e.g. that if you don't give it a mineral, it can't fill certain columns.
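A minimal sketch of the fallback suggested above, using pandas. The helper name `ensure_mineral_column` and the placeholder `'Unknown'` are hypothetical and not part of mineralML; the sketch only illustrates the idea of warning instead of raising a KeyError.

```python
import warnings

import pandas as pd


def ensure_mineral_column(df, placeholder="Unknown"):
    """Return a copy of df that always has a 'Mineral' column.

    Hypothetical helper for illustration; not a mineralML function.
    """
    out = df.copy()
    if "Mineral" not in out.columns:
        # Warn rather than fail, so unlabelled SEM data can still be processed.
        warnings.warn(
            f"No 'Mineral' column found; filling it with {placeholder!r}. "
            "Columns that depend on a known mineral cannot be filled.",
            UserWarning,
        )
        out["Mineral"] = placeholder
    return out


# Example: oxide data with no phase labels.
df = pd.DataFrame({"SiO2": [40.1, 39.8], "MgO": [45.2, 44.9]})
df = ensure_mineral_column(df)
```

The input dataframe is copied, so the caller's data is not modified in place.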

Similarly, if I enter df['Mineral']='Ol' because I've forgotten what MinML calls it, it returns an empty dataframe for df_pred_nn, but if I do df['Mineral']='Olivine' it works.

Again, I feel that if someone enters something that isn't a recognised name in MinML, it should either return an error explaining this, or replace it with a supported 'NaN'-esque category.
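Both options could look roughly like the sketch below. The function `check_mineral_names`, its `on_unknown` argument, and the `'Unknown'` category are hypothetical; the list of supported names is copied from the `include_minerals` list shown in the traceback in this thread, and whether it is the complete set of accepted names is an assumption.

```python
import pandas as pd

# Names copied from include_minerals in the traceback; assumed to be the full set.
SUPPORTED_MINERALS = [
    "Amphibole", "Biotite", "Clinopyroxene", "Garnet", "Ilmenite",
    "KFeldspar", "Magnetite", "Muscovite", "Olivine", "Orthopyroxene",
    "Plagioclase", "Spinel",
]


def check_mineral_names(df, on_unknown="raise"):
    """Validate the 'Mineral' column (hypothetical helper, not mineralML API).

    on_unknown="raise"   -> raise a ValueError listing the bad names
    on_unknown="replace" -> relabel unrecognised rows as 'Unknown'
    """
    out = df.copy()
    unknown = ~out["Mineral"].isin(SUPPORTED_MINERALS)
    if unknown.any():
        bad = sorted(out.loc[unknown, "Mineral"].astype(str).unique())
        if on_unknown == "raise":
            raise ValueError(
                f"Unrecognised mineral name(s): {bad}. "
                f"Supported names: {SUPPORTED_MINERALS}"
            )
        out.loc[unknown, "Mineral"] = "Unknown"
    return out


# Example: 'Ol' is not a supported name, so it is relabelled instead of dropped.
df = pd.DataFrame({"Mineral": ["Ol", "Olivine"], "MgO": [45.2, 44.9]})
df = check_mineral_names(df, on_unknown="replace")
```

Either way the user gets feedback about the unsupported name rather than an empty dataframe.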

df_load = mm.load_df('Berkeley_EDS_tests.csv')

df_load['Mineral']='Olivine'

df_nn, _ = mm.prep_df_nn(df_load)
df_pred_nn, probability_matrix = mm.predict_class_prob_nn(df_nn)
df_pred_nn

----> 3 df_nn, _ = mm.prep_df_nn(df_load)
      4 df_pred_nn, probability_matrix = mm.predict_class_prob_nn(df_nn)
      5 df_pred_nn

c:\Users\penny\anaconda3\Lib\site-packages\mineralML\supervised.py in ?(df)
     74     include_minerals = ['Amphibole', 'Biotite', 'Clinopyroxene', 'Garnet', 'Ilmenite',
     75                         'KFeldspar', 'Magnetite', 'Muscovite', 'Olivine', 'Orthopyroxene',
     76                         'Plagioclase', 'Spinel']
     77     exclude_minerals = ['Tourmaline', 'Quartz', 'Rutile', 'Apatite', 'Zircon']
---> 78     df.dropna(subset=oxidesandmin, thresh=6, inplace=True)
     79
     80     if 'Mineral' in df.columns:
     81         include_minerals = ['Amphibole', 'Biotite', 'Clinopyroxene', 'Garnet', 'Ilmenite',

c:\Users\penny\anaconda3\Lib\site-packages\pandas\core\frame.py in ?(self, axis, how, thresh, subset, inplace, ignore_index)
   6417             ax = self._get_axis(agg_axis)
   6418             indices = ax.get_indexer_for(subset)
   6419             check = indices == -1
   6420             if check.any():
-> 6421                 raise KeyError(np.array(subset)[check].tolist())
   6422             agg_obj = self.take(indices, axis=agg_axis)
   6423
   6424         if thresh is not lib.no_default:

KeyError: ['Mineral']

Berkeley_EDS_tests.xlsx


PennyWieser commented 4 months ago

On a related note, I have just tried data from the SEM where it's all olivine and Opx, so there is no K2O or Na2O.

Right now I get KeyError: ['Na2O', 'K2O']. I wonder whether, if columns are missing, we could just fill them with zeros (this is what Thermobar does), possibly with a printed warning like 'no Na2O, we have assumed it's zero for all samples'.
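A rough sketch of that zero-fill behaviour is below. The helper `fill_missing_oxides` is hypothetical, and the `OXIDES` list is an assumed set chosen for illustration; the columns mineralML actually expects may differ.

```python
import warnings

import pandas as pd

# Assumed oxide columns, for illustration only; not taken from mineralML.
OXIDES = ["SiO2", "TiO2", "Al2O3", "FeOt", "MnO", "MgO", "CaO", "Na2O", "K2O", "Cr2O3"]


def fill_missing_oxides(df, oxides=OXIDES):
    """Add any missing oxide column as zeros and warn about each one.

    Hypothetical helper; not a mineralML or Thermobar function.
    """
    out = df.copy()
    missing = [ox for ox in oxides if ox not in out.columns]
    for ox in missing:
        warnings.warn(
            f"No {ox} column; assuming it is zero for all samples.",
            UserWarning,
        )
        out[ox] = 0.0
    return out


# Example: olivine/Opx analyses measured without Na2O or K2O.
df = pd.DataFrame({"SiO2": [40.1, 55.3], "MgO": [45.2, 30.1], "FeOt": [14.0, 12.5]})
df = fill_missing_oxides(df)
```

Only columns that are entirely absent are filled here; missing values inside an existing column are left untouched.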

sarahshi commented 4 months ago

Excellent point. This highlights a choice I made early in development: separating the minerals that are actually classified by the neural network from those we identify based solely on geochemistry (tourmaline, quartz, rutile, apatite, zircon).

include_minerals = ['Amphibole', 'Biotite', 'Clinopyroxene', 'Garnet', 'Ilmenite', 'KFeldspar', 'Magnetite', 'Muscovite', 'Olivine', 'Orthopyroxene', 'Plagioclase', 'Spinel']
exclude_minerals = ['Tourmaline', 'Quartz', 'Rutile', 'Apatite', 'Zircon']

This is why prep_df_nn can't recognize abbreviations: it searches only that list of minerals. In retrospect this seems a bit silly. I have removed this filtering from prep_df_nn, and the classification will happen later on. prep_df_nn now returns only one dataframe. Additionally, those missing columns are filled. See commit 3ba10bd for the full resolution.