Some elements dropped while encoding to mod_pettifor representation

sgbaird commented 1 year ago

The following produces a list of 118 unique elements (disclaimer: contains unrealistic entries):

np.unique([elem.symbol for elem in list(_data.keys())])
array(['Ac', 'Ag', 'Al', 'Am', 'Ar', 'As', 'At', 'Au', 'B', 'Ba', 'Be',
       'Bh', 'Bi', 'Bk', 'Br', 'C', 'Ca', 'Cd', 'Ce', 'Cf', 'Cl', 'Cm',
       'Cn', 'Co', 'Cr', 'Cs', 'Cu', 'Db', 'Ds', 'Dy', 'Er', 'Es', 'Eu',
       'F', 'Fe', 'Fl', 'Fm', 'Fr', 'Ga', 'Gd', 'Ge', 'H', 'He', 'Hf',
       'Hg', 'Ho', 'Hs', 'I', 'In', 'Ir', 'K', 'Kr', 'La', 'Li', 'Lr',
       'Lu', 'Lv', 'Mc', 'Md', 'Mg', 'Mn', 'Mo', 'Mt', 'N', 'Na', 'Nb',
       'Nd', 'Ne', 'Nh', 'Ni', 'No', 'Np', 'O', 'Og', 'Os', 'P', 'Pa',
       'Pb', 'Pd', 'Pm', 'Po', 'Pr', 'Pt', 'Pu', 'Ra', 'Rb', 'Re', 'Rf',
       'Rg', 'Rh', 'Rn', 'Ru', 'S', 'Sb', 'Sc', 'Se', 'Sg', 'Si', 'Sm',
       'Sn', 'Sr', 'Ta', 'Tb', 'Tc', 'Te', 'Th', 'Ti', 'Tl', 'Tm', 'Ts',
       'U', 'V', 'W', 'Xe', 'Y', 'Yb', 'Zn', 'Zr'], dtype='<U2')

However, when encoding these in the "mod_pettifor" representation, there are 103 unique values:

mod_petti = [encode(k, "mod_pettifor") for k in _data.keys()]
mod_petti_comp = dict(zip(mod_petti, _data.values()))

mod_petti_comp.keys()
dict_keys([23, 25, 93, 90, 101, 96, 59, 8, 69, 1, 16, 51, 80, 82, 12, 33, 64, 92, 26, 52, 55, 48, 20, 50, 70, 100, 53, 54, 71, 2, 94, 27, 57, 87, 65, 32, 24, 75, 79, 61, 63, 83, 43, 14, 10, 72, 15, 7, 86, 28, 3, 89, 62, 19, 22, 13, 81, 60, 67, 30, 0, 99, 56, 38, 34, 29, 21, 4, 31, 17, 36, 95, 66, 58, 74, 68, 85, 49, 45, 18, 73, 47, 77, 44, 91, 46, 98, 40, 37, 39, 78, 84, 76, 41, 88, 5, 97, 9, 6, 35, 42, 102, 11])
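
To see which symbols end up sharing a code, one option is to group them by encoded value. This is only a minimal sketch: `toy_codes` is a made-up mapping standing in for `encode(k, "mod_pettifor")`, and `find_collisions` is a hypothetical helper, not part of element-coder.

```python
from collections import defaultdict

# Hypothetical stand-in for encode(symbol, "mod_pettifor").
# The numbers are invented for illustration, not taken from the real table.
toy_codes = {"He": 0, "Ne": 1, "Og": 0, "Ts": 0, "Fe": 2}


def find_collisions(codes):
    """Return {code: [symbols]} for every code shared by more than one symbol."""
    groups = defaultdict(list)
    for symbol, code in codes.items():
        groups[code].append(symbol)
    return {code: syms for code, syms in groups.items() if len(syms) > 1}


collisions = find_collisions(toy_codes)
```

With the real data, the same grouping could be built from `{k.symbol: encode(k, "mod_pettifor") for k in _data}`, assuming `encode` accepts the keys as used above.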

Not sure if #15 is related.

This is a blocker for using matbench-genmetrics with xtal2png+imagen-pytorch in https://github.com/sparks-baird/xtal2png/issues/204, but not super time-sensitive. The fact that it's producing values for all 118 elements of the periodic table, even though (I'm pretty sure) not all of them are represented in the training dataset, is a concern from a generative-modeling standpoint.

For context, the script I'm running is https://github.com/sparks-baird/matbench-genmetrics/blob/main/scripts/load_imagen_pytorch_generated.py.

kjappelbaum commented 1 year ago

Yes, certain elements have non-unique codes in some of the encodings (hence the warning in #15). I can look into making a version of mod-pettifor that removes this issue.

TBH, I haven't yet looked into whether this is a bug or expected behavior.

sgbaird commented 1 year ago

Worked around it in the code. I just needed to remove the "symbol" column from the DataFrame I made. I wasn't using the "symbol" data anyway.

https://github.com/sparks-baird/matbench-genmetrics/blob/76dc21948b4a61eaa3224c56e289543aabacd985/src/matbench_genmetrics/utils/featurize.py#L66-L72

    mod_petti_df = pd.DataFrame(
        dict(symbol=_data.keys(), mod_petti=mod_petti_comp.keys(), contribution=mod_petti_comp.values()),
    ).sort_values("mod_petti")

changed to:

    mod_petti_df = pd.DataFrame(
        dict(mod_petti=mod_petti_comp.keys(), contribution=mod_petti_comp.values()),
    ).sort_values("mod_petti")
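
A likely reason dropping the column helps (an assumption, not confirmed in the thread): `dict(zip(...))` keeps only one entry per distinct code, so `mod_petti_comp` ends up shorter than `_data`, and a column built from `_data.keys()` no longer matches the other two in length. A toy illustration with made-up codes and contributions:

```python
# Made-up contributions and codes, standing in for _data and encode(...).
toy_data = {"He": 0.2, "Og": 0.3, "Fe": 0.5}
toy_codes = {"He": 0, "Og": 0, "Fe": 2}

codes = [toy_codes[symbol] for symbol in toy_data]

# Later duplicates overwrite earlier ones, so the entry for "He" is lost.
comp = dict(zip(codes, toy_data.values()))

n_symbols = len(toy_data)  # one entry per element
n_codes = len(comp)        # one entry per distinct code
```

Note that the overwritten contribution is silently discarded here, not added to the surviving entry.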

kjappelbaum commented 1 year ago

Sorry for coming back to this so late.

Do you have a preferred way of solving this? I also do not like that the entries at https://github.com/kjappelbaum/element-coder/blob/fa6a02503449c9cc38017e98ba4475804a841dbf/src/element_coder/data/raw/mod_petti.json#L105-L120 all code to the same value as He. The only question is what to replace them with. I see the following options:

1. Leave as is (this will raise the warning, and users need to decide how to deal with it)
2. Remove the duplicated entries (this will raise an exception if an element has no defined encoding; one could catch this and use some fill value)
3. Replace the values ourselves with something else
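
The second and third options could look roughly like the sketch below. Everything in it is hypothetical: the table values are invented, the helper names are not part of element-coder, and "keep the first symbol, move the rest past the current maximum" is just one possible replacement rule, since which values to use is exactly the open question.

```python
# Invented table in which "Og" and "Ts" collide with "He".
toy_table = {"He": 0, "Ne": 1, "Og": 0, "Ts": 0}


def drop_duplicates(table):
    """Option 2: keep only the first symbol seen for each code."""
    seen, cleaned = set(), {}
    for symbol, code in table.items():
        if code not in seen:
            seen.add(code)
            cleaned[symbol] = code
    return cleaned


def encode_or_fill(symbol, table, fill=None):
    """Option 2 with a fill value: return fill when the symbol has no encoding."""
    try:
        return table[symbol]
    except KeyError:
        return fill


def reassign_duplicates(table):
    """Option 3: give each later duplicate a fresh code past the current maximum."""
    seen, out = set(), {}
    next_code = max(table.values()) + 1
    for symbol, code in table.items():
        if code in seen:
            code = next_code
            next_code += 1
        seen.add(code)
        out[symbol] = code
    return out


cleaned = drop_duplicates(toy_table)
reassigned = reassign_duplicates(toy_table)
```

Both helpers rely on dictionary insertion order to decide which symbol keeps the original code, so the outcome depends on the order of entries in the table.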
