sgbaird opened this issue 1 year ago
Yes, there are certain elements that have non-unique codings in some of the encodings (hence the warning in #15). I can look into making a version of the mod-pettifor encoding that removes this issue.
TBH, I haven't yet looked into whether this is a bug or expected behavior.
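For reference, one quick way to surface which elements collide (a rough sketch; it assumes a local checkout of element-coder and that the raw mod_petti.json is a flat symbol → code mapping):

```python
import json
from collections import Counter

# assumption: flat {symbol: code} mapping; the path below is illustrative
with open("src/element_coder/data/raw/mod_petti.json") as f:
    mod_petti = json.load(f)

counts = Counter(mod_petti.values())
collisions = {sym: code for sym, code in mod_petti.items() if counts[code] > 1}
print(collisions)  # e.g. He plus the elements on L105-L120 that share its code
```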
Worked around it in the code. I just needed to remove the "symbols" column from the DataFrame I made. I wasn't using the "symbols" data anyway.
```python
mod_petti_df = pd.DataFrame(
    dict(symbol=_data.keys(), mod_petti=mod_petti_comp.keys(), contribution=mod_petti_comp.values()),
).sort_values("mod_petti")
```
changed to:
```python
mod_petti_df = pd.DataFrame(
    dict(mod_petti=mod_petti_comp.keys(), contribution=mod_petti_comp.values()),
).sort_values("mod_petti")
```
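Presumably the underlying problem was a length mismatch: when some symbols share a mod_pettifor code, `mod_petti_comp` ends up with fewer entries than `_data`, and pandas refuses to build a DataFrame from columns of different lengths. A contrived sketch of that failure mode (hypothetical symbols and placeholder codes, not the actual data):

```python
import pandas as pd

symbols = ["He", "Ne", "O", "Ti"]        # 4 symbols in the composition
codes = {0: 0.5, 60: 0.25, 51: 0.25}     # only 3 distinct codes (He and Ne collide)

# raises "ValueError: All arrays must be of the same length"
pd.DataFrame(dict(symbol=symbols, mod_petti=codes.keys(), contribution=codes.values()))
```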
Sorry for coming back to this so late.
Do you have a preferred way of solving this? I also do not like that the elements in https://github.com/kjappelbaum/element-coder/blob/fa6a02503449c9cc38017e98ba4475804a841dbf/src/element_coder/data/raw/mod_petti.json#L105-L120 all code to the same value as He. The question is only what to replace them with. I see the following options:
The following produces a list of 118 unique elements (disclaimer: contains unrealistic entries):
However, when encoding these in the "mod_pettifor" representation, there are only 103 unique values.
Not sure if #15 is related.
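Something along these lines reproduces the count mismatch (a rough sketch, not the exact script; it assumes element_coder's encode() and pymatgen's Element enumeration):

```python
from element_coder import encode
from pymatgen.core.periodic_table import Element

symbols = [el.symbol for el in Element]  # all 118 element symbols
codes = [encode(sym, "mod_pettifor") for sym in symbols]

# per the counts above: 118 symbols -> 103 unique mod_pettifor codes
print(len(set(symbols)), "symbols ->", len(set(codes)), "unique mod_pettifor codes")
```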
This is a blocker for using matbench-genmetrics with xtal2png + imagen-pytorch in https://github.com/sparks-baird/xtal2png/issues/204, but it's not super time-sensitive. The fact that it produces values spanning all 118 periodic elements, even though (I'm pretty sure) not all elements are represented in the training dataset, is a concern from the generative modeling standpoint.
For context, the script I'm running is https://github.com/sparks-baird/matbench-genmetrics/blob/main/scripts/load_imagen_pytorch_generated.py.