kjappelbaum / mofdscribe

An ecosystem for digital reticular chemistry
https://mofdscribe.readthedocs.io/en/latest/
MIT License
44 stars 7 forks source link

featurizer RACS() returns arrays containing NaNs #457

Open mehradans92 opened 1 year ago

mehradans92 commented 1 year ago

Hi @kjappelbaum. Just noticed that the featurizer RACS() returns some NaNs when I try to featurize some structures. The CIF files are coming from the QMOF dataset. Not sure if that a potential bug, or where this is coming from. Do you have any suggestions for troubleshooting?

kjappelbaum commented 1 year ago

Hi @mehradans92 thanks for the question.

This is a design choice, that I should have perhaps propagated into the public API https://github.com/kjappelbaum/mofdscribe/blob/9626b00d21c5abd54d5e4f2f6317dfa916461393/src/mofdscribe/featurizers/chemistry/racs.py#L34

It depends on whether you limit your neighbor search or if you let it expand the search radius. If I say I only look for neighbors on the metal node or linker, then there might not be a neighbor n bonds away. You can either set this to zero or to nan and currently it deals to nan. If you set the logging level accordingly (i.e. to debug, you should also see logging messages appear when it runs into those cases.

Note that some parts of this implementation are therefore different from the original one.

This behavior should have been documented a bit better, do you have other ideas on how we could improve it?

mehradans92 commented 1 year ago

Thanks for the explanation @kjappelbaum. I guess if you add an option as kwargs boolean for this it would be helpful. Or maybe you can just add what you described to the documentation. I am trying to understand how RACS() work based on your paper. For some reason, there are many non-zero features that are just constant values across a lot of different chemical structures (CIFs). Btw, the latex equations seem to not render correctly in the documentation. :)