lamalab-org / xtal2txt

MIT License
6 stars 0 forks source link

feat: extract and serialize local environments #53

Closed kjappelbaum closed 5 months ago

kjappelbaum commented 5 months ago

implements draft of #50

Decomposes structure in unique local environments. Serializes those local environments into SMILES. Optionally creates a text string based on the SMILES, Wyckoff symbols and central atoms

@n0w0f what is the current strategy you use for dependency? Do you use pdm or something else?

kjappelbaum commented 5 months ago

@n0w0f can you remind me if there is a shared location with the structures? Then I could try to have a first run of this script.

n0w0f commented 5 months ago

@n0w0f can you remind me if there is a shared location with the structures? Then I could try to have a first run of this script.

we have a copy of the data at /work/structllm_data draco, I also have it in the gdrive.

n0w0f commented 5 months ago

@n0w0f what is the current strategy you use for dependency? Do you use pdm or something else?

pdm at some point started causing issue for me, now i just maintain the conda env and pyproject.toml is updated manually.

kjappelbaum commented 5 months ago

@n0w0f I now added info about the spacegroup (optionally).

A bit ugly is that one of the methods now also returns the space-group string. In the past, I worked with IStructure, which allows convenient caching-based design patterns. But IStructure is supposed to be deprecated.

The other alternative would be to share the symmetrized structure via the object's state. But then we would need to create a new object for every structure.

n0w0f commented 5 months ago

The other alternative would be to share the symmetrized structure via the object's state. But then we would need to create a new object for every structure.

I also favor the current implementation over having a object's state.

If you are concerned about having another return variable, we can add spacegroup to env itself, wherein env is nested with spacegroup and list[sites].

kjappelbaum commented 5 months ago

ok, I'll now just run some mining with the current implementation

kjappelbaum commented 5 months ago

Let's see how it goes, perhaps we can really see some redundance:

Total number of structures: 1965
Total number of environments: 8489
Total number of unique environments: 5296
kjappelbaum commented 5 months ago

@n0w0f let me know what I should still address in this PR. Otherwise, it might be good to merge if you want to have it in the codebase (to prevent that we pile up merge conflicts)

n0w0f commented 5 months ago

@n0w0f let me know what I should still address in this PR. Otherwise, it might be good to merge if you want to have it in the codebase (to prevent that we pile up merge conflicts)

I think we can merge this already. and maybe add smiles as a representation in a separate PR

kjappelbaum commented 5 months ago

ok, then let me check the failing test and I'll merge