Translate structure into a unique key

samwaseda commented 2 years ago

It's been some time since I started wondering whether there is a simple way to translate a structure to a unique key, in order to use it in the job name definition. I post it here because firstly someone might have an even better idea and also because I want to be able to find it by searching it on GitHub.

def struct_to_tag(structure):
    repeat = (np.ones(3) + structure.pbc).astype(int)
    structure = structure.repeat(repeat)
    d = np.log(structure.get_distances_array(structure.positions, structure.positions) + 1)
    Jij = np.array([np.prod([np.log(ord(ss)) for ss in s]) for s in structure.get_chemical_symbols()])
    indices = np.triu_indices(len(d), 1)
    v = np.sum(np.prod(Jij[np.stack(indices, axis=-1)], axis=-1) * d[indices])
    return np.log(np.log(v + 1) + 1)

I guess any structures which deliver different energy values would give different values with this function.

liamhuber commented 2 years ago

If you don't care about the key being meaningful, you can just hash the structure. I've done this before. I don't remember if I was able to pass the structure object directly or if I had to concatenate the cell, positions, and species all together as a big string (or similar) first.

samwaseda commented 2 years ago

Yeah I just want to make sure that the tag doesn't change for changing the order etc. I mean, as long as physics is the same, I want it to carry the same tag.

liamhuber commented 2 years ago

Yeah I just want to make sure that the tag doesn't change for changing the order etc. I mean, as long as physics is the same, I want it to carry the same tag.

That's a tall order...need to do all sorts of symmetry checks and stuff, yeah? Also pristine bulk cell 1x1x1 has the same physics as a pristine bulk cell 9x9x9, so you'd need to decide how to handle that. Once you have such a checker that identifies "equivalent" structures, I guess you could still just hash the canonical representation of that structure, right? I think the checker is the hard part.

pmrv commented 2 years ago

Yeah I just want to make sure that the tag doesn't change for changing the order etc. I mean, as long as physics is the same, I want it to carry the same tag.

Just sort the positions and ensure the cell is a right-handed coordinate system before hashing. Hashing has the advantage that it's easy to follow rather than some bespoke formulae we might come up with.

Also pristine bulk cell 1x1x1 has the same physics as a pristine bulk cell 9x9x9, so you'd need to decide how to handle that. Once you have such a checker that identifies "equivalent" structures, I guess you could still just hash the canonical representation of that structure, right? I think the checker is the hard part.

I would not make supercells hash to the same key as the unit cell. From a computational point these are different structures and some code could even reasonable return different energies for them (k-point sampling, etc.).

samwaseda commented 2 years ago

ok how does hashing actually work? Do you have a specific example?

samwaseda commented 2 years ago

I would not make supercells hash to the same key as the unit cell. From a computational point these are different structures and some code could even reasonable return different energies for them (k-point sampling, etc.).

I agree.

I actually didn't talk about the motivation, but what motivated me to make this was because I was launching some calculations with grain boundaries in DFT, where I distribute solute atoms randomly. When the number of solute atoms is low, it's possible that I do the same calculation multiple times, especially if I only enumerate the boxes. With the function I proposed above, I wanted to avoid such a situation. Therefore, actually we need the distances between the atoms, instead of absolute positions.

pmrv commented 2 years ago

Use the builtin hashlib. You'll need to feed it raw bytes, but converting strings and numpy arrays to bytes should be easy.

samwaseda commented 2 years ago

So following @pmrv and @liamhuber's comments, I wrote the following function:

def struct_to_tag(structure):
    d = structure.get_distances_array(structure.positions, structure.positions)
    indices = np.triu_indices(len(d), 1)
    d = np.log(d[indices])
    c = structure.g et_chemical_symbols()[np.stack(indices, axis=-1)]
    d_round = np.round(d, decimals=8)
    Jij = [hash(''.join(np.sort(cc))) for cc in c]
    E = np.sum(Jij * d_round)
    cell = np.round(structure.cell, decimals=8)
    arr = tuple(structure.pbc) + tuple(cell.flatten()) + (E, )
    return hash(arr)

Test:

axis = [2, 0, 1]
plane = [0, 1, 0]
sigma = 5
bulk = pr.create.structure.bulk('Fe', cubic=True)
structure = pr.create.structure.aimsgb.build(axis=axis, plane=plane, sigma=sigma, initial_struct=bulk)
print(struct_to_tag(structure))
equivalent_atoms = structure.get_symmetry()['equivalent_atoms']
indices = np.where(equivalent_atoms == 0)[0]
structure[indices[0]] = 'Ni'
first_tag = struct_to_tag(structure)
print(first_tag)
structure[indices[0]] = 'Fe'
structure[indices[1]] = 'Ni'
second_tag = struct_to_tag(structure)
print(second_tag)
print(first_tag == second_tag)

Output:

-8825262304291236675
1241634909506372990
1241634909506372990
True

pmrv commented 2 years ago

You'll have to use the proper hashlib and not the builtin function hash. The latter is just a representation of the memory location where an object lives in the interpreter.

Is there any reason why you take the log of the distances?

EDIT:

I would just:

prepare a bytesarray to hold the data,
normalize the cell somehow (axes in order of length maybe, respecting a right hand coordinate system)
put the raw numbers in the bytearray,
sort the atoms by distance to the origin,
append the positions (or distances like you do above) to the bytearray,
append the chemical element tags to the bytearray,
pass the whole array to hashlib.sha256().update(bytearray).hexdigest()

That should roughly guarantee rotation (cell normalization) and permutation (atom sorting) invariance. If you want to have proper invariances encoded in the hash, you essentially need to go to machine learning descriptors (with all the added complexity). DScribe e.g. has an implementation of MBTR which is a sets descriptors on a full structure. Overkill though, if you just want to generate job names...

samwaseda commented 2 years ago

log is in order to kill the linearity of the distance summation, i.e. if there are two pairs of atoms with the distance d, and another pair of atoms with the distance 2d, they should not deliver the same value by summation. This is easily done with log because log(2d) != 2*log(d).

pmrv commented 2 years ago

log is in order to kill the linearity of the distance summation, i.e. if there are two pairs of atoms with the distance d, and another pair of atoms with the distance 2d, they should not deliver the same value by summation. This is easily done with log because log(2d) != 2*log(d).

Your E term is basically a pair potential and they are known to be unable to differentiate general point clouds. Anything you construct out of purely pair (or even triple) contributions cannot give you a unique map into "structure space".

pmrv commented 2 years ago

Also see my sneak edit above.

samwaseda commented 2 years ago

Can you make an example where two different structures deliver the same value of E?

pmrv commented 2 years ago

Two pairs at distance d and one pair at distance d^2 by the same logic as your post above, but see this paper why it's not possible in general.

pmrv commented 2 years ago

Fair point: the linked paper discussed atom centered descriptors, which is not what you're doing, but iirc there's a citation in there to a math paper that shows that just the (full) list of distances is not enough to reconstruct a general point cloud. Therefore any function you define on the list of distances cannot be injective.

samwaseda commented 2 years ago

Anyway we cannot make any injective mapping, because we are talking about reducing the structure into a few characters. So the mathematical incompleteness means very little here. As long as there is no physically meaningful case in which two distinct structures deliver exactly the same value of E, it's not more dangerous than relying on hashlib.

And with this in mind: what's the physically meaningful situation in which there's a pair at d and another pair at d^2?

pmrv commented 2 years ago

Anyway we cannot make any injective mapping, because we are talking about reducing the structure into a few characters. So the mathematical incompleteness means very little here. As long as there is no physically meaningful case in which two distinct structures deliver exactly the same value of E, it's not more dangerous than relying on hashlib.

Imo, hashing has the advantage that it's "stupid" code. Easily read and most likely won't have to change in the future. If we start tinker with clever math formulations, which may or may not turn out to be insufficient in the future. Once that happens we have to change it, which means new "keys" clash with old ones, etc.

And with this in mind: what's the physically meaningful situation in which there's a pair at d and another pair at d^2?

I don't know, it seems as relevant to me as your example at 2d. But I imagine both cases easily appear in dimer and murnaghan calculations.

pyiron / pyiron_atomistics

Translate structure into a unique key #698