Is there a way to transform a classical xyz file to the extended xyz format you are using in reference_data/inputs? [question]

lab-cosmo / librascal

A scalable and versatile library to generate representations for atomic-scale learning

https://lab-cosmo.github.io/librascal/

GNU Lesser General Public License v2.1


Is there a way to transform a classical xyz file to the extended xyz format you are using in reference_data/inputs? [question] #409

Closed UnixJunkie closed 2 years ago

UnixJunkie commented 2 years ago

Hello, it seems you are using an extended xyz file format. The second line for each molecule is a comment with a Lattice specification and other information. Is there a tool or function to automatically create such files given a "classic" xyz file? I am interested in a non-repeating lattice (this is not a crystal, just an isolated molecule), big enough to hold the whole molecule, plus some margin on all axes, I guess. Thanks a lot, F.

UnixJunkie commented 2 years ago

Alternatively, if there is an automatic way to generate the kind of json files you have in reference_data/inputs, that might be similarly useful.

max-veit commented 2 years ago

Hello, the easiest way to do this is probably with ASE: just read in the file and write it out to a different filename. ASE uses the extended xyz format by default when writing files with the .xyz extension. As for the lattice, I think it's automatically padded to cell extent + 10 Å, but you might want to verify this.

The json files in the reference data are only for the pure C++ implementation (when you need to run without Python), so I don't think you need those.

UnixJunkie commented 2 years ago

An 'ase.io.read' followed by an 'ase.io.write' does not create the missing unit cell. However, a comment line like this one was added for each molecule:

Properties=species:S:1:pos:R:3 CHEMBL405398_1=T pbc="F F F"

Luthaf commented 2 years ago

You can also add the cell/lattice manually when loading your data:

import ase.io

frames = ase.io.read("your-file.xyz", ":")  # ":" reads every frame in the file
for frame in frames:
    frame.cell = [100, 100, 100]  # set this to something big enough
    frame.center()  # center the atoms in the cell, wherever they started
    frame.pbc = [False, False, False]  # disable periodic boundary conditions

From here, you can use ase.io.write() to write an extended XYZ file with this cell information; or pass the frames directly to librascal.
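For readers who want to see what the conversion amounts to, here is a minimal sketch that does it without ASE, using only the standard library. It sizes an orthorhombic box to each molecule's bounding box plus a margin on every side, which is what the original question asked for. The function name `xyz_to_extxyz` and the `margin` parameter are made up for this illustration; the original comment line of each frame is dropped, and only the species and position columns are kept.

```python
def xyz_to_extxyz(text, margin=10.0):
    """Convert classic (possibly multi-frame) XYZ text to extended XYZ text.

    Hypothetical helper for illustration: each frame gets an orthorhombic,
    non-periodic cell equal to its bounding box plus `margin` on every side.
    """
    lines = text.splitlines()
    out = []
    i = 0
    while i < len(lines):
        if not lines[i].strip():
            i += 1  # skip blank lines between frames
            continue
        n_atoms = int(lines[i].split()[0])
        # lines[i + 1] is the free-form comment line; it is discarded here
        atoms = []
        for line in lines[i + 2 : i + 2 + n_atoms]:
            symbol, x, y, z = line.split()[:4]
            atoms.append((symbol, float(x), float(y), float(z)))
        mins = [min(a[k] for a in atoms) for k in (1, 2, 3)]
        maxs = [max(a[k] for a in atoms) for k in (1, 2, 3)]
        box = [hi - lo + 2 * margin for lo, hi in zip(mins, maxs)]
        # Lattice holds the three cell vectors, flattened row by row
        cell = (box[0], 0, 0, 0, box[1], 0, 0, 0, box[2])
        lattice = " ".join(f"{v:.6f}" for v in cell)
        out.append(str(n_atoms))
        out.append(
            f'Lattice="{lattice}" Properties=species:S:1:pos:R:3 pbc="F F F"'
        )
        for symbol, x, y, z in atoms:
            # shift so the molecule sits `margin` away from every cell face
            shifted = [c - lo + margin for c, lo in zip((x, y, z), mins)]
            out.append(f"{symbol} " + " ".join(f"{c:.6f}" for c in shifted))
        i += 2 + n_atoms
    return "\n".join(out) + "\n"


water = "3\nwater\nO 0.0 0.0 0.0\nH 0.0 0.757 0.586\nH 0.0 -0.757 0.586\n"
converted = xyz_to_extxyz(water, margin=10.0)
print(converted)
```

A fixed 100 Å cube as in the snippet above works just as well for isolated molecules; the per-molecule box only keeps the cell as small as the margin allows.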

UnixJunkie commented 2 years ago

Thanks a lot, I'll try this and let you know how it goes. Is there a users' mailing list for librascal? Maybe the bugtracker is not the best place for my beginner's questions.

UnixJunkie commented 2 years ago

One problem if I pass each frame directly to librascal is that the number of SOAP features varies, whereas I would like all my molecules to have the same number of SOAP features (though I don't know in advance what the dimensionality should be). I'll try working on a bigger computer, so that all molecules and their SOAP features can fit in memory.

ceriottm commented 2 years ago

You can manually specify the pool of chemical elements. I can't remember the syntax off the top of my head, but I believe there are examples. Something like global_species: [1,3, ....]

UnixJunkie commented 2 years ago

Even if I pass the global_species parameter to SphericalInvariants, the number of SOAP features is still varying:

# those are (num_atoms, num_SOAP_features) of the molecules I am reading in
Data matrix: (29, 2520)
Data matrix: (48, 3528)
Data matrix: (36, 2520)
Data matrix: (27, 2520)
Data matrix: (79, 2520)
Data matrix: (80, 2520)
Data matrix: (152, 3780)
Data matrix: (144, 1512)

N.B.: not all molecules have the same chemical composition; I am not working with frames from an MD simulation, just distinct isolated molecules.

Luthaf commented 2 years ago

Can you share your full input & hyper-parameters?

> Is there a users' mailing list for librascal? Maybe the bugtracker is not the best place for my beginner's questions.

That's fine for now, we don't have a lot of traffic. Otherwise, the discussion page on this repository would also be a good place for questions: https://github.com/lab-cosmo/librascal/discussions/categories/q-a

UnixJunkie commented 2 years ago

I'll share my test code as a PR.

UnixJunkie commented 2 years ago

Cf. https://github.com/lab-cosmo/librascal/pull/410

UnixJunkie commented 2 years ago

You should be able to run it and get the following output:

./soap_test.py > test.out
Data matrix: (17, 2520)
Data matrix: (20, 1512)
Data matrix: (12, 2520)
Data matrix: (14, 2520)
Data matrix: (20, 2520)
Data matrix: (13, 1512)
Data matrix: (18, 2268)
Data matrix: (20, 2520)
Data matrix: (15, 2520)
Data matrix: (25, 1512)
Data matrix: (19, 1512)
Data matrix: (19, 1512)
Data matrix: (21, 1512)
Data matrix: (18, 2520)
Data matrix: (18, 2520)
Data matrix: (15, 1512)
Data matrix: (21, 1512)
Data matrix: (20, 1512)
Data matrix: (13, 2520)
Data matrix: (20, 2520)

UnixJunkie commented 2 years ago

While I understand the number of atoms is varying, I don't understand why this is the case for the number of SOAP features.
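A quick arithmetic check on the feature counts reported in this thread hints at what is going on. All of them share a common divisor of 252, which is what one would expect if the power spectrum has one block of max_radial² × (max_angular + 1) values per species pair, and the number of species pairs differs from molecule to molecule. The hyper-parameters max_radial = 6 and max_angular = 6 below are a guess that happens to give 252; they are not stated anywhere in the thread.

```python
from functools import reduce
from math import gcd

# Distinct feature counts taken from the outputs posted in this thread
observed = [2520, 3528, 1512, 3780, 2268]

# Largest block size that divides every observed count
block = reduce(gcd, observed)

# Assumed hyper-parameters (not confirmed in the thread) that match this block
max_radial, max_angular = 6, 6
assumed_block = max_radial**2 * (max_angular + 1)

# Number of species-pair blocks each count would correspond to
pair_counts = [n // block for n in observed]

print("common block size:", block)
print("block from assumed hypers:", assumed_block)
for n, pairs in zip(observed, pair_counts):
    print(f"{n} features = {pairs} blocks of {block}")
```

Under that reading, 6, 10 and 15 blocks would correspond to all unordered pairs of 3, 4 and 5 species, while 9 and 14 suggest that only the pairs actually found in the environments are stored.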

UnixJunkie commented 2 years ago

I don't understand how I could compare two atoms using a kernel if the atoms are not encoded with vectors of the same length.
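The idea behind fixing a global pool of species can be sketched as follows: every atom's features are laid out over the same ordered list of species pairs, with zeros for pairs that do not occur, so that all vectors have the same length and a kernel is well defined. This is only an illustration of the principle; the dictionary layout and the names `fixed_layout` and `linear_kernel` are hypothetical and are not librascal's internal representation or API.

```python
from itertools import combinations_with_replacement


def fixed_layout(blocks, global_species, block_size):
    """Concatenate per-pair blocks in a fixed order, zero-filling absent pairs."""
    vector = []
    for pair in combinations_with_replacement(sorted(global_species), 2):
        vector.extend(blocks.get(pair, [0.0] * block_size))
    return vector


def linear_kernel(u, v):
    """Plain dot product between two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


# Toy data: blocks of size 2 keyed by (species, species) atomic numbers
species = [1, 6, 8]
atom_a = {(1, 1): [1.0, 0.0], (1, 6): [0.5, 0.5], (6, 6): [0.0, 2.0]}
atom_b = {(1, 1): [2.0, 1.0], (1, 8): [1.0, 1.0], (8, 8): [3.0, 0.0]}

vec_a = fixed_layout(atom_a, species, block_size=2)
vec_b = fixed_layout(atom_b, species, block_size=2)
similarity = linear_kernel(vec_a, vec_b)

print(len(vec_a), len(vec_b), similarity)
```

Only the (1, 1) block is present in both atoms, so it alone contributes to the kernel value; the other blocks multiply against zeros.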

UnixJunkie commented 2 years ago

Solved, thanks to feedback from the experts.