project-gemmi / gemmi

macromolecular crystallography library and utilities
https://project-gemmi.github.io/
Mozilla Public License 2.0
205 stars 42 forks source link

Issues with `gemmi::remove_ligands_and_waters` #290

Closed FilomenoSanchez closed 7 months ago

FilomenoSanchez commented 7 months ago

Hi @wojdyr, thanks again for the package and taking the time to take care of it. I'm running into issues when i use gemmi::remove_ligands_and_waters both in C++ and python, using version 0.6.3. When I read in the cif file I'm sending attached (more info about how I got this file at the end), I seem to be unable to remove ligands and waters. For context, this structure has two chains, chain A has 348 waters and chain B only consists of 2 ligands. Here's the code:

import gemmi
struct = gemmi.read_structure('./5a3h.cif')
print(struct[0][0]) # prints <gemmi.Chain A with 648 res>
print(struct[0][1]) # prints <gemmi.Chain B with 2 res>

# Now we remove ligands and waters. 
struct.remove_ligands_and_waters()

# Expected output now would be chain A to have 300 residues and chain B to be empty
print(struct[0][0]) # prints <gemmi.Chain A with 648 res>
print(struct[0][1]) # prints <gemmi.Chain B with 2 res>

If I inspect the residues in chain A, I can see water residues have correctly set the flag is_water to True. For instance:

import gemmi
struct = gemmi.read_structure('./5a3h.cif')

last_res = struct[0][0][-1]
print(last_res) # prints <gemmi.Residue 1248(HOH) with 1 atoms>
print(last_res.is_water()) # prints True

Also, gemmi::remove_waters seems to work fine (although of course it does not remove ligands, which is what I want):

import gemmi
struct = gemmi.read_structure('./5a3h.cif')
print(struct[0][0]) # prints <gemmi.Chain A with 648 res>
print(struct[0][1]) # prints <gemmi.Chain B with 2 res>

# Now we ONLY remove waters. 
struct.remove_waters()

# Expected output now would be chain A to have 300 residues and chain B to be empty
print(struct[0][0]) # prints <gemmi.Chain A with 300 res>
print(struct[0][1]) # prints <gemmi.Chain B with 2 res>

You can find the file here: 5a3h.zip. Regarding the origin of this file, it was created by writing a CIF file from a mmdb::Manager, here's the code:

int writeCIFASCII(int imol, const std::string &file_name) { 
   const auto mol = get_mol(imol);
   int ierr = mol->WriteCIFASCII(file_name.c_str());
   return ierr;
}

This issue is reproducible with all files generated in this way. I'm not sure whether this is a gemmi issue or if there's something wrong with the files produced with the above code, but visually inspecting the contents of the file I can't see anything wrong with them. Also the fact that residues that return True when running is_water() are kept even after running gemmi::remove_ligands_and_waters might indicate that there's something wrong with this function.

wojdyr commented 7 months ago

This cif file from MMDB is different from cif files from the PDB, and not conforming to the spec, in many ways. The particular problem here is that _atom_site.label_entity_id is set to 1 for all atoms. In mmCIF files, each so-called entity is explicitly assigned one of the types: polymer, branched, non-polymer, water. Here, this is missing, but since all residues are explicitly assigned the same entity_id, it means that they are all of the same type – gemmi assumes that this type is polymer. I'd need to add a sanity check to detect when entity_id is a dummy value.

As a workaround, you can call:

structure.add_entity_types(overwrite=True)

which re-assigns entity types. Then removing ligands/waters should work. The overwrite option was fixed recently, it might not work in 0.6.3.

Anyway, it'd be better to not use mmCIF files like this.

FilomenoSanchez commented 7 months ago

Thanks for the information. I agree that ideally it would be better to not use such mmCIF files but I have no other option at the moment. Regarding the suggested fix structure.add_entity_types(overwrite=True), is this in anyway similar to gemmi::setup_entities from v0.6.2? We are currently using gemmi::setup_entities with the newly created structures but the problem persists. Also the approach you suggested in pumble mmdb->gemmi->mmCIF->gemmi seems not to solve the issue. It might be that we need to wait until serialization format is available...

FilomenoSanchez commented 7 months ago

Hi @wojdyr, sorry to bother you with this again. I've got another file (also originating from mmdb) where the _atom_site.label_entity_id is set to 1 for atoms in residues, 2 for waters and 3 for the ligand, link to the file here. In this case, gemmi::remove_ligands_and_waters can remove the waters but it seems unable to remove the ligands? Any idea why?

wojdyr commented 7 months ago

add_entity_types(overwrite=True) overwrites entity types that were assigned before. setup_entities() calls add_entity_types(), but without the overwrite flag, which doesn't change the types that were assigned previously.

I changed now how mmcif is read: if entity types are not specified explicitly in the file, they are not assigned automatically inside read_structure(). The PDB has conventions in its files that distinguish polymers from non-polymers and branched entities: label_seq_id is null for all residues except polymers, and each non-polymer residue has separate label_asym_id. But other programs may not observe these conventions, so let's not rely on it.