project-gemmi / gemmi

macromolecular crystallography library and utilities
https://project-gemmi.github.io/
Mozilla Public License 2.0
231 stars 46 forks source link

Issues with `gemmi::remove_ligands_and_waters` #290

Closed FilomenoSanchez closed 11 months ago

FilomenoSanchez commented 11 months ago

Hi @wojdyr, thanks again for the package and taking the time to take care of it. I'm running into issues when i use gemmi::remove_ligands_and_waters both in C++ and python, using version 0.6.3. When I read in the cif file I'm sending attached (more info about how I got this file at the end), I seem to be unable to remove ligands and waters. For context, this structure has two chains, chain A has 348 waters and chain B only consists of 2 ligands. Here's the code:

import gemmi
struct = gemmi.read_structure('./5a3h.cif')
print(struct[0][0]) # prints <gemmi.Chain A with 648 res>
print(struct[0][1]) # prints <gemmi.Chain B with 2 res>

# Now we remove ligands and waters. 
struct.remove_ligands_and_waters()

# Expected output now would be chain A to have 300 residues and chain B to be empty
print(struct[0][0]) # prints <gemmi.Chain A with 648 res>
print(struct[0][1]) # prints <gemmi.Chain B with 2 res>

If I inspect the residues in chain A, I can see water residues have correctly set the flag is_water to True. For instance:

import gemmi
struct = gemmi.read_structure('./5a3h.cif')

last_res = struct[0][0][-1]
print(last_res) # prints <gemmi.Residue 1248(HOH) with 1 atoms>
print(last_res.is_water()) # prints True

Also, gemmi::remove_waters seems to work fine (although of course it does not remove ligands, which is what I want):

import gemmi
struct = gemmi.read_structure('./5a3h.cif')
print(struct[0][0]) # prints <gemmi.Chain A with 648 res>
print(struct[0][1]) # prints <gemmi.Chain B with 2 res>

# Now we ONLY remove waters. 
struct.remove_waters()

# Expected output now would be chain A to have 300 residues and chain B to be empty
print(struct[0][0]) # prints <gemmi.Chain A with 300 res>
print(struct[0][1]) # prints <gemmi.Chain B with 2 res>

You can find the file here: 5a3h.zip. Regarding the origin of this file, it was created by writing a CIF file from a mmdb::Manager, here's the code:

int writeCIFASCII(int imol, const std::string &file_name) { 
   const auto mol = get_mol(imol);
   int ierr = mol->WriteCIFASCII(file_name.c_str());
   return ierr;
}

This issue is reproducible with all files generated in this way. I'm not sure whether this is a gemmi issue or if there's something wrong with the files produced with the above code, but visually inspecting the contents of the file I can't see anything wrong with them. Also the fact that residues that return True when running is_water() are kept even after running gemmi::remove_ligands_and_waters might indicate that there's something wrong with this function.

wojdyr commented 11 months ago

This cif file from MMDB is different from cif files from the PDB, and not conforming to the spec, in many ways. The particular problem here is that _atom_site.label_entity_id is set to 1 for all atoms. In mmCIF files, each so-called entity is explicitly assigned one of the types: polymer, branched, non-polymer, water. Here, this is missing, but since all residues are explicitly assigned the same entity_id, it means that they are all of the same type – gemmi assumes that this type is polymer. I'd need to add a sanity check to detect when entity_id is a dummy value.

As a workaround, you can call:

structure.add_entity_types(overwrite=True)

which re-assigns entity types. Then removing ligands/waters should work. The overwrite option was fixed recently, it might not work in 0.6.3.

Anyway, it'd be better to not use mmCIF files like this.

FilomenoSanchez commented 11 months ago

Thanks for the information. I agree that ideally it would be better to not use such mmCIF files but I have no other option at the moment. Regarding the suggested fix structure.add_entity_types(overwrite=True), is this in anyway similar to gemmi::setup_entities from v0.6.2? We are currently using gemmi::setup_entities with the newly created structures but the problem persists. Also the approach you suggested in pumble mmdb->gemmi->mmCIF->gemmi seems not to solve the issue. It might be that we need to wait until serialization format is available...

FilomenoSanchez commented 11 months ago

Hi @wojdyr, sorry to bother you with this again. I've got another file (also originating from mmdb) where the _atom_site.label_entity_id is set to 1 for atoms in residues, 2 for waters and 3 for the ligand, link to the file here. In this case, gemmi::remove_ligands_and_waters can remove the waters but it seems unable to remove the ligands? Any idea why?

wojdyr commented 11 months ago

add_entity_types(overwrite=True) overwrites entity types that were assigned before. setup_entities() calls add_entity_types(), but without the overwrite flag, which doesn't change the types that were assigned previously.

I changed now how mmcif is read: if entity types are not specified explicitly in the file, they are not assigned automatically inside read_structure(). The PDB has conventions in its files that distinguish polymers from non-polymers and branched entities: label_seq_id is null for all residues except polymers, and each non-polymer residue has separate label_asym_id. But other programs may not observe these conventions, so let's not rely on it.