project-gemmi / gemmi

macromolecular crystallography library and utilities
https://project-gemmi.github.io/
Mozilla Public License 2.0
223 stars 45 forks source link

reading/writing CONECT records? #188

Closed rimmartin closed 1 year ago

rimmartin commented 2 years ago

Any interest in CONECT's? They are serial based. Maybe a starting point is saving the serial numbers back on the Atom struct?

HETATM's (water excluded) and what else get these explicit connections?

https://github.com/project-gemmi/gemmi/blob/master/include/gemmi/pdb.hpp#L444

Some writers repeat CONECT serial pairs a bond-order number of times to represent the order

wojdyr commented 2 years ago

CONECT records are not particularly useful and they don't have equivalent in mmCIF, so I planned to leave them out. What would you use it for?

rimmartin commented 2 years ago

Hi @wojdyr, work related. Treated similar to mol2 & sdf bond information. Moe for example writes out CONECT's and then read in to try to insure the connectivity is identical.

Pymol will pick up the redundancy and show bond order by number of bars between to atoms.

wojdyr commented 1 year ago

There is at least one more company that's adding CONECT records to pdb files written by gemmi, so I see that it'd be good to have it implemented. Although it's not clear to me what's the best way to store these records in gemmi. Where the information for CONECT comes from in your work?

rimmartin commented 1 year ago

Added a Conect type to Connection::Type enumeration and bond order property to Connection https://github.com/quantumbio/gemmi/blob/master/include/gemmi/metadata.hpp#L259 https://github.com/quantumbio/gemmi/blob/master/include/gemmi/metadata.hpp#L266

Added a conect_records=false to PdbWriteOptions so that currently by default CONECT's are not written https://github.com/quantumbio/gemmi/blob/master/include/gemmi/to_pdb.hpp#L23C8-L23C22

As the serial is incremented in to_pdb.cpp https://github.com/quantumbio/gemmi/blob/master/src/to_pdb.cpp#L265 I record it back to the atom's property at https://github.com/quantumbio/gemmi/blob/master/src/to_pdb.cpp#L289

With these changes the CONECT info can be stored by the calling app's code

     pdbWriteOptions.conect_records=true;
...
        residues->setIndex(atomsA->getResidueIndex());
        if(residues->getName().compare("HOH")==0)continue;
        connection.partner1.atom_name=atomsA->getName();
        connection.partner1.chain_name=residues->getStrandName();
        connection.partner1.res_id.name=residues->getName();
        if(!atomsA->getAlternateLocation().empty())connection.partner1.altloc=atomsA->getAlternateLocation().at(0);
        char ic=(!residues->getInsertionCode().empty()) ? residues->getInsertionCode().at(0): ' ';
        connection.partner1.res_id.seqid=gemmi::SeqId(residues->getSequence(), ic);

        residues->setIndex(atomsB->getResidueIndex());
        if(residues->getName().compare("HOH")==0)continue;
        connection.partner2.atom_name=atomsB->getName();
        connection.partner2.chain_name=residues->getStrandName();
        connection.partner2.res_id.name=residues->getName();
        if(!atomsB->getAlternateLocation().empty())connection.partner2.altloc=atomsB->getAlternateLocation().at(0);
        ic=(!residues->getInsertionCode().empty()) ? residues->getInsertionCode().at(0): ' ';
        connection.partner2.res_id.seqid=gemmi::SeqId(residues->getSequence(), ic);
        if(atomsA->getAtomicNumber()==16 && atomsB->getAtomicNumber()==16)
        {
            connection.type=gemmi::Connection::Disulf;
            structure.connections.push_back(connection);
        }
        else if(atomsA->isMetal() || atomsB->isMetal()){
            connection.type=gemmi::Connection::MetalC;
            structure.connections.push_back(connection);
        }
        else if(atomsA->getResidueIndex()!= atomsB->getResidueIndex()){
            connection.type=gemmi::Connection::Covale;
            structure.connections.push_back(connection);
        }
        else{
            connection.type=gemmi::Connection::Conect;
            connection.order=bonds->getOrder();
            structure.connections.push_back(connection);                                
        }

for appropriate bonds(the else case after the other enumeration types are decided) the new Connection::Type::Conect is set.

Then in to_pdb.cpp at https://github.com/quantumbio/gemmi/blob/master/src/to_pdb.cpp#L344

    if (opt.conect_records) {
      std::unordered_map<int, std::vector<int>> conectmap;
      for (const Connection& con : st.connections)
        if (con.type == Connection::Conect) {
          const_CRA cra1 = st.models[0].find_cra(con.partner1, true);
          const_CRA cra2 = st.models[0].find_cra(con.partner2, true);
          // In special cases (LINKR gap) atoms are not there.
          if (!cra1.residue || !cra2.residue)
            continue;
          conectmap[cra1.atom->serial].push_back(cra2.atom->serial);
          conectmap[cra2.atom->serial].push_back(cra1.atom->serial);
          if(con.order>=2){
            conectmap[cra1.atom->serial].push_back(cra2.atom->serial);
            conectmap[cra2.atom->serial].push_back(cra1.atom->serial);
          }
          if(con.order>=3){
            conectmap[cra1.atom->serial].push_back(cra2.atom->serial);
            conectmap[cra2.atom->serial].push_back(cra1.atom->serial);
          }
        }
      for(auto ele : conectmap){

          snprintf_z(buf, 82, "CONECT%5s",
                encode_serial_in_hybrid36(ele.first).data());
          int columnOffset = 11;
          for(int c : ele.second){
            snprintf_z(buf+columnOffset, 82, "%5s",
                  encode_serial_in_hybrid36(c).data());
            columnOffset+=5;
          }
          buf[columnOffset] = '\n';
          os.write(buf, columnOffset+1);
      }
    }

is the CONECT output where the serial numbers are gathered for order redundancy and for both directions.

For example N(serial=1), HN2(serial=2)& HN3(serial=3)

CONECT    1    2    3
CONECT    2    1
CONECT    3    1

occur.

If a conect is a double bond order for example C1(serial=1) & C2(serial=2)

CONECT    1    2    2
CONECT    2    1    1

Other software reading this pdb may pick this up and know the order for example pymol displays the double bond.

The calling app's code decides on what CONECT's are going to be defined; skip aa and dna standards and HOH and fill in nonstandard bonds.

Would be helpful if someone from the other company can join this discussion with you. We are running successfully with this implementation but can adjust to any changes you might decide

wojdyr commented 1 year ago

I've been thinking about different ways of storing data for CONECT:

  1. Similarly to how it's stored in mmCIF (PDB NextGen archive contains category chem_comp_bond which was added because people were missing the information from CONECT). This would mean that gemmi would determine CONECTs from _chem_comp_bond and _struct_conn. But since in computational chemistry, unlike in wwPDB, CONECT provides also bond order information, and kekulization is out of the scope of gemmi, this wouldn't be particularly useful.

  2. As you proposed in the comment above and in #274, i.e. using AtomAddress.

  3. Using serial numbers only. The calling app would need to do something similar to:

    gemmi::assign_serial_numbers(structure, /*numbered_ter=*/true);
    for (...)
      structure.add_conect(serial1, serial2, bond_order);
    PdbWriteOptions options;
    options.preserve_serial = true;
    options.conect_records = true;
    structure.write_pdb(path, options);

    Would 3 work for you?

rimmartin commented 1 year ago
  1. would work as long as the serial numbers applied in writing the ATOM or HETATM records are corresponding and available at the right time?
rimmartin commented 1 year ago

And concerning leaving standard amino's, dna's without explicit CONECT's? 3. would be for only nonstandard connections/bonds? When an app creates all connections/bonds they can be time consuming to parse and get them all related back to the atoms again; as systems go toward 150,000 atoms and above

rimmartin commented 1 year ago

And I agree with keeping it readily compatible with mm cifs. We are applying those thru gemmi more and more

wojdyr commented 1 year ago

I tentatively implemented 3. CONECT records will contain whatever you add with add_conect(). It's symmetric, so it adds A-B and B-A. There is also add_conect_one_way(), but it's probably not needed. Let me know if it's not a particularly convenient solution.

rimmartin commented 1 year ago

Had to think about it from different perspectives.

My way comes from the situation that our backbone(our gemmi::Structure) existed and grew since 2007. And I'm bridging over to gemmi::Structure.

As I rethought about it what you are implementing is correct for gemmi and new users which is the better way to go. An app created since gemmi's existence can just get going directly using gemmi::Structure. This might be the case for the other company you mentioned depending when they started coding.

If my company ever gives me the time I plan to use gemmi::Structure inside our backbone classes:-) And extend(derive) the struct's and vectors for our properties designed for all the chemistry and algorithms we've implemented. It is actually quite doable because backbones are backbones and relating all the atoms, residues and chains follows.

I'll reclone your branch and begin testing(bridging CONECT's between our backbone and gemmi::Structure on our side)

Thank you

rimmartin commented 1 year ago

I've sync'ed with the latest gemmi and switched my app code to use your impl for CONECT's. Running successfully! Thank you

wojdyr commented 1 year ago

Good to know that it worked.

To make it complete, I added reading CONECT records.

Then, since I didn't want to copy CONECTs by default when PDB is read and written, I changed the default value of option PdbWriteOptions::conect_records to false. So if you have in your code:

options.preserve_serial = true;

you'll need to add also:

options.conect_records = true;