volkamerlab / opencadd

A Python library for structural cheminformatics
https://opencadd.readthedocs.io
MIT License
89 stars 18 forks source link

KLIFS corrections #73

Closed schallerdavid closed 2 years ago

schallerdavid commented 3 years ago

Description

I recently found an error in the KLIFS database concerning the structure 3cs9. In KLIFS this structure has alternate models for chains A, B, C and D. However, in the PDB file there are only alternate models for chain B. This error interferes with my KinoML modeling pipeline, in which I heavily rely on KLIFS data. In this PR I correct this error by adding a new function to the klifs module and I will use this branch until KLIFS was updated (already messaged Albert about it).

This is a systematic error in KLIFS affecting many structures. I corrected those now for human ABL1 and EGFR x-rays, so I can work on these kinases for now. I will report this next week to Albert.

Status

codecov-io commented 3 years ago

Codecov Report

Merging #73 (8a03435) into master (8a03435) will not change coverage. The diff coverage is n/a.

:exclamation: Current head 8a03435 differs from pull request most recent head 3fe0423. Consider uploading reports for the commit 3fe0423 to get more accurate results

jaimergp commented 3 years ago

Thanks! Pinging @dominiquesydow for visibility!

schallerdavid commented 3 years ago

Just running a script that checks if all chain/alternate model combinations reported in KLIFS are also present in the pdb files. Seems like there are quite a few mismatches.

dominiquesydow commented 3 years ago

Hi @schallerdavid, cool thanks!

I will take a closer look next week but I think we can include this workaround also in the master version.

We could call your _correct_errors method as part of the _standardize_dataframe method (for structure DataFrames only). https://github.com/volkamerlab/opencadd/blob/8a03435dc2c3093b9f67e251aa0f57149e2c1328/opencadd/databases/klifs/core.py#L117

Will make a suggestion next week!

schallerdavid commented 3 years ago

This is a systematic error, do not merge, since many structures are affected. This branch corrects all human ABL1 and EGFR structures, which I need right now.

dominiquesydow commented 3 years ago

Hi @schallerdavid,

In order to avoid maintaining parallel versions of opencadd I suggest to integrate your manual curation (with a clear documentation/docstrings) into the master branch, while the user can still decide if they want to turn the corrections on or not (default without curation).

I suggest the following changes. Disclaimer: I did not test the code, might be bugs in there.

  1. Move the current _correct_errors() addition from opencadd.databases.klifs.remote.Structures to its parent class opencadd.databases.klifs.core.BaseProvider (were we clean DataFrames). That way you will be able to reach the remote AND local modules.

  2. Rename _correct_errors() to e.g. _drop_manually_curated_structures(). Having the word "structures" in there will help us with code readability lateron.

  3. Instead of repeating the same code block for each PDB ID, I suggest to add a method like:

      def _drop_manually_curated_structure(structures, pdb_id, chain, alternate_model):
    
          # Drop false structure
          structure_ix_to_be_dropped = structures[
              (structures["structure.pdb_id"] == pdb_id)
              & (structures["structure.chain"] == chain)
              & (structures["structure.alternate_model"] == alternate_model)
          ].index
          structures = structures.drop(structure_ix_to_be_dropped)
    
          # Update alternate model ID from letter to "-" if only one alternate model left
          # after dropping false structure
          structures_remaining = structures[
              (structures["structure.pdb_id"] == pdb_id)
              & (structures["structure.chain"] == chain)
          ]
          if len(structures_remaining) == 1:
              structure_ix_to_be_updated = structures_remaining.index
              structures.loc[
                  structure_ix_to_be_updated, "structure.alternate_model"
              ] = "-"
          return structures
  4. In _drop_manually_curated_structures() you can now add a list of false structure identifiers (PDB ID, chain , alternative model):

    def _drop_manually_curated_structures(structures):
    
        # This list could also be exposed to an external data file, e.g. 
        # opencadd/opencadd/databases/klifs/data/manually_curated_structures.csv
        false_structures = [
            ("3cs9", "A", "B"), ("3cs9", "C", "B"), ("3cs9", "D", "B"), ("5cnn", "B", "B"), ... 
        ]
    
        for false_structure in false_structures:
            structures = self._drop_manually_curated_structure(structures, *false_structure)
    
        return structures
  5. Add and use the flag manual_curation=False in _standardize_dataframe() after L205. That way you do not have to repeat this correction everywhere we return a structure.

    if manual_curation:
        structures = self._drop_manually_curated_structures(structures)
  6. Expose the manual_curation=False flag to all by_* and all_structures methods in the class opencadd.databases.klifs.core.StructuresProvider. Will need updates in the remote/local pendant opencadd.databases.klifs.remote/local.Structure classes.

Let me know what you think. Happy to also have a video call on this to go through these steps and discuss if we want this and if so who is doing it.

dominiquesydow commented 3 years ago

Note: All this is useful ONLY in case, curation in KLIFS itself will take a long time. So maybe let's see what the KLIFS maintainers say about this.

schallerdavid commented 3 years ago

Thanks for your feedback @dominiquesydow,

I will send the email to Albert today including a script to detect the erroneous structures. I will get back to you once I get an answer on how fast they would be able to correct KLIFS. Would be happy to chat with you afterward.

dominiquesydow commented 3 years ago

Leaving an update here from offline discussions: We are waiting for KLIFS to update the database and will not integrate the workaround from this PR into the master.

dominiquesydow commented 2 years ago

@schallerdavid can we close this PR?

schallerdavid commented 2 years ago

Yes, since this branch will never be merged. However, the altlocs are still an issue in the current KLIFS db.