Closed schallerdavid closed 2 years ago
Thanks! Pinging @dominiquesydow for visibility!
Just running a script that checks if all chain/alternate model combinations reported in KLIFS are also present in the pdb files. Seems like there are quite a few mismatches.
Hi @schallerdavid, cool thanks!
I will take a closer look next week but I think we can include this workaround also in the master
version.
We could call your _correct_errors
method as part of the _standardize_dataframe
method (for structure DataFrames only).
https://github.com/volkamerlab/opencadd/blob/8a03435dc2c3093b9f67e251aa0f57149e2c1328/opencadd/databases/klifs/core.py#L117
Will make a suggestion next week!
This is a systematic error, do not merge, since many structures are affected. This branch corrects all human ABL1 and EGFR structures, which I need right now.
Hi @schallerdavid,
In order to avoid maintaining parallel versions of opencadd
I suggest to integrate your manual curation (with a clear documentation/docstrings) into the master
branch, while the user can still decide if they want to turn the corrections on or not (default without curation).
I suggest the following changes. Disclaimer: I did not test the code, might be bugs in there.
Move the current _correct_errors()
addition from opencadd.databases.klifs.remote.Structures
to its parent class opencadd.databases.klifs.core.BaseProvider
(were we clean DataFrames).
That way you will be able to reach the remote AND local modules.
Rename _correct_errors()
to e.g. _drop_manually_curated_structures()
. Having the word "structures" in there will help us with code readability lateron.
Instead of repeating the same code block for each PDB ID, I suggest to add a method like:
def _drop_manually_curated_structure(structures, pdb_id, chain, alternate_model):
# Drop false structure
structure_ix_to_be_dropped = structures[
(structures["structure.pdb_id"] == pdb_id)
& (structures["structure.chain"] == chain)
& (structures["structure.alternate_model"] == alternate_model)
].index
structures = structures.drop(structure_ix_to_be_dropped)
# Update alternate model ID from letter to "-" if only one alternate model left
# after dropping false structure
structures_remaining = structures[
(structures["structure.pdb_id"] == pdb_id)
& (structures["structure.chain"] == chain)
]
if len(structures_remaining) == 1:
structure_ix_to_be_updated = structures_remaining.index
structures.loc[
structure_ix_to_be_updated, "structure.alternate_model"
] = "-"
return structures
In _drop_manually_curated_structures()
you can now add a list of false structure identifiers (PDB ID, chain , alternative model):
def _drop_manually_curated_structures(structures):
# This list could also be exposed to an external data file, e.g.
# opencadd/opencadd/databases/klifs/data/manually_curated_structures.csv
false_structures = [
("3cs9", "A", "B"), ("3cs9", "C", "B"), ("3cs9", "D", "B"), ("5cnn", "B", "B"), ...
]
for false_structure in false_structures:
structures = self._drop_manually_curated_structure(structures, *false_structure)
return structures
Add and use the flag manual_curation=False
in _standardize_dataframe()
after L205. That way you do not have to repeat this correction everywhere we return a structure.
if manual_curation:
structures = self._drop_manually_curated_structures(structures)
Expose the manual_curation=False
flag to all by_*
and all_structures
methods in the class opencadd.databases.klifs.core.StructuresProvider
. Will need updates in the remote/local pendant opencadd.databases.klifs.remote/local.Structure
classes.
Let me know what you think. Happy to also have a video call on this to go through these steps and discuss if we want this and if so who is doing it.
Note: All this is useful ONLY in case, curation in KLIFS itself will take a long time. So maybe let's see what the KLIFS maintainers say about this.
Thanks for your feedback @dominiquesydow,
I will send the email to Albert today including a script to detect the erroneous structures. I will get back to you once I get an answer on how fast they would be able to correct KLIFS. Would be happy to chat with you afterward.
Leaving an update here from offline discussions: We are waiting for KLIFS to update the database and will not integrate the workaround from this PR into the master
.
@schallerdavid can we close this PR?
Yes, since this branch will never be merged. However, the altlocs are still an issue in the current KLIFS db.
Description
I recently found an error in the KLIFS database concerning the structure 3cs9. In KLIFS this structure has alternate models for chains A, B, C and D. However, in the PDB file there are only alternate models for chain B. This error interferes with my KinoML modeling pipeline, in which I heavily rely on KLIFS data. In this PR I correct this error by adding a new function to the klifs module and I will use this branch until KLIFS was updated (already messaged Albert about it).
This is a systematic error in KLIFS affecting many structures. I corrected those now for human ABL1 and EGFR x-rays, so I can work on these kinases for now. I will report this next week to Albert.
Status