vuqv / Entanglement_database

Entanglement information from AlphaFold structures

QC for entanglement #5

Closed vuqv closed 1 month ago

vuqv commented 2 months ago

After controlling for overall structure quality, we also need to control for the quality of each entanglement. For example, a sequence with thousands of residues can have very high overall quality while still containing low-quality regions, which AF models as disordered. Such a region can form a loop, which the GLN algorithm will then identify as an entanglement.

This is the second quality-control step for entanglements, after #4.

How do we do this? Quyen and Ed/Ian have two sets of criteria that do not agree with each other, so we test both and count the false rate for each set. The set of criteria with the lower false rate is better.

vuqv commented 2 months ago

Ed/Ian criteria

  1. np.mean(pLDDT of i, j and 3 residues along primary structure) > 60
  2. np.mean(pLDDT of k and 3 residues along primary structure) > 60
  3. np.mean(pLDDT of all residues within 4.5 Angstrom of heavy atom of k) > 60
  4. All 3 criteria must be satisfied to include an entanglement
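A minimal sketch of how the three Ed/Ian checks could be coded, not the implementation actually used. Several things here are assumptions: `plddt` is a NumPy array indexed by 0-based residue number, "3 residues along primary structure" is read as 3 residues on either side, the windows around i and j are pooled into a single mean, and `coords` / `atom_res` (heavy-atom coordinates and the residue index of each atom) are hypothetical inputs. The function names are illustrative only.

```python
import numpy as np

def window_indices(n_res, center, half_width=3):
    """Residue indices center - half_width .. center + half_width, clipped to the chain."""
    lo = max(center - half_width, 0)
    hi = min(center + half_width, n_res - 1)
    return np.arange(lo, hi + 1)

def neighbor_residues(coords, atom_res, k, cutoff=4.5):
    """Residues with any heavy atom within `cutoff` Angstrom of any heavy atom of k.

    Residue k itself is included (assumption), since its atoms are at distance 0.
    """
    k_atoms = coords[atom_res == k]
    # Pairwise distances between all atoms and the atoms of k: (n_atoms, n_k_atoms)
    dists = np.linalg.norm(coords[:, None, :] - k_atoms[None, :, :], axis=-1)
    close = (dists <= cutoff).any(axis=1)
    return np.unique(atom_res[close])

def passes_ed_ian(plddt, coords, atom_res, i, j, k, threshold=60.0):
    """Return True only if all three mean-pLDDT checks exceed the threshold."""
    n_res = len(plddt)
    # Criterion 1: mean pLDDT around the contact residues i and j (pooled, assumption)
    ij_idx = np.union1d(window_indices(n_res, i), window_indices(n_res, j))
    crit1 = plddt[ij_idx].mean() > threshold
    # Criterion 2: mean pLDDT around the crossing residue k
    crit2 = plddt[window_indices(n_res, k)].mean() > threshold
    # Criterion 3: mean pLDDT of the spatial neighbors of k
    crit3 = plddt[neighbor_residues(coords, atom_res, k)].mean() > threshold
    return bool(crit1 and crit2 and crit3)
```

Because all three checks are means, a single low-confidence residue can be averaged out by confident neighbors, while a locally poor environment around k fails criterion 3 even when the sequence window around k still passes.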

vuqv commented 2 months ago

Quyen, I believe your criteria are too strict.

For each entanglement, we have the result [(i, j, [k1, k2, ..., kn])]. Since i and j are the pair of residues in close contact, each is a single value. However, crossing events can occur multiple times, so there can be multiple k values.

Current Criteria:

  1. np.mean(pLDDT of region i-j) >= 70
  2. np.mean(k ± 3) >= 70
  3. pLDDT per-residue of [i, j, *[k]] >= 70

I believe these criteria are too strict. Specifically,

  • Criterion 1: the loop can include disordered regions, which lowers the average pLDDT of the loop, so this criterion is not reasonable.
  • Criterion 3: [i, j] and [k] should be treated separately. i and j must each have per-residue pLDDT > 70, because if either of the residues forming the contact is not confident, it significantly affects whether the loop is closed. Crossing residues, however, can be numerous, and rejecting an entanglement because any single crossing residue is of low quality would remove many potential entanglements.

Proposed Modification:

For the list of crossing residues, remove only those residues with low quality. If the remaining list is empty, then remove that entanglement. If the remaining list is not empty, it suggests that the crossing event can still be real.

Quyen's Revised Criteria (currently used):

  1. per-residue pLDDT of i >= 70
  2. per-residue pLDDT of j >= 70
  3. [plddt(x) for x in list_crossing_residues if plddt(x) > 70] is not empty
  4. All 3 criteria must be satisfied to include an entanglement
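The revised criteria can be sketched as below. This is an illustration under assumptions, not the code in the repository: `plddt` is taken to be a mapping from residue number to per-residue pLDDT, an entanglement is the tuple (i, j, [k1, ..., kn]) described earlier, and the function names are hypothetical. The comparison operators follow the list as written (>= 70 for i and j, > 70 for crossing residues).

```python
def filter_crossings(plddt, crossings, threshold=70.0):
    """Keep only crossing residues whose per-residue pLDDT is above the threshold."""
    return [k for k in crossings if plddt[k] > threshold]

def apply_revised_criteria(plddt, entanglement, threshold=70.0):
    """Return the entanglement with low-quality crossings removed, or None if rejected."""
    i, j, crossings = entanglement
    # Criteria 1 and 2: both contact residues must be confident on their own
    if plddt[i] < threshold or plddt[j] < threshold:
        return None
    # Criterion 3: at least one confident crossing residue must remain
    kept = filter_crossings(plddt, crossings, threshold)
    if not kept:
        return None
    return (i, j, kept)

# Toy values for illustration only (not real data)
toy_plddt = {10: 85.0, 50: 72.0, 20: 65.0, 30: 90.0}
result = apply_revised_criteria(toy_plddt, (10, 50, [20, 30]))
```

With the toy values, residue 20 is dropped from the crossing list and the entanglement is kept with residue 30 as its only crossing.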

By implementing these changes, we can maintain the integrity of the evaluation while allowing for more realistic entanglement detection.

vuqv commented 1 month ago

To evaluate the results, we looked at 40 randomly selected entanglements, 10 from each category:

  1. 10 entanglements that both Ed/Ian and Quyen say are valid
  2. 10 entanglements that both Ed/Ian and Quyen say are invalid
  3. 10 entanglements that Ed/Ian say are valid and Quyen says are invalid
  4. 10 entanglements that Ed/Ian say are invalid and Quyen says are valid

The Ed/Ian criteria give an accuracy of ~52%, while Quyen's criteria give 80%.

random_entanglement_selections.xlsx
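For reference, a rough sketch of how the four-category sampling and the accuracy comparison above could be reproduced. The thread does not say how accuracy was computed, so this assumes it is the fraction of sampled entanglements where a criteria set's verdict matches the manual verdict. The data structures (`verdicts`, `predicted`, `manual`) and function names are hypothetical.

```python
import random
from collections import defaultdict

def sample_by_agreement(verdicts, n_per_category=10, seed=0):
    """Group entanglements by (ed_ian_valid, quyen_valid) and sample from each group.

    verdicts: dict mapping entanglement id -> (ed_ian_valid, quyen_valid)
    Returns a dict mapping each verdict pair -> list of sampled ids.
    """
    groups = defaultdict(list)
    for ent_id, pair in verdicts.items():
        groups[pair].append(ent_id)
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    return {
        pair: rng.sample(sorted(ids), min(n_per_category, len(ids)))
        for pair, ids in groups.items()
    }

def accuracy(predicted, manual):
    """Fraction of manually checked entanglements where the criteria agree with the check.

    predicted: dict id -> bool verdict from one set of criteria
    manual:    dict id -> bool verdict from manual inspection
    """
    hits = sum(predicted[ent_id] == truth for ent_id, truth in manual.items())
    return hits / len(manual)
```

Calling `accuracy` once with the Ed/Ian verdicts and once with Quyen's verdicts, both against the same manual labels, would give two directly comparable numbers.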

vuqv commented 1 month ago

This has been solved! Good job, Quyen and Ian.