lhatsk / AlphaLink

AlphaLink: Integrating crosslinking MS data into OpenFold
Apache License 2.0

Over-weight of crosslinking data #11

Open Yan-Yan-2020 opened 1 year ago

Yan-Yan-2020 commented 1 year ago

Hi,

How can we deal with crosslinking data being over-weighted? I noticed that when there are many crosslinking restraints for one sequence, the final models look over-constrained and some well-folded domains look unstructured.

Thanks. Yan

grandrea commented 1 year ago

You can increase -Neff to downweight the crosslinks (see the influence of this in the supplementary figures of the AlphaLink paper) or remove MSA subsampling altogether. Alternatively, you can change the FDR value on the crosslinking restraints or flatten the shape of the distribution. Finally, you can run the prediction multiple times with different subsets of the restraints. I also encourage you to look carefully at the crosslinking MS data to ensure that error thresholding was done properly.

Yan-Yan-2020 commented 1 year ago

Great! Thank you so much! I'm trying these approaches to see how the models look.

Thanks. Yan

Yan-Yan-2020 commented 1 year ago

Hi, running multiple times with subsets of restraints would be a better solution in my case. Do you have a detailed workflow for it? Would you use the restrained model as the input for the next subset of restraints? How do you filter the restraints into a subset?

Thanks. Yan

lhatsk commented 1 year ago

Hi, This workflow is not implemented at the moment. What you could do is shuffle your links once they are loaded and pick a subset. E.g., like this (untested!):

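import numpy as np

np.random.shuffle(links)  # shuffle the loaded links in place
subset = 0.8  # fraction of links to keep
links = links[:int(len(links) * subset)]  # keep the first 80% after shuffling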

This should be inserted here: https://github.com/lhatsk/AlphaLink/blob/main/predict_with_crosslinks.py#L292
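If you want several different subsets for repeated runs, a seeded variant might be easier to reproduce. The following is only a sketch and is not part of AlphaLink: `sample_link_subsets` is a hypothetical helper, and it assumes the loaded links behave like a list or NumPy array with one link per entry.

```python
import numpy as np


def sample_link_subsets(links, fraction=0.8, n_runs=5, seed=0):
    """Draw n_runs random subsets, each holding `fraction` of the links.

    Hypothetical helper, not part of AlphaLink. Assumes `links` is a
    list or array with one link per entry (row).
    """
    rng = np.random.default_rng(seed)  # fixed seed makes the subsets reproducible
    links = np.asarray(links)
    n_keep = int(len(links) * fraction)
    subsets = []
    for _ in range(n_runs):
        # Sample row indices without replacement, then restore original order.
        idx = rng.choice(len(links), size=n_keep, replace=False)
        subsets.append(links[np.sort(idx)])
    return subsets
```

Each returned subset could then be used in place of `links` for one prediction run, and the resulting models compared across runs.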

If you want to filter or iteratively add restraints, it might make more sense to partition the links beforehand and use the newly created CSV files, which gives you more control over the subsets.
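Partitioning beforehand could look roughly like the sketch below (also untested against AlphaLink itself). `partition_restraints` is a hypothetical helper, and it assumes the restraint file has one restraint per line and no header row. Lines are copied unchanged, so the column layout does not matter here.

```python
import random
from pathlib import Path


def partition_restraints(csv_path, n_parts=3, seed=0, out_dir=None):
    """Split a restraint file into n_parts disjoint files of similar size.

    Hypothetical helper, not part of AlphaLink. Assumes one restraint per
    line and no header row; lines are written back unmodified.
    """
    src = Path(csv_path)
    out = Path(out_dir) if out_dir else src.parent
    out.mkdir(parents=True, exist_ok=True)

    # Read the non-empty lines and shuffle them reproducibly.
    lines = [ln for ln in src.read_text().splitlines() if ln.strip()]
    random.Random(seed).shuffle(lines)

    paths = []
    for i in range(n_parts):
        part = lines[i::n_parts]  # every n_parts-th line, starting at offset i
        part_path = out / f"{src.stem}_part{i}{src.suffix}"
        part_path.write_text("\n".join(part) + "\n")
        paths.append(part_path)
    return paths
```

Each resulting file can be passed to the prediction script in place of the full restraint file, or the files can be concatenated step by step to add restraints iteratively.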