Open Amelie-Schreiber opened 11 months ago
@Amelie-Schreiber This is a fascinating application! Sorry for my late reply. Averaging makes sense, as does a robust average such as the median. I also tried robust linear regression: to compute (m, b) you can use RANSAC, for instance: https://scikit-learn.org/stable/auto_examples/linear_model/plot_robust_fit.html. The experiments in my paper did not show a significant difference, but your problem is completely different, so it might help.
Having a larger number of samples could also help. High dimension combined with a small sample size usually leads to instability.
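A minimal sketch of the RANSAC idea, assuming the slope and intercept (m, b) come from a straight-line fit of log persistence sums against log sample sizes. The function names `robust_line_fit` and `slope_to_dimension` are hypothetical and not part of `topology.py`, and the conversion `alpha / (1 - m)` is an assumption about how the slope maps to a dimension, so check it against the script you are using.

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor


def robust_line_fit(log_n, log_e, random_state=0):
    """Fit log_e ~ m * log_n + b with RANSAC and return (m, b).

    RANSACRegressor uses ordinary least squares as its default base
    estimator and refits it on the points it judges to be inliers.
    """
    X = np.asarray(log_n, dtype=float).reshape(-1, 1)
    y = np.asarray(log_e, dtype=float)
    ransac = RANSACRegressor(random_state=random_state)
    ransac.fit(X, y)
    m = float(ransac.estimator_.coef_[0])
    b = float(ransac.estimator_.intercept_)
    return m, b


def slope_to_dimension(m, alpha=1.0):
    """Assumed conversion from slope to dimension: alpha / (1 - m)."""
    return alpha / (1.0 - m)


if __name__ == "__main__":
    # Synthetic data: a clean line with one gross outlier.
    log_n = np.linspace(2.0, 7.0, 12)
    log_e = 0.5 * log_n + 1.0
    log_e[5] += 10.0
    m, b = robust_line_fit(log_n, log_e)
    print(m, b, slope_to_dimension(m))
```

Whether this helps depends on how many (log n, log E) points go into the fit: with only a handful of points, RANSAC has little room to separate inliers from outliers.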
Keep me posted, I'm curious. :)
Hi, I am using part of your `topology.py` script to calculate the persistent homology dimension for the embeddings of a protein language model (ESM-2). The dimension estimates show low error but seem to fluctuate quite a lot when I run the script multiple times on the same protein. Would it be beneficial to run it multiple times and average? What other strategies might stabilize the estimates? Below is my current script:

Averaging does seem to stabilize the estimates somewhat, but not as much as I would like. Any feedback on using your code for this purpose would be greatly appreciated!
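For comparing a plain average with a robust one over repeated runs, a rough sketch follows. `aggregate_estimates` is a hypothetical helper, and `estimate_fn` stands in for whatever zero-argument call produces one dimension estimate for a protein; neither comes from `topology.py`.

```python
import numpy as np


def aggregate_estimates(estimate_fn, n_runs=20):
    """Call estimate_fn n_runs times and summarize the results.

    Returns the median (robust to occasional bad runs), the mean,
    and the interquartile range as a measure of run-to-run spread.
    """
    values = np.array([estimate_fn() for _ in range(n_runs)], dtype=float)
    q1, med, q3 = np.percentile(values, [25, 50, 75])
    return {
        "median": float(med),
        "mean": float(values.mean()),
        "iqr": float(q3 - q1),
    }


if __name__ == "__main__":
    # Stand-in estimator: noisy values around 8 with rare large outliers.
    rng = np.random.default_rng(0)

    def fake_estimate():
        outlier = 20.0 if rng.random() < 0.1 else 0.0
        return 8.0 + rng.normal(scale=0.5) + outlier

    print(aggregate_estimates(fake_estimate, n_runs=50))
```

If the median and mean differ noticeably, a few outlying runs are driving the average, and reporting the median together with the spread is likely the more stable summary.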