project-gemmi / gemmi

macromolecular crystallography library and utilities
https://project-gemmi.github.io/
Mozilla Public License 2.0
212 stars 44 forks source link

[Feature Request] take the advantage of QCP that getting RMSD value without calculating rotation matrix #148

Closed NatureGeorge closed 2 years ago

NatureGeorge commented 2 years ago

As https://theobald.brandeis.edu/qcp/ said:

In the QCP method, the RMSD is first evaluated by solving for the most positive eigenvalue of the 4×4 key matrix using a Newton-Raphson algorithm that quickly finds the largest root (eigenvalue) from the characteristic polynomial. The minimum RMSD is then easily caclulated from the largest eigenvalue. If only the RMSD is required (i.e., the rotation matrix is not needed), this method is orders of magnitude faster than the most efficient competing methods.

QCP method has the potential helpful feature that gemmi has not implemented yet. I thought gemmi.calculate_superposition's current_rmsd=True argument is for that usage, but it is not.

Relevant lines of code are shown below:

https://github.com/project-gemmi/gemmi/blob/eaa71fc2d3a081291c1c1e29d724dc5d278a5d57/include/gemmi/qcp.hpp#L118-L119

https://github.com/project-gemmi/gemmi/blob/eaa71fc2d3a081291c1c1e29d724dc5d278a5d57/include/gemmi/qcp.hpp#L188-L190

https://github.com/project-gemmi/gemmi/blob/eaa71fc2d3a081291c1c1e29d724dc5d278a5d57/include/gemmi/qcp.hpp#L318

It is easy and straightforward to achieve this feature. Just deal with min_score (e.g. set min_score = 10000 when users just want the RMSD value).

wojdyr commented 2 years ago

OK, I added function calculate_rmsd_of_superposed_positions() Does this address your request? If yes, could you test it and check how much faster it is? I suppose the difference is not that big, but I didn't measure it.

NatureGeorge commented 2 years ago

Appreciate your work! I will check it.

NatureGeorge commented 2 years ago

I modified some code to test (see https://github.com/naturegeorge/gemmi/tree/qcp). You are right. In most of my test cases, there are no significant differences.

Here is an example:

import gemmi
polymer1 = gemmi.read_structure('5a42.cif.gz')[0]['A'].get_polymer()
polymer2 = gemmi.read_structure('4rtd.cif.gz')[0]['A'].get_polymer()
ptype = polymer1.check_polymer_type()

%timeit -n 1000 gemmi.calculate_superposition(polymer1, polymer2, ptype, gemmi.SupSelect.CaP).rmsd
# 10.5 ms ± 67.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit -n 1000 gemmi.calculate_superposition(polymer1, polymer2, ptype, gemmi.SupSelect.CaP, transform=False).rmsd
# 10.5 ms ± 60.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit -n 1000 gemmi.calculate_superposition(polymer1, polymer2, ptype, gemmi.SupSelect.All).rmsd
# 10.9 ms ± 79.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit -n 1000 gemmi.calculate_superposition(polymer1, polymer2, ptype, gemmi.SupSelect.All, transform=False).rmsd
# 10.9 ms ± 76.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
NatureGeorge commented 2 years ago

But I also run a unit test that assumes transform=False would improve efficiency and the test pass in most runs.

#!/usr/bin/env python

import unittest
import timeit
import gemmi
from common import full_path

class TestQCP(unittest.TestCase):

    def test_qcp_rmsd(self):
        polymer1 = gemmi.read_structure(full_path('5y36.cif.gz'))[0]['A'].get_polymer()
        polymer2 = gemmi.read_structure(full_path('5f9r.cif.gz'))[0]['B'].get_polymer()
        ptype = polymer1.check_polymer_type()

        self.assertAlmostEqual(
            gemmi.calculate_superposition(polymer1, polymer2, ptype, gemmi.SupSelect.CaP, transform=False).rmsd,
            gemmi.calculate_superposition(polymer1, polymer2, ptype, gemmi.SupSelect.CaP, transform=True).rmsd
        )
        self.assertAlmostEqual(
            gemmi.calculate_superposition(polymer1, polymer2, ptype, gemmi.SupSelect.All, transform=False).rmsd,
            gemmi.calculate_superposition(polymer1, polymer2, ptype, gemmi.SupSelect.All, transform=True).rmsd
        )

        def cal_cap_rmsd_1():
            gemmi.calculate_superposition(polymer1, polymer2, ptype, gemmi.SupSelect.CaP, transform=False).rmsd

        def cal_cap_rmsd_2():
            gemmi.calculate_superposition(
                polymer1, polymer2, ptype, gemmi.SupSelect.CaP, transform=True).rmsd

        def cal_all_rmsd_1():
            gemmi.calculate_superposition(polymer1, polymer2, ptype, gemmi.SupSelect.All, transform=False).rmsd

        def cal_all_rmsd_2():
            gemmi.calculate_superposition(
                polymer1, polymer2, ptype, gemmi.SupSelect.All, transform=True).rmsd

        time_cap_1 = timeit.timeit(cal_cap_rmsd_1, number=1000)
        time_cap_2 = timeit.timeit(cal_cap_rmsd_2, number=1000)
        self.assertLess(time_cap_1, time_cap_2)
        time_all_1 = timeit.timeit(cal_all_rmsd_1, number=1000)
        time_all_2 = timeit.timeit(cal_all_rmsd_2, number=1000)
        self.assertLess(time_all_1, time_all_2)

if __name__ == '__main__':
    unittest.main()

Anyway, your previous version of gemmi.calculate_superposition now convinced me. It is fine to remain unchanged. Thanks for providing this helpful package.

wojdyr commented 2 years ago

I guess there would be a difference if you'd call superpose_positions() directly.

calculate_superposition() spends most of the time in sequence alignment. Sequence alignment is used to find corresponding atoms in two polymers. Superposing positions of the corresponding atoms is relatively quick.

Do you have a reason to try to optimize it? (in your benchmark it's only 10ms for chains with >1000 residues)

NatureGeorge commented 2 years ago

I managed to separate the alignment part and directly benchmark superpose_positions. It turns out that the difference is at the level of μs.

ca1, ca2 = gemmi.prepare_superposition(polymer1, polymer2, ptype, gemmi.SupSelect.CaP)
al1, al2 = gemmi.prepare_superposition(polymer1, polymer2, ptype, gemmi.SupSelect.All)

%timeit -n 1000 gemmi.superpose_positions(ca1, ca2).rmsd
# 219 µs ± 11.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit -n 1000 gemmi.superpose_positions(ca1, ca2,transform=False).rmsd
# 215 µs ± 5.62 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit -n 1000 gemmi.superpose_positions(al1, al2).rmsd
# 1.64 ms ± 7.15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit -n 1000 gemmi.superpose_positions(al1, al2,transform=False).rmsd
# 1.64 ms ± 11.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The reason for me to optimize it is that I constantly using calculate_superposition to calculate the RMSD but I do not need superpositions. I used to believe that the QCP method would be faster if the rotation matrix is not calculated. Now it seems that most of QCP's calculation expense is spent on calculating RMSD and for a compiled language (i.e. c++) the rest of the calculation is at a very low expense.