primaryodors / primarydock

PrimaryOdors.org molecular docker.
Other
6 stars 4 forks source link

Compiler compatibility/std::vector/performance testing. #32

Open objarni opened 2 years ago

objarni commented 2 years ago

I'm having trouble running make on my Mac machine, and after squashing a lot of small issues, I started reading on the va_args behaviour/specification in the standard, and while googling found this interesting article:

https://www.linkedin.com/pulse/modern-c-variadic-functions-how-shoot-yourself-foot-avoid-zinin/

(The actual build error was squashed by adding -Wno-non-pod-varargs to CFLAGS in makefile, but it feels wrong to sweep a semantic issue/concern like this under the carpet)

Is this function essential to the code base? I notice there are at least 1 more version of it?

BTW - My "long-term" intention with getting up regression/automatic tests is to enable refactoring to std::vector instead of so much pointers/arrays, i.e. using modern C++ constructs instead (it was one of the first things I read about podock, and felt I could contribute to). It would mean passing a variable number of objects/values to a function would be much simplified - just pass a (reference) to a vector object - so while I like va_args it seems a bit over-the-top.

electronicsbyjulie commented 2 years ago

I'll remove va_args from the code. Fortunately, it is only being used in one section of code (lines 216 and 232 of main() in metal.cpp) so it will be easy to live without. I intend to keep podock's dependencies as nonspecific as possible so that it can be compiled on the widest range of systems and platforms, and it sounds like va_args might not be good for that.

electronicsbyjulie commented 2 years ago

Since automated tests do not have to run on my web host, it is certainly okay to set them up, with CMake and everything, in some way disconnected from the makefile. Not sure if you'd want to do that before converting the code to std::vector, but...

I looked at std::vector a while back and it looks promising, I just couldn't figure it out. If it will solve the problems that come from using arrays of pointers, it'd certainly be great to have valgrind not complain about mismatched deletes anymore. :D It would be best to do the std::vector in a separate branch so that the main branch stays a stable version, then once podock itself and the various tests check out okay in the std::vector branch, it can be merged.

By the way, podock itself intentionally has randomness in it, so that could interfere with using text diff tests on it. Mathematically the randomness allows better access to a very large set of possible molecular conformations that would take a very long time to compute thoroughly, so by randomizing the attempted conformations, each run is less likely to fail to find good configurations, and if a run doesn't produce any good output, it can be repeated for another try. I think the other dockers do this as well.

It's easy to see visually whether podock is working, but programmatically would be more of a challenge. There would ideally be a module that makes sure the ligand is mostly within the binding pocket, that its parts are near features of the protein that they would bind to, and that the breakdown of energies correlates to these binding features and the numbers add up to the total. It would be a pass/fail based on each metric falling within a defined threshold.

Good catch on the va_args!

objarni commented 2 years ago

Thanks for thourough answers!

I gave up on compiling on MacBook Air M1 - was too many errors for now. I did spot some odd things though - I believe it's clang that is more picky than g++ (which I'm guessing you're using?) - and I might fix them even if it compiles with g++. E.g. initializing arrays with ={}, a construct which was new to me. I don't think it's legal/standard C++ but rather some g++ wizardry.

So I'm back at my Ubuntu machine, which is g++.

W.r.t vector, yes it's possible to pass around pointers, but even better is using shared_ptr which makes for a simple 'reference counting' system, with high performance. No more new/delete/alloc/free calls in the code base, so valgrind should be quiet after that!

While I agree we need a stable version (preferrably at all times), I'm not so keen on doing a separate branch, since it would have to be merged/rebased continously as it's a large refactoring.

I'd much rather make small, iterative changes to the code, switching one array for vector at a time.

But before those small-refactoring-steps can be made, I want to have:

1) automatic tests that I trust (the diff regressions / reports we have now might not be enough; in fact what you've written above about the metrics sounds much more robust! Visual testing is harder, but doable - e.g rendering to ASCII art) 2) code coverage close to 100%, to increase trust

The code coverage part is dead simple with CMake based project in CLion (my IDE of choice for C/C++ code bases), so that would be one way to do it. But it is still possible with simple make and gcov. So I see it as a bit of a challenge/fun to get coverage without adding CMake. :)

It is good to know, however, that adding a CMakeLists.txt to the repo wouldn't interrupt your web host / deployment environment! In case I give up ;)

objarni commented 2 years ago

Oh, the ={} syntax is actually valid C++ since C23:

int a[3] = {}; // valid C++ way to zero-out a block-scope array; valid in C since C23

Found here: https://en.cppreference.com/w/c/language/array_initialization

Pardon my ignorance!

electronicsbyjulie commented 2 years ago

C23... interesting. I'm not sure what to do. In the interest of compatibility it probably should not depend on the C23 standard since the makefile specifies C++11.

It's definitely not good that the app won't compile on a Mac. :(

Would it work to use vectors of shared_ptrs? Since the code currently uses types like Atom** and AminoAcid***.

objarni commented 2 years ago

std::vector<std::shared_ptr<Point>> is a valid type in C++, yes! And for some small structs/classes, like Point, it makes sense to simplify to std::vector<Point>, since the copying is cheaper than allocation/deallocation of Points (which still happens under the surface, it's just that new/delete/malloc/free isn't visible in the code base anymore). BUT: I would wait with making assumptions until I can measure the performance of a big simulation. Then, with the data from real perf.measures, try both and see.

electronicsbyjulie commented 2 years ago

A docking run would be a good simulation to use for measuring performance. Here's an archive file containing a PDB, an SDF, and a podock config file to run a docker test:

dock_sample.tar.gz

After extracting the contents to the podock folder, you can use the command ./podock test_TAAR8.config to run a calculation that should take somewhere between 5-30 minutes and should report the total run time at the end. (If it takes too long, you can decrease the "POSE" value in the .config file.) It also creates an output TAAR8_test.txt file that can be opened in the 3D viewer.

If we change the code to use std::vector, would it make sense to rename the Vector class? It's really a spherical coordinate construct, so maybe it could be something like SphericalVector or abbreviated to SphVec?

objarni commented 2 years ago

On the C23 vs C++11 regard, I think at least C++14 could be "assumed" nowadays. It is 2022 after all - 11 years since C++11 was released. But I'm fine with focusing on C++11 - it's a large leap compared to C++03!

objarni commented 2 years ago

I think it makes sense to rename Vector to something else, yes, and would suggest something similar to what wikipedia and your note above says - SphericalCoordinate or perhaps SCoord? I'm not a big fan of abbreviations in general, but for really basic/common types I'm OK with it...

objarni commented 2 years ago

I've managed to run the simulation, however don't know how to use the visualization web page. It would be fun to see result of simulation!

It takes about 69 seconds on my machine, with pose = 1. This is great material for perf.measure, not too long, but not too short either!

objarni commented 2 years ago

I added a testdata/ folder with the content of the archive, and added a performance test target to run it. Since the simulation outputs number of seconds, I think this is well enough to get started optimizing!

However, still stuck with the code-coverage issue; will hack on gcov solution for awhile, and if it is too much effort will switch to CMakeLists.txt based approach.

electronicsbyjulie commented 2 years ago

C++14 sounds good, I suppose it would be highly unusual to run anything as CPU intensive as a molecular docker on a system older than 5-10 years.

SCoord works for me. Even though it erases the distinction that it refers to relative coordinates while a Point refers to absolute coordinates, however sometimes the code uses a Point as a relative coordinate so I guess the distinction is not so important. The LocatedVector class can also be renamed, maybe it could be either LSCoord or similar.

The viewer.htm page can be opened in any browser directly as a local file (e.g. file:///home/path/to/podock/viewer.htm) or it is also available on the website at http://primaryodors.org/viewer.htm. There is a file control at top left, the kind used for an uploader although nothing is actually uploaded anywhere. It lets you select a local file which the page will then render in 3D.

Great to hear about optimization! I know podock is not as fast as AutoDock or some of the more streamlined dockers, e.g. LeDock. Any speed improvement would be super awesome and much appreciated!

electronicsbyjulie commented 2 years ago

Hope it's okay I renamed the issue.

objarni commented 2 years ago

Cool! 👯‍♂️

image

objarni commented 2 years ago

As per above discussion, switched to C++14. Also, ignoring gcov output files.

I hope it's OK to 'push commits' like this? I have a habit of doing rebase very often, so I don't need the extra hassle of pull-requests...

electronicsbyjulie commented 2 years ago

Sure! Long as nothing breaks. I also frequently push commits directly and, to be honest, sometimes I do break something, so it's not much of a worry. 😄

I just pulled the latest and rebuilt it from clean. It looks really good. There used to be a lot of compiler warnings and now there's just a few, it's great! Thank you for solving those!