mittinatten / freesasa

C-library for calculating Solvent Accessible Surface Areas
http://freesasa.github.io/
MIT License
103 stars 37 forks source link

CIF: --separate-chains --separate-models implementations #55

Closed danny305 closed 3 years ago

danny305 commented 3 years ago

I got both --separate-chains and --separate-models working. I have tested it with a handful of PDBs so far. I essentially used your implementation for the pdb format (from structure.c) as my outline.

What has not yet been implemented yet is the n_chain_groups condition (lines 306 - 325). I refactored this code block a bit but I do not have a cpp/cif version for freesias_structure_get_chains.

I am unable to get the linker to find int freesasa_warn(const char *format, ...) from the util.c file. I am linking with libfreesasa.a so I do not know what is going on here. I even put the declaration in cif.hh. (I try using it in cif.cc line: 253)

My implementations clearly do not test all the possible edge cases and misallocation of memory like your original source code does. Mainly because I have never really thought about that stuff so I would like for you to take lead on these details. I want to get this from dev to production ready and learn from you as we do it.

mittinatten commented 3 years ago

Regarding linking freesasa_warn it should be solved if you add

#ifdef __cplusplus
extern "C" {
#endif

near the top of freesasa_internal.h and

#ifdef __cplusplus
}
#endif

near the end. (You can compare with the freesasa.h to see where it should be placed.)

This tells the C++ compiler that the functions listed are from a C library and should be linked differently (don't ask me about the details 😃 ).

mittinatten commented 3 years ago

Very nice that you got this far already! I'll do a thorough review soon. At a first glance there are two things though I wanted to mention.

  1. From the Travis logs (you can see them if you click the red cross next to the last commit) I see there is a seg-fault for --chain-groups.
  2. We should coordinate our indentation and brace styles, there is a .clang-format in the repo that the rest of the code follows. Can you apply that to your code too? (I have vscode setup to do it automatically on save)
danny305 commented 3 years ago

Yeah the --chain-groups is what I am referring to in my second point ^. (I do not have a cpp/cif version for freesias_structure_get_chains).

Its on the to-do list. I ill update the target branch probably today/tomorrow.

Ill give your extern C thing a go. Are you sure its the freesias_internal.h? the definition is in util.c. Can you elaborate?

mittinatten commented 3 years ago

Right, so when you compile the C++ code you just include the C header, the extern "C" declarations tells the compiler that these functions should be flagged for different linkage later when the library is linked in. The C++ compiler never touches util.c it just uses the declarations given in the header. (At least that's my understanding of it.)

I got the chain-groups to work with cif before, by just mapping to freesasa_structure and then reusing the old code, but leveraging the gemmi tools directly is probably a better solution in the long run. Especially if we anticipate a migration to gemmi also for PDB files eventually.

coveralls commented 3 years ago

Coverage Status

Coverage increased (+0.0007%) to 97.474% when pulling a22c07c5d6be288952d914d928d159d9171b5c10 on danny305:feature/cif into d1fd5a397823b4e973c8f0581dc91552d39938c4 on mittinatten:feature/cif.

danny305 commented 3 years ago

Simon. I did as you suggested to the freesias_internal.h file and it was able to link the function freesasa_warn().

I also fixed the memory error in --chain-groups. It was a typo on my end.

Next I will look into syncing up our indentation and brace style.

What else do we need to do for this PR? to merge the feature/branch to master?

mittinatten commented 3 years ago

Great! I will try to find time for a code review today, and then we can probably merge this PR.

There are at least two more things we need to do.

  1. Run the larger test set. I am getting close to having a runner ready, can probably wrap that up this week.
  2. Documentation. We should add the new feature to the man pages, the help message in the CLI and the online docs. Shouldn't be necessary to have more than one or two sentences either place.
danny305 commented 3 years ago

Okay I'll get that done tomorrow.

I can add a txt file to the repo tomorrow with several thousand protein PDB IDs and a python/shell script to pull down both the CIF and pdb files for a given protein. Do you have a preference if I use bash vs Python? Where would you like me to put this? I was thinking rootDir/dataset. Let me know if you prefer it else where.

What else can I do?

mittinatten commented 3 years ago

I have a script almost ready that does the downloading and test running, I just need to polish the error reporting a bit, so all I need is the list of PDB IDs

mittinatten commented 3 years ago

I learning F# at work right now, and running a sequence of commands and summing up the errors overlaps with a lot of the concepts we're discussing at the moment. So I decided to write the test runner in F# as an exercise, which means it takes a bit longer to write, but I hope it will result in pretty neat code.

danny305 commented 3 years ago

I'll get this to you in the morning. I'll just put the txt file in the root directory.

I'm going to spend sometime learning getopt and start working on editing the CLI for outputting a cif.

I'll also try to get cmake to build config.h based on the local system of the user.

I'll slowly implement as much as I can for the cmake build pipeline for whenever we are ready for that feature.

danny305 commented 3 years ago

Wow! I've never heard of F#. Sounds super cool. I'll get you that txt file here in the next few hours.

mittinatten commented 3 years ago

By the way, I'm not sure about cmake, the autotools setup works, and has the advantage that everything needed will be installed on most systems. It is a bit of a pain to maintain, but rarely needs to be changed. So don't go too far down that road before we've discussed it further. We can make it a separate issue anyway.

danny305 commented 3 years ago

On the CMake point, I am just in the process of learning cmake. So I have been using this as an opportunity to learn. There is no need to add it if its not needed. I just thought I would keep you in the loop on that front.

danny305 commented 3 years ago

So I addressed all the issues besides the last one. I added a few pdb/cif pairs to the tests/data directory. If you could help turn these into tests that would be great. I have manually compared to the cif/pdb pairs and they pass the eye test.

Im going to make 1 more commit tonight where I add a txt file of diverse pdb codes.

mittinatten commented 3 years ago

Regarding the CLI tests. The following line checks that the cli gives the same result for two variants of arguments (the last two strings)

assert_equal_total "$cli --chain-groups AB+CD -S -n 10" "$datadir/2jo4.pdb" "$datadir/2jo4.cif --cif"

So the new tests would add new lines where "$cli --chain-groups AB+CD -S -n 10" is replaced with "$cli --separate-chains -S -n 10" and the input files would be varied to use input files that have several chain, several models, and both several chains and models.

danny305 commented 3 years ago

I added the tests you suggested and found a bug in the code when there are multiple models of the same protein in the cif file.

I patched it but I don't like the solution. There is a lot of duplication in the structure_from* functions. I am going to see if I can refactor these into a template function or use variadic templates.

Nonetheless, the latest commit works.

mittinatten commented 3 years ago

Great! Just let me know when you are ready for a final review. My testrunner works now - I want to experiment a bit more with refactoring it before merging the code, for the sake of learning - but I'll run the tests before that, so we now where we stand with regards to finishing this feature.

danny305 commented 3 years ago

Its ready for your approval boss.

Unless there is some other stuff on this side of things. Im going to pivot to writing the cif file.