mittinatten / freesasa

C-library for calculating Solvent Accessible Surface Areas
http://freesasa.github.io/
MIT License
103 stars 37 forks source link

fixed bug: unable to read cif that deviates from the pdb_id.cif format. #62

Closed danny305 closed 2 years ago

danny305 commented 3 years ago

This should close issue #61

It also addresses the issue where sometimes the freesasa_node_atom_residue_number(atom) returns a decimal string. I encountered this when I passed a cif file that I got as an output from the ChargeFW2 repo. I didn't realize that I patched that as well in this PR until now. Let me know if you would like that as a separate PR.

mittinatten commented 3 years ago

Hi, As I mentioned I am afk all of July, so excuse any silly errors below, I am on my phone.

I wonder how this will work if one is analyzing several files with the same source, e.g. 1abc_a.cif and 1abc_b.cif, which are modified somehow and FreeSASA is used to compare them. In the CLI this doesn’t matter, because we only have one file at a time. But in a python application, for instance, we could easily have many files. It is also an arbitrary and surprising feature that filenames need to start with the PDB code, which is not always the case, for instance if they are snapshots from a simulation, the name might be just a timecode (I guess that would have caused problems before too).

When we first added this we discussed reentrant versions, but landed on the simpler solution that works fine with the CLI. I think we should probably revisit the reentrant ideas. I think adding an optional field to the struct freesasa_structure with a pointer to the cif doc should work and be simple. Add a function freesasa_structure_set_cif_doc(structure, doc-pointer) which is called once before or after the atoms are added and then extracted when writing the output using freesasa_structure_get_cif_doc(structure).

Regarding the residue numbers, I am not sure they are always numbers? Don’t remember, are insertion codes a separate field or can it be part of the residue number? I think this part should be a separate PR since the other part might need some work still. (Not sure I’ll be able to review it properly until I’m back home).

ons 30 juni 2021 kl. 22:38 skrev Danny Diaz @.***>:

This should close issue #61 https://github.com/mittinatten/freesasa/issues/61

It also addresses the issue where sometimes the freesasa_node_atom_residue_number(atom) returns a decimal string. I encountered this when I passed a cif file that I got as an output from the ChargeFW2 repo. I didn't realize that I patched that as well in this PR until now. Let me know if you would like that as a separate PR.

You can view, comment on, or merge this pull request online at:

https://github.com/mittinatten/freesasa/pull/62 Commit Summary

  • fixed bug--unable to read cif that deviates from the pdb_id.cif format.

File Changes

Patch Links:

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/mittinatten/freesasa/pull/62, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAVV4HBBGMCA2SRFAROM6RDTVN6DVANCNFSM47TC3SWA .

danny305 commented 3 years ago

I have refactored the code to do essentially what you said. The code no longer depends on the pdb code being part of the filename.

It was a bit more complicated than that. I had to edit the fresasa_structure and the freesasa_node structure properties. I was unsure on how to use gemmi types in your C code (structure.c and node.c) so instead I added a pointer to the filename as a char * and have a map on the C++ side that maps the filename to the gemmi doc.

I will update the PR today or tomorrow. Im cleaning it up first.

Insertion codes are separate fields from the residue number.

danny305 commented 3 years ago

I also refactored the code so it can now accept cif files that already have FreeSASA columns/data. Essentially, making the code idempotent.

danny305 commented 2 years ago

I have verified that it works with stdin.

I am unsure on how to edit your test. I know I did it once but that was a long time ago and I am focused on several other things here in the next few weeks.

Maybe you can pull request my pull request like you did in the past?? You could also tweak the pointer solution so only main has to deal with state.

I also wanted to ask you on how to make freesasa deal with cations? The cation values are always 0.000. I guess I can open up another issue for cations.

mittinatten commented 2 years ago

Sure, I'll have a go!

If the cations are hetatms, you should be able to include them using that option. But if they're some obscure element, the code might default back to radius 0.00. In that case, if you can find a sensible value for their radius, you can add it here: https://github.com/mittinatten/freesasa/blob/master/src/classifier.c#L885

mittinatten commented 2 years ago

I had a go at this a few weeks ago and I think that unless the CIF document is stored somewhere at file scope, the pointer will be already be invalid when we're back at the output stage. So then we might as well keep the map from filename to doc as in your solution. This is not ideal, because in a long running process with more and more structures this will essentially be a memory leak. A short term solution can be to add some kind of reset function so that the calling code can clean up manually when appropriate. We could perhaps even consider the map a cache: if the input isn't found in the map (because it was cleaned up) we can load it from disc again.

There are probably better, less stateful solutions, where the code that generates the CIF doc and the freesasa_structure bundle them somehow, but that would require some refactoring (and might require converting more code to C++, not sure). There might be another type of pointer that we can use, but I know too little C++ for that.

So I think for the 2.x version we should go for the solution already in this PR, perhaps enhanced with cache functionality and at least a reset function.

danny305 commented 2 years ago

How do you want to split this? What exactly do you want me to do?

I will be traveling for the next 10 days and will not be able to work on this.

I am also unsure how this could lead to a memory leak? If we have a map and there is a long running process with lots of cif files then the map will just grow in size? I don't see how it leaks.

Also, what you are looking for is what is called a shared_pointer in C++.

mittinatten commented 2 years ago

Sorry if I was vague, I guess I just wanted some feedback on my thoughts 😄

I assume a shared_ptr won't work if the only reference is in C code?

It's not a memory leak in the strictest sense, I guess, since there are still references somewhere to the documents. But for external users of the library it's essentially a leak, since they have now way of keeping the memory use from growing over time.

For the cache idea to work, we would need to pass file paths instead of FILE pointers to the functions in cif.cc. So that they can check if the file is in the cache and get it if not. So I think will skip that for now.

I have sketched a solution, where I use unique integers instead of filenames, as references to the documents, which are generated inside cif.cc. There are two advantages with that:

  1. One less argument to pass around, i.e. the implementation is completely opaque, and calling code will be simpler, less chance of messing things up. It will also work if somehow two separate structures are read from stdin at different times.
  2. No calls to strdup(), i.e. slightly less complex.

I also added the function freesasa_cif_clear_docs() to allow clearing the store of documents if necessary.

For some reason I was allowed to push this to your branch, my intention was to push it as a PR to your PR, as agreed, sorry about that!

I think there are still some wrinkles here to be ironed out. But conceptually I'm more or less happy with this solution. I guess I could use your C++ expertise to make sure my changes don't have any unintended consequences.

Once we've agreed on a solution, I'll add the extra tests and we should be done.

mittinatten commented 2 years ago

And no stress if it takes a while before you're available, as you've probably noticed I have limited time to work on this myself at the moment.

danny305 commented 2 years ago

I haven't forgotten about you. Ill get back to you next week. Putting out fires over here.

danny305 commented 2 years ago

Yo Simon.

I plan on going over this tomorrow and reviewing the code. Since we are hours apart. Could you point me to exactly what I need to review? It has been some time and I don't want to miss it. I reread your comment above and am happy with the solution as well.

Ill message you after I review it.

mittinatten commented 2 years ago

Hi, so if you could check that the C++ I wrote is sound that would be great, and there's the PR to this branch on your fork that needs to merged, if you don't have any objections (https://github.com/danny305/freesasa/pull/3). Also, as mentioned there, idempotency is broken.

danny305 commented 2 years ago

I forgot to implement idempotency to the freesasa tables. I had just implemented it for the atom_site table. They are implemented now and its working.

danny305 commented 2 years ago

I think this feature is ready for a merge?

danny305 commented 2 years ago

By the way your C++ was really good. Nice job refactoring to enable the feature.

mittinatten commented 2 years ago

Hi, I'll have one final look during the weekend and merge. I also think we're set.

mittinatten commented 2 years ago

Merged. Thanks for this!