Output CIF PoC - Githubissues

danny305 commented 3 years ago

Here is a working version of writing a CIF file.

It works with --hetatm and --hydrogen but has not been thoroughly tested.

Let's get this production ready!

coveralls commented 3 years ago

Coverage remained the same at 97.477% when pulling d569e3088cb8a5658b8f456b29c2bccfaae07051 on danny305:feature/cif into 3f731757238d15b0d83bcb3c37432db0f3925c98 on mittinatten:feature/cif.

mittinatten commented 3 years ago

This looks great!

These are the things I can think of need to be there, before it can be merged, in order of decreasing importance (approximately):

It should be possible to use freesasa as a unix filter, i.e. output should be written to stdout by default
It would be best if we could avoid reading the file twice, i.e. store a reference to the input document somehow, as discussed earlier (https://github.com/mittinatten/freesasa/issues/53#issuecomment-788183999). A really simple solution would be to store a Map<filename, document> at file scope and then get the document back from there. Since this is the CLI and not the library, it's probably ok (i.e. we don't have to worry about edge cases such as several entries for the same filename, or filling up memory with too many documents). Although I would prefer something that doesn't depend on history and that doesn't rely on matching the filename (as in the suggestions linked in #53), we'll have to weight that against the simplicity of the dictionary-based-solution. We could maybe go for the simpler solution first, and then do something better later, time and inspiration permitting.
If the format supports it: add a section to the output mentioning the version of FreeSASA used, and possibly some summed up areas, i.e. at least the total SASA and probably also per chain. What are the guidelines for adding custom sections to CIF?
Add a few regression tests to the CLI test suite. Here we can for instance store an example output once we feel we're happy with the solution, and then use that for future reference in order to catch regressions.
Add residues with relative SASA values (related to 3, just less important).
anything else?

So 2. is the only complicated one, the rest should be pretty straight forward as far as I can tell.

mittinatten commented 3 years ago

A few more things:

We need to catch if someone tries to output PDB when the input is CIF and vice versa
Handle --separate-chains and --separate-models in the output? Not sure whats correct here. An easy way out is to disallow those two options to be used together with --format=cif

danny305 commented 3 years ago

This is really straight forward and have already implemented it just haven't pushed yet.
I will need to think about a few ways of implementing this but this is totally doable. For the PoC I was not focused on this and knew I would need to circle back to this. I am unsure on how to incorporate this into the C library without having to do major changes and bringing the library to C++. I am in favor of the simple solution first approach.
Yeah I need to spend sometime learning how to add an additional section. Totally doable just haven't made it that far yet on playing with gemmi. I will probably go the route of adding every layer of output in a freesasa section and leave the atom values in the atom_site loop.
I will probably need your help with this when I get around to addressing this point.
What do you mean by this? can you elaborate?

Ill try to start addressing 1, 2, and 3 this weekend.

danny305 commented 3 years ago

So --separate-chains and --separate-models work independently but not when both are supplied. So I agree that they need to not be allowed to be used together.

Ill see how to do this. I think I have seen some code where you disallow different pairs that I can just extend.

Yeah good point. The code needs a cif input for a cif output to work. The way to catch it is to require the --cif flag to be set in order to use --format=cif. Let me know if you think this is sufficient.

mittinatten commented 3 years ago

Your plan for handling illegal combinations of options makes sense.

Just to be clear, we allow

--cif --format=cif --separate-models I assume for this one and the next, we write the data for each atom, and we simply leave it up to the user to keep track which options were used to generate the output
--cif --format=cif --separate-chains
--cif --format=<not cif or pdb> --separate-chains --separate-models

We don't allow

--cif --format=cif --separate-chains --separate-models

For point 5 above: By relative SASA I mean the value that's relative to a reference value, which is for instance available in the RSA output format. If you look at "relative-area" in json.c (around line 109-112) you can see how it's done there. So this is an extension for 3. Also depending on what classifier is used, this may or may not be available. If it's not, the getter will simply return NULL.

For point 2: Let's try the simple solution first then, if something comes up we can always try doing something that's more robust and reentrant.

Btw, under 3, we should probably add a remark somewhere about which parameters were used to generate the output as well, similar to what's in the standard output (i.e. if what you get if you don't pass any --format option).

danny305 commented 3 years ago

Yo big boss Simon!

Sorry I been AWOL grad school life. Got pulled in other directions.

I implemented 1 and 2. I am starting to work on some of the other ones but I found a potential bug I need to discuss with you.

So when I try writing out a cif file for 2isk.cif or 1sui.cif: freesasa --cif --format=cif 2isk.cif.

I get an error when it gets to the first HETATM line: Did not find row for: Atom(1 H 502 [FNR] C9A) Looping through entire table... Still couldnt find: Atom(1 H 502 [FNR] C9A)

I get a similar error message for the first HETATM line in the 1sui.cif (Ca2+ ion) The error is due to the fact the wrong chain gets assigned to the atom (line:427 auto cName = freesasa_node_name(chain);) so it's looking for the atom and it does not find it in the cif file. So in the above error message the chain should be A not H, which is why it does not find the atom.

I am unsure if this is due to an implementation detail on how you make the tree structure and I am handling incorrectly or if its an actual bug.

Could you help me address it? Or point me in the best way to get the --hetatm flag working on writing a cif file?

I am going to push the implementations for 1 and 2. I am sure there are better ways to do it (your map suggestion) I just forgot your suggestion in between the time you suggested it and the when I got around to doing it. Looking forward to some feedback!

danny305 commented 3 years ago

I fixed that bug in my previous post by editing the freesasa atom node. I essentially just added the properties I needed for each atom to the atom_properties.

Last thing I need to do is look at relative SASA. I will look at your json.c implementation to see what is needed for that.

The travisCI build fails are weird. It says that it cannot find gemmi::cif::write_cif_block_to_stream but the function declaration is in #include <gemmi/to_cif.hpp> so I don't know what to do here.

I will clean up the code and get rid of print statements and then after that I think everything I have implemented so far is pretty much done. Please let me know if you find anything I overlooked or messed up on.

mittinatten commented 3 years ago

Great, sorry for the slow response, been preoccupied with work here too. I'll have a closer look at the code in the next couple of days, but at a first glance it looks clean and nice. I'll also see if I can reproduce the Travis error and figure out what's going on.

danny305 commented 3 years ago

I added an RSA block to the cif output file.

I am awaiting your feedback before I go in and clean up the code and remove print statements, etc.

Take your time! Just let me know!

mittinatten commented 3 years ago

I checked out your code and ran it on 2isk.cif. Seems to work well!

I noticed a couple of things, in addition to the comments above. It breaks if the input is stdin or if the file is in another directory i.e. if I call freesasa --cif --format=cif ./path/to/pdbs/2isk.cif.

The output section adds quite some computation time, which may or may not be a problem. I did a few runs on my machine and it seems to increase the total time by at least 25%. It could at least partially be that a lot more data is written. But another thing I noticed is that the way atoms are matched to rows is O(N^2) (N: number of atoms), as far as I can tell. Not sure how to solve that, and we can probably live with it for the first version, but do you now if Gemmi has an optimized search interface? Since this is a (usually) sorted list, it should be possible to speed it up significantly (but I don't think we should do it ourselves).

danny305 commented 3 years ago

Yeah so at first it was O(N^2). However, I start the row scanning by starting at the index of the previous atom. So usually the second for loop is O(1) (The next atom) and only if the following cif atom does not match the freesasa_node_atom does it become O(N^2) for that specific atom.

mittinatten commented 3 years ago

Ok, didn't see that. That's a clever solution, and certainly good enough. I guess the differences are due to I/O then. I saw the calculation was also significantly slower than when using the corresponding .pdb-file. I guess that's the cost of using a larger file format and a more robust parser.

mittinatten commented 3 years ago

I updated my feature/cif branch with the lates version of Gemmi, so if you merge that in the current Travis problem should disappear, although the other error with --separate-chains and --hetatms remains.

I guess you could cherry-pick that particular commit and not the one that adds the failing test to get passing builds here.

danny305 commented 3 years ago

I have fixed the bug preventing the writing of a cif file passed in as a relative file path and from stdin. So my current implementation works with any number of relative cif file paths but only 1 cif file can be read from stdin. The code will fail saying it cannot determine which gemmi document to use (if there are more than 1) since it doesn't know the pdb code of the stdin cif file. I hope this is good enough?

Let me know on what you want to do about the freesasa_rsa.source. I commented above.

I guess all that is left is the --separate-chains --hetatm bug you found here?

Is there anything else we need to address to finalize CIF functionality?

mittinatten commented 3 years ago

Looks like we're really close now! It seems the CLI tests for --separate-chains pass now, but something in the unit tests is broken. I am logging out for the day now, but can have a look later if it's not obvious to you.

mittinatten commented 3 years ago

The logs say out of memory error. I would guess there are some assumption in the tests for node.c that are not valid anymore and we are trying to strdup a null string or something.

danny305 commented 3 years ago

The error has something to do with freesasa_tree_add_result leading to a memory error? Im not sure what to do about this.

I think everything on my end is done. Let me know when you want me to clean up the code and refactor a bit.

mittinatten commented 3 years ago

The tests that break are part of the code that checks that if malloc fails, we get an error code back instead of a crash. Probably something goes wrong in the new code in node.c, ~~my hunch is that the new fields on lines 18-19 are const. Did a quick google, and seems you will run into trouble if you free a const pointer (don't understand why)~~ I can try to look into i later today. Otherwise I think we are only a clean up away from a merge (I will do som more test runs later today).

mittinatten commented 3 years ago

I figured it out, the two new fields in the atom node need to be explicitly initialized to NULL just like .pdb_line. As it is now, if allocation fails and then go to the cleanup code, we will sometimes be freeing non-null pointers. I made a PR to your branch, so now we have a PR to a PR to a PR 😵

danny305 commented 3 years ago

If you'd like I can refactor the code so the pointers are initialized toNULL and then if it fails go to the cleanup code? Or you can just take care of it in your PR of a PR of a PR lol. Let me know!

mittinatten commented 3 years ago

Yes that's what's in the PR :)

mittinatten commented 3 years ago

The build passes on my local branch now btw, but I guess you would have to activate a Travis account to be able to have the same feedback on your fork.

danny305 commented 3 years ago

Can you better explain to me the functional difference of initializing the pointer with NULL instead? I don't understand why that matters in C.

danny305 commented 3 years ago

So I think all that is left is for me to refactor/tidy up a bit and remove std::cout statements littered throughout the PR. Ill make a final commit sometime later today.

Let me know if anything else comes up!

mittinatten commented 3 years ago

So if we don't initialize them to NULL, the value will be whatever happened to be at that point in memory, so usually garbage - there are no default values. In this case we have three memory allocations, if any of them fails we free the memory allocated so far. Calling free(NULL) is a noop, so that's fine, but calling free("random adress") leads to an abort or seg fault depending on your system (unless your garbage address happens to be NULL or something the process owns that has been malloced). So if the first strdup fails, we free all three, the first pointer will be NULL because of the error, but the other will not, if not initialized, and will thus cause problems when freed.

mittinatten commented 3 years ago

I think after tidying we can merge this to the main PR, and I'll run through the larger set of PDBs again and do one last review of the code as a whole to check that everything's consistent. I think we can go to beta then, so people can start using it, and we'll have time to fix minor things that may come up before the final release. (For instance why we need to store residue number on the atom node).

danny305 commented 3 years ago

Okay I'll get a cleaned up commit sometime today!

So just to be clear: what we needed in the atom node was the correct chain name. I added the residue name and residue number simply to be complete and have all the information in one place--the atom node.

The real bug was that some hetatm atoms are under the wrong parent chain node so they could not be located when parsing the cif file during the cif writing.

danny305 commented 3 years ago

Thanks for the explanation! It makes total sense!!

mittinatten commented 3 years ago

Nice! 👍 Can't say for sure when I'll do the final review, possibly during the weekend, depending on the weather 🌞 🌧

mittinatten / freesasa

Output CIF PoC #59