danny305 closed this 3 years ago
This looks great!
These are the things I can think of need to be there, before it can be merged, in order of decreasing importance (approximately):
One idea is to keep a `Map<filename, document>` at file scope and then get the document back from there. Since this is the CLI and not the library, it's probably OK (i.e. we don't have to worry about edge cases such as several entries for the same filename, or filling up memory with too many documents). Although I would prefer something that doesn't depend on history and doesn't rely on matching the filename (as in the suggestions linked in #53), we'll have to weigh that against the simplicity of the dictionary-based solution. We could maybe go for the simpler solution first, and then do something better later, time and inspiration permitting.

So 2. is the only complicated one; the rest should be pretty straightforward as far as I can tell.
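The dictionary-based solution could be sketched roughly like this. Everything here is illustrative, not the actual freesasa code: `Document` is a hypothetical stand-in for `gemmi::cif::Document`, and the function names are made up.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for gemmi::cif::Document; illustrative only.
struct Document {
    std::string source;
};

// File-scope store: each input file is parsed once and its Document is
// reused later when the CIF output is written.
static std::map<std::string, Document> &document_store() {
    static std::map<std::string, Document> store;
    return store;
}

// Look the Document up by filename, creating (in real code: parsing) it on
// first access. Later lookups for the same filename return the same object.
const Document &get_document(const std::string &filename) {
    auto &store = document_store();
    auto it = store.find(filename);
    if (it == store.end())
        it = store.emplace(filename, Document{filename}).first;
    return it->second;
}
```

This relies on filenames being unique keys, which is exactly the simplification discussed above: fine for the CLI, questionable for the library.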
A few more things:
How should `--separate-chains` and `--separate-models` behave in the output? Not sure what's correct here. An easy way out is to disallow those two options from being used together with `--format=cif`.
I'll try to start addressing 1, 2, and 3 this weekend.
So `--separate-chains` and `--separate-models` work independently but not when both are supplied, so I agree that they should not be allowed together.
I'll see how to do this. I think I have seen some code where different pairs of options are disallowed, which I can just extend.
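For illustration, the mutual-exclusion check could look something like this. The `Options` struct and function name are hypothetical, not the actual CLI code; only the two rules discussed in this thread are encoded.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical option struct; the real CLI stores flags parsed from argv.
struct Options {
    bool cif = false;
    bool separate_chains = false;
    bool separate_models = false;
    std::string format = "log";
};

// Reject the combinations discussed in this thread: CIF output requires CIF
// input, and --separate-chains and --separate-models cannot both be combined
// with --format=cif.
void check_option_compatibility(const Options &opt) {
    if (opt.format == "cif" && !opt.cif)
        throw std::invalid_argument("--format=cif requires CIF input (--cif)");
    if (opt.format == "cif" && opt.separate_chains && opt.separate_models)
        throw std::invalid_argument(
            "--separate-chains and --separate-models cannot be used "
            "together with --format=cif");
}
```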
Yeah, good point. The code needs a CIF input for a CIF output to work. The way to catch it is to require the `--cif` flag to be set in order to use `--format=cif`. Let me know if you think this is sufficient.
Your plan for handling illegal combinations of options makes sense.
Just to be clear, we allow:

- `--cif --format=cif --separate-models` (for this one and the next, I assume we write the data for each atom, and simply leave it up to the user to keep track of which options were used to generate the output)
- `--cif --format=cif --separate-chains`
- `--cif --format=<not cif or pdb> --separate-chains --separate-models`

We don't allow:

- `--cif --format=cif --separate-chains --separate-models`
For point 5 above: by relative SASA I mean the value that's relative to a reference value, which is for instance available in the RSA output format. If you look at "relative-area" in `json.c` (around lines 109-112) you can see how it's done there. So this is an extension of 3. Also, depending on what classifier is used, this may or may not be available; if it's not, the getter will simply return `NULL`.
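As a minimal sketch of that calculation (the function name and the negative-value convention are illustrative, not the actual freesasa API): the relative value is the absolute area as a percentage of the reference, and a missing reference, mirroring the NULL getter case, has to be handled explicitly.

```cpp
// Relative SASA as a percentage of a per-residue reference value, as in the
// RSA output format. A null reference models the "getter returns NULL" case
// described above (e.g. a classifier without reference areas); we signal
// that with a negative return value here.
double relative_sasa(double abs_area, const double *reference) {
    if (reference == nullptr || *reference <= 0.0)
        return -1.0; // no reference available: relative value undefined
    return 100.0 * abs_area / *reference;
}
```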
For point 2: let's try the simple solution first then; if something comes up we can always try doing something that's more robust and reentrant.
Btw, under 3, we should probably also add a remark somewhere about which parameters were used to generate the output, similar to what's in the standard output (i.e. what you get if you don't pass any `--format` option).
Yo big boss Simon! Sorry I've been AWOL; grad school life pulled me in other directions.
I implemented 1 and 2. I am starting to work on some of the other ones but I found a potential bug I need to discuss with you.
So when I try writing out a CIF file for 2isk.cif or 1sui.cif with `freesasa --cif --format=cif 2isk.cif`, I get an error when it gets to the first HETATM line:
Did not find row for: Atom(1 H 502 [FNR] C9A)
Looping through entire table... Still couldnt find: Atom(1 H 502 [FNR] C9A)
I get a similar error message for the first HETATM line in 1sui.cif (the Ca2+ ion).
The error is due to the fact that the wrong chain gets assigned to the atom (line 427: `auto cName = freesasa_node_name(chain);`), so it looks for the atom and does not find it in the CIF file. In the above error message the chain should be A, not H, which is why it does not find the atom.
I am unsure if this is due to an implementation detail of how you build the tree structure that I am handling incorrectly, or if it's an actual bug.
Could you help me address it? Or point me in the best direction to get the `--hetatm` flag working when writing a CIF file?
I am going to push the implementations for 1 and 2. I am sure there are better ways to do it (your map suggestion); I just forgot your suggestion between the time you suggested it and when I got around to doing it. Looking forward to some feedback!
I fixed the bug from my previous post by editing the freesasa atom node: I essentially just added the properties I needed for each atom to `atom_properties`.
Last thing I need to do is look at relative SASA. I will look at your json.c implementation to see what is needed for that.
The Travis CI build failures are weird. It says that it cannot find `gemmi::cif::write_cif_block_to_stream`, but the function declaration is in `#include <gemmi/to_cif.hpp>`, so I don't know what to do here.
I will clean up the code and get rid of the print statements; after that, I think everything I have implemented so far is pretty much done. Please let me know if you find anything I overlooked or messed up.
Great, sorry for the slow response, been preoccupied with work here too. I'll have a closer look at the code in the next couple of days, but at a first glance it looks clean and nice. I'll also see if I can reproduce the Travis error and figure out what's going on.
I added an RSA block to the cif output file.
I am awaiting your feedback before I go in and clean up the code and remove print statements, etc.
Take your time! Just let me know!
I checked out your code and ran it on 2isk.cif. Seems to work well!
I noticed a couple of things, in addition to the comments above. It breaks if the input is stdin, or if the file is in another directory, i.e. if I call `freesasa --cif --format=cif ./path/to/pdbs/2isk.cif`.
The output section adds quite some computation time, which may or may not be a problem. I did a few runs on my machine and it seems to increase the total time by at least 25%. It could at least partially be that a lot more data is written. But another thing I noticed is that the way atoms are matched to rows is O(N^2) (N: number of atoms), as far as I can tell. Not sure how to solve that, and we can probably live with it for the first version, but do you know if Gemmi has an optimized search interface? Since this is a (usually) sorted list, it should be possible to speed it up significantly (but I don't think we should do it ourselves).
Yeah, so at first it was O(N^2). However, I start the row scanning at the index of the previous atom, so usually the inner for loop is O(1) (the next atom); only if the following cif atom does not match the freesasa_node_atom does the lookup become O(N) for that specific atom.
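The scanning strategy described above can be sketched like this; the names and the string-keyed rows are illustrative stand-ins, not the actual freesasa/Gemmi code.

```cpp
#include <string>
#include <vector>

// Look up the row matching `key`, starting at the index where the previous
// match left off. For inputs where atoms appear in the same order as the
// rows, each lookup hits on the first comparison (O(1)); only an
// out-of-order atom triggers a full O(N) scan via the fallback pass.
long find_row(const std::vector<std::string> &rows,
              const std::string &key,
              long start) {
    const long n = static_cast<long>(rows.size());
    // First pass: from the previous match forward.
    for (long i = start; i < n; ++i)
        if (rows[i] == key) return i;
    // Fallback: wrap around and scan from the beginning.
    for (long i = 0; i < start && i < n; ++i)
        if (rows[i] == key) return i;
    return -1; // not found
}
```

The caller keeps the returned index (plus one) as the `start` for the next atom, which is what makes the common case constant-time.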
Ok, didn't see that. That's a clever solution, and certainly good enough. I guess the differences are due to I/O then. I saw the calculation was also significantly slower than when using the corresponding .pdb-file. I guess that's the cost of using a larger file format and a more robust parser.
I updated my feature/cif branch with the latest version of Gemmi, so if you merge that in, the current Travis problem should disappear, although the other error with `--separate-chains` and `--hetatm` remains.
I guess you could cherry-pick that particular commit and not the one that adds the failing test to get passing builds here.
I have fixed the bug preventing the writing of a CIF file passed in as a relative file path or from stdin. So my current implementation works with any number of relative CIF file paths, but only one CIF file can be read from stdin. The code will fail, saying it cannot determine which gemmi document to use (if there is more than one), since it doesn't know the PDB code of the stdin CIF file. I hope this is good enough?
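The stdin limitation described above can be made concrete with a small sketch; the types and names are hypothetical (`Document` stands in for `gemmi::cif::Document`), not the actual implementation.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for gemmi::cif::Document.
struct Document {
    std::string name;
};

// Named files are keyed on their filename, but stdin input has no filename,
// so it can only be resolved unambiguously when it is the sole document in
// the store -- hence the "only one CIF file from stdin" restriction.
const Document &resolve_document(const std::map<std::string, Document> &store,
                                 const std::string &filename) {
    if (!filename.empty()) {
        auto it = store.find(filename);
        if (it != store.end()) return it->second;
        throw std::runtime_error("no document for file: " + filename);
    }
    if (store.size() == 1)
        return store.begin()->second; // lone document: must be the stdin one
    throw std::runtime_error("cannot determine which document belongs to stdin");
}
```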
Let me know what you want to do about `freesasa_rsa.source`. I commented above.
I guess all that is left is the `--separate-chains --hetatm` bug you found here? Is there anything else we need to address to finalize CIF functionality?
Looks like we're really close now! It seems the CLI tests for `--separate-chains` pass now, but something in the unit tests is broken. I am logging off for the day now, but can have a look later if it's not obvious to you.
The logs say `out of memory error`. I would guess there are some assumptions in the tests for `node.c` that are not valid anymore and we are trying to `strdup` a null string or something.
The error has something to do with `freesasa_tree_add_result` leading to a memory error? I'm not sure what to do about this.
I think everything on my end is done. Let me know when you want me to clean up the code and refactor a bit.
The tests that break are part of the code that checks that if malloc fails, we get an error code back instead of a crash. Probably something goes wrong in the new code in `node.c`; my hunch is that the new fields on lines 18-19 are const. Did a quick Google, and it seems you will run into trouble if you free a const pointer (I don't understand why). I can try to look into it later today. Otherwise I think we are only a clean-up away from a merge (I will do some more test runs later today).
I figured it out: the two new fields in the atom node need to be explicitly initialized to `NULL`, just like `.pdb_line`. As it is now, if allocation fails and we then go to the cleanup code, we will sometimes be freeing uninitialized pointers. I made a PR to your branch, so now we have a PR to a PR to a PR 😵
If you'd like, I can refactor the code so the pointers are initialized to `NULL` and then go to the cleanup code if allocation fails? Or you can just take care of it in your PR of a PR of a PR lol. Let me know!
Yes that's what's in the PR :)
The build passes on my local branch now btw, but I guess you would have to activate a Travis account to be able to have the same feedback on your fork.
Can you explain in more detail the functional difference of initializing the pointers to `NULL` instead? I don't understand why that matters in C.
So I think all that is left is for me to refactor/tidy up a bit and remove the `std::cout` statements littered throughout the PR. I'll make a final commit sometime later today.
Let me know if anything else comes up!
So if we don't initialize them to NULL, the value will be whatever happened to be at that point in memory, usually garbage; there are no default values. In this case we have three memory allocations; if any of them fails, we free the memory allocated so far. Calling `free(NULL)` is a no-op, so that's fine, but calling `free` on a random address leads to an abort or segfault depending on your system (unless your garbage address happens to be NULL or something the process owns that has been malloced). So if the first `strdup` fails, we free all three; the first pointer will be NULL because of the error, but the others will not be, if not initialized, and will thus cause problems when freed.
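The pattern described above can be made concrete with a small self-contained sketch. The struct and function are hypothetical, loosely mirroring the atom-node fields discussed in this thread, not the actual `node.c` code.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical mirror of the atom node fields: every pointer starts as
// nullptr, so the cleanup path can unconditionally free all of them even
// when a later allocation never ran. Without the initializers, the
// not-yet-assigned fields would hold garbage and free() would crash.
struct atom_properties {
    char *pdb_line = nullptr;   // already initialized in the existing code
    char *chain = nullptr;      // new field: must also start as nullptr
    char *res_number = nullptr; // new field: must also start as nullptr
};

// Plain strdup replacement so the sketch is self-contained.
static char *dup_string(const char *s) {
    char *copy = static_cast<char *>(std::malloc(std::strlen(s) + 1));
    if (copy) std::strcpy(copy, s);
    return copy;
}

// Returns false on allocation failure. On failure we free everything;
// freeing the pointers that are still nullptr is a safe no-op.
bool init_atom(atom_properties *a, const char *line,
               const char *chain, const char *num) {
    a->pdb_line = dup_string(line);
    a->chain = dup_string(chain);
    a->res_number = dup_string(num);
    if (!a->pdb_line || !a->chain || !a->res_number) {
        std::free(a->pdb_line);
        std::free(a->chain);
        std::free(a->res_number);
        *a = atom_properties{}; // back to the all-nullptr state
        return false;
    }
    return true;
}
```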
I think after tidying we can merge this to the main PR, and I'll run through the larger set of PDBs again and do one last review of the code as a whole to check that everything's consistent. I think we can go to beta then, so people can start using it, and we'll have time to fix minor things that may come up before the final release. (For instance why we need to store residue number on the atom node).
Okay I'll get a cleaned up commit sometime today!
So just to be clear: what we needed in the atom node was the correct chain name. I added the residue name and residue number simply to be complete and have all the information in one place: the atom node.
The real bug was that some hetatm atoms are under the wrong parent chain node so they could not be located when parsing the cif file during the cif writing.
Thanks for the explanation! It makes total sense!!
Nice! 👍 Can't say for sure when I'll do the final review, possibly during the weekend, depending on the weather 🌞 🌧
Here is a working version of writing a CIF file.
It works with --hetatm and --hydrogen but has not been thoroughly tested.
Let's get this production ready!