Closed JoaoRodrigues closed 9 years ago
Hi, Glad it is useful!
The reason it's done like this is I wanted to keep the CLI as simple as possible, and then let users who want custom functionality use the API instead (or possibly the Python interface, which is still crude). But, I think reading parameters from a file is not a bad idea, at least as an option, I will look into how to that (shouldn't be too difficult).
In the mean-time, the API provides two methods of specifying your own radii. You can call freesasa_calc_coord()
(in freesasa.h
) and supply an array of coordinates and an array of custom radii. Another way to do it is to use freesasa_structure_r()
(in structure.h
) which takes a function pointer as argument, the function pointed to should return an atom radius based on residue- and atom-name strings. I will consider adding similar functionality for classes as well.
Thanks! My C/C++ skills are awful, at best, :) so I just hacked my way through the classes and replaced values whenever necessary.
The easiest way, in my opinion, would be a format like NACCESS, where you have a per-residue/per-atom definition. It's very redundant, but it's very flexible and probably still keeps a very good performance. The last digit is a polar/apolar key, the rest is self-explanatory.
RESIDUE ATOM ALA 5
ATOM N 1.65 1
ATOM CA 1.87 0
ATOM C 1.76 0
ATOM O 1.40 1
ATOM CB 1.87 0
RESIDUE ATOM ARG 11
ATOM N 1.65 1
ATOM CA 1.87 0
ATOM C 1.76 0
ATOM O 1.40 1
ATOM CB 1.87 0
ATOM CG 1.87 0
ATOM CD 1.87 0
ATOM NE 1.65 1
ATOM CZ 1.76 0
ATOM NH1 1.65 1
ATOM NH2 1.65 1
I was thinking of something along the lines of
area_classes:
apolar
polar
atom_classes:
C_CAR 1.55 apolar
C_ALI 2.0 apolar
C_ARO 1.75 apolar
N 1.55 polar
S 2.00 polar
O 1.40 polar
atoms:
ALA N N
ALA CA C_ALI
ALA C C_CAR
ALA O O
ALA CB C_ALI
ARG N N
ARG CA C_ALI
ARG C C_CAR
ARG O O
ARG CB C_ALI
ARG CG C_ALI
ARG CD C_ALI
ARG NE N
ARG CZ C_ALI
ARG NH1 N
ARG NH2 N
...
Which is removes some of the redundancies, and it allows more classes than polar apolar in a transparent way. But it's not a bad idea to allow people to reuse their NACCESS settings either.
That's perfect, more like a topology. Whatever is easier to also parse, I think as long there is a well defined format people will easily convert it. Again, thanks!
Got something that seems to work now in the dev branch (4c92887a038db7). Gonna add tests for more edge cases, add documentation, etc, before I release it to master. I generated an example config file (tests/data/oons.config
)for the OONS classification, that I also used for tests, and gives identical results for my test case (1UBQ).
I ended up with the format below. So the amino acid ANY
can be used to specify backbone atom properties, etc. If the values for a specific atom is changed in another amino acid, ANY
will be overridden for that amino-acid (like for CB
in PRO
, see the file). Haven't explored it yet, but this should allow easy addition of special groups in for example HETATM
, but for now activating HETATM
will include all atoms, so waters and unwanted ligands must be filtered from the input pdb beforehand (might do something about this later). Also hydrogens will still be ignored for now.
Thanks for coming with this suggestion!
types:
C_ALI 2.00 apolar
C_ARO 1.75 apolar
C_CAR 1.55 polar
N 1.55 polar
O 1.40 polar
S 2.00 polar
atoms:
ANY C C_CAR
ANY O O
ANY CA C_ALI
ANY N N
ANY CB C_ALI
ANY OXT O
ARG CG C_ALI
ARG CD C_ALI
ARG NE N
ARG CZ C_ALI
ARG NH1 N
ARG NH2 N
ASN CG C_CAR
ASN OD1 O
ASN ND2 N
...
Of course the code is completely fresh, so please tell me if you find any quirks or bugs.
Great, thanks! I tested it here and compared the values to my ad-hoc implementation. There are some slight differences that I cannot understand where they might come from. Example below. Many more are spread out throughout the file. Any idea why? These are quite small, mainly cherry-picking, but since I a better correlation with NACCESS ASAs before, just wondering if there isn't something funny going on.
==> 3bzd.config <==
ATOM 1 N ALA A 2 33.763 52.112 -44.090 1.00 50.23
ATOM 2 CA ALA A 2 33.403 52.722 -42.739 1.00 15.51
ATOM 3 C ALA A 2 31.960 53.249 -42.614 1.00 1.71
ATOM 4 O ALA A 2 31.694 54.512 -42.582 1.00 0.71
ATOM 5 CB ALA A 2 34.416 53.856 -42.420 1.00 53.27
ATOM 6 N ALA A 3 31.006 52.310 -42.598 1.00 9.61
ATOM 7 CA ALA A 3 29.599 52.628 -42.417 1.00 2.76
ATOM 8 C ALA A 3 29.149 54.015 -41.825 1.00 0.00
ATOM 9 O ALA A 3 28.776 54.906 -42.573 1.00 0.24
ATOM 10 CB ALA A 3 28.955 51.482 -41.644 1.00 24.61
==> 3bzd.joao <==
ATOM 1 N ALA A 2 33.763 52.112 -44.090 1.00 50.23
ATOM 2 CA ALA A 2 33.403 52.722 -42.739 1.00 15.51
ATOM 3 C ALA A 2 31.960 53.249 -42.614 1.00 1.71
ATOM 4 O ALA A 2 31.694 54.512 -42.582 1.00 0.71
ATOM 5 CB ALA A 2 34.416 53.856 -42.420 1.00 53.27
ATOM 6 N ALA A 3 31.006 52.310 -42.598 1.00 9.61
ATOM 7 CA ALA A 3 29.599 52.628 -42.417 1.00 2.83
ATOM 8 C ALA A 3 29.149 54.015 -41.825 1.00 0.01
ATOM 9 O ALA A 3 28.776 54.906 -42.573 1.00 0.25
ATOM 10 CB ALA A 3 28.955 51.482 -41.644 1.00 25.53
p.s. could be useful to print the selected atomic radius to the occupancy column of each atom. Helps spot bad assignments.
Hmm, I can't reproduce the problem with my OONS-configuration. Ran diff on pdb files from using the internal OONS categories and the ones in my config-file.
Will add the radii to the occupancy field, that makes sense.
Latest version of dev now writes radii as requested.
Now also in the master branch.
Perfect, had a problem in the config file.
Another small issue. There is an dependency between the order of the definitions in the config file and the attribution of indexes to classes. If you put an apolar type first, you will get the summed areas printed in the 'Polar: ' entry of the output. I tracked it down to the definition of FREESASA_POLAR=0 in freesasa.h. Maybe this should also become more generic, in case somebody wants to add extra types.
You should consider getting a DOI for this, or even a small application note, in order to get some citations. NACCESS is a very old but widely used program. This is essentially a much better, dependency-free version that will be used, if it's a bit publicised. Great job, really, I was already considering writing my own implementation when I found this.
Ah, good, I hadn't caught that error. Will make it use the string provided in the config. Should be allowed to use arbitrary number of classes with arbitrary names.
I have been planning to publish this somehow, but still want to iron out some things and think through the API once more before I make it official. Having somebody look at it really helps, so thank you!
This library also has it's roots in frustration with naccess :)
Works for arbitrary class-names now.
Thanks. I cleaned and updated my fork, added a 'share' folder with the naccess parameters. I won't issue a pull request, but if you want to take them, go ahead. A quick comparison using chain A of that PDB file in the tests folder gives a 0.99 R^2 correlation with NACCESS values (link to xlsx file). There are a few differences in some atom types (CE, NZ, NH2), but this might dilute with larger sampling.
Again, thanks and keep up the good work. I'll keep an eye. If you need some more testing, let me know, I think we will use this quite often from here on.
It's probably hard to get identical results since the algorithms are approximate, but 0.99 is already pretty good! Did you use Lee & Richards with the same settings as NACCESS (I think the default delta is 0.05 Å there)? Seems like the largest errors are for uncommon atom types, so you're probably right that it's a statistical thing.
It is definitely useful to have the NACCESS parameters, I will add them to the repo.
If you have a look at doc/manual.pdf
, the appendix has a precision analysis and comparison between the two algorithms. For really high precision (slow) they give practically identical results. The plots are for an earlier version of the program - they give a correct general idea, but I have made some optimizations that improve the scaling with number of atoms that I haven't made new plots for yet.
Using L&R with FreeSASA is about five times slower than NACCESS on my computer (not sure why), but using S&R-settings that give a similar degree of accuracy is about the same speed, so I recommend using S&R.
I have looked deeper into the comparison with NACCESS, by inspecting the source. The z-parameter in NACCESS is not the slice width, but is multiplied by the diameter of the atom (including probe). So for the standard z=0.05 the slice width will vary between 0.05 * 2 * (1.87+1.4) = 0.327 Å (largest carbon) and 0.0 * 2 * (1.4+1.4) = 0.28 Å (oxygen). This probably explains why the correlation is only 0.99 for atomic surface areas between FreeSASA and NACCESS. FreeSASA operates with a fixed slice width (which means slightly fewer slices for smaller atoms) and NACCESS with a fixed number of slices per atom.
Also, now that I understand how to compare the two, the two programs finish in approximately the same time on my computer when run with comparable precision (-d 0.3 in FreeSASA and -z 0.05 in NACCESS). This was a relief - I could not understand why FreeSASA would be so much slower.
Great! I had blamed it on the implementation (different language, different optimisations), but the differences were indeed specific for some atom types. That settles it then!
Btw, if you prefer working in Python the bindings are getting quite mature now. Enable them by configuring with --enable-python-binding
. I have to admit I have very little Python experience, but they compile without hassle (using Cython) and pass all my tests. The function testCalc() in bindings/pyhton/test.py should illustrate the the simplest use cases. If this is useful for you I would be happy for any feedback.
I saw them, will use them if we change our code as well. As of now, it's a simple unique call so, quite an overkill to use the bindings. But I'm happy to go over the code as soon as I have some time.
I have extended the file naccess.config to include nucleic acids, and exported the definitions to C code (using the script script/config2c.pl
) so that
$ freesasa -c share/naccess.config 1abc.pdb
will give the same output as
$ freesasa --radii=naccess 1abc.pdb
This means you don't have to keep track of the config-files, and additionally allowed me to add RSA output as an option. The command:
$ freesasa --radii=naccess --rsa-file=abc.rsa 1abc.pdb
will give an RSA-file very similar to that of NACCESS (run with the -b
flag). (I did it this way because I need a way of making sure the radii correspond to the reference values used to calculate the REL columns in the RSA file).
@mittinatten That makes things much easier alright, thanks!
Hi there,
Great library, exactly what I was looking for to replace an outdated dependency.
One suggestion: it might be less efficient, but it would be much easier to have a parameter file with the radii and classes for the calculations. As is, it takes a while to edit the code in classify.c/h .. Also makes it easier to run comparisons between different sets of parameters.