mittinatten / freesasa

C-library for calculating Solvent Accessible Surface Areas
http://freesasa.github.io/
MIT License
105 stars 37 forks source link

Read Parameters from External File #5

Closed JoaoRodrigues closed 9 years ago

JoaoRodrigues commented 9 years ago

Hi there,

Great library, exactly what I was looking for to replace an outdated dependency.

One suggestion: it might be less efficient, but it would be much easier to have a parameter file with the radii and classes for the calculations. As is, it takes a while to edit the code in classify.c/h .. Also makes it easier to run comparisons between different sets of parameters.

mittinatten commented 9 years ago

Hi, Glad it is useful!

The reason it's done like this is I wanted to keep the CLI as simple as possible, and then let users who want custom functionality use the API instead (or possibly the Python interface, which is still crude). But, I think reading parameters from a file is not a bad idea, at least as an option, I will look into how to that (shouldn't be too difficult).

In the mean-time, the API provides two methods of specifying your own radii. You can call freesasa_calc_coord() (in freesasa.h) and supply an array of coordinates and an array of custom radii. Another way to do it is to use freesasa_structure_r() (in structure.h) which takes a function pointer as argument, the function pointed to should return an atom radius based on residue- and atom-name strings. I will consider adding similar functionality for classes as well.

JoaoRodrigues commented 9 years ago

Thanks! My C/C++ skills are awful, at best, :) so I just hacked my way through the classes and replaced values whenever necessary.

The easiest way, in my opinion, would be a format like NACCESS, where you have a per-residue/per-atom definition. It's very redundant, but it's very flexible and probably still keeps a very good performance. The last digit is a polar/apolar key, the rest is self-explanatory.

RESIDUE ATOM ALA  5
ATOM  N   1.65 1
ATOM  CA  1.87 0
ATOM  C   1.76 0
ATOM  O   1.40 1
ATOM  CB  1.87 0
RESIDUE ATOM ARG 11
ATOM  N   1.65 1
ATOM  CA  1.87 0
ATOM  C   1.76 0
ATOM  O   1.40 1
ATOM  CB  1.87 0
ATOM  CG  1.87 0
ATOM  CD  1.87 0
ATOM  NE  1.65 1
ATOM  CZ  1.76 0
ATOM  NH1 1.65 1
ATOM  NH2 1.65 1
mittinatten commented 9 years ago

I was thinking of something along the lines of

area_classes:
apolar
polar

atom_classes:
C_CAR 1.55 apolar
C_ALI 2.0 apolar
C_ARO 1.75 apolar
N 1.55 polar
S 2.00 polar
O 1.40 polar

atoms:
ALA N  N
ALA CA C_ALI
ALA C  C_CAR
ALA O  O
ALA CB C_ALI
ARG N  N
ARG CA C_ALI
ARG C  C_CAR
ARG O  O
ARG CB C_ALI
ARG CG C_ALI
ARG CD C_ALI
ARG NE N
ARG CZ C_ALI
ARG NH1 N
ARG NH2 N
...

Which is removes some of the redundancies, and it allows more classes than polar apolar in a transparent way. But it's not a bad idea to allow people to reuse their NACCESS settings either.

JoaoRodrigues commented 9 years ago

That's perfect, more like a topology. Whatever is easier to also parse, I think as long there is a well defined format people will easily convert it. Again, thanks!

mittinatten commented 9 years ago

Got something that seems to work now in the dev branch (4c92887a038db7). Gonna add tests for more edge cases, add documentation, etc, before I release it to master. I generated an example config file (tests/data/oons.config)for the OONS classification, that I also used for tests, and gives identical results for my test case (1UBQ).

I ended up with the format below. So the amino acid ANY can be used to specify backbone atom properties, etc. If the values for a specific atom is changed in another amino acid, ANY will be overridden for that amino-acid (like for CB in PRO, see the file). Haven't explored it yet, but this should allow easy addition of special groups in for example HETATM, but for now activating HETATM will include all atoms, so waters and unwanted ligands must be filtered from the input pdb beforehand (might do something about this later). Also hydrogens will still be ignored for now.

Thanks for coming with this suggestion!

types:
C_ALI 2.00 apolar 
C_ARO 1.75 apolar
C_CAR 1.55 polar
N 1.55 polar
O 1.40 polar
S 2.00 polar 

atoms:
ANY C   C_CAR
ANY O   O
ANY CA  C_ALI
ANY N   N
ANY CB  C_ALI
ANY OXT O

ARG CG C_ALI
ARG CD C_ALI
ARG NE N
ARG CZ C_ALI
ARG NH1 N
ARG NH2 N

ASN CG  C_CAR
ASN OD1 O
ASN ND2 N
...
mittinatten commented 9 years ago

Of course the code is completely fresh, so please tell me if you find any quirks or bugs.

JoaoRodrigues commented 9 years ago

Great, thanks! I tested it here and compared the values to my ad-hoc implementation. There are some slight differences that I cannot understand where they might come from. Example below. Many more are spread out throughout the file. Any idea why? These are quite small, mainly cherry-picking, but since I a better correlation with NACCESS ASAs before, just wondering if there isn't something funny going on.

==> 3bzd.config <==
ATOM      1  N   ALA A   2      33.763  52.112 -44.090  1.00 50.23
ATOM      2  CA  ALA A   2      33.403  52.722 -42.739  1.00 15.51
ATOM      3  C   ALA A   2      31.960  53.249 -42.614  1.00  1.71
ATOM      4  O   ALA A   2      31.694  54.512 -42.582  1.00  0.71
ATOM      5  CB  ALA A   2      34.416  53.856 -42.420  1.00 53.27
ATOM      6  N   ALA A   3      31.006  52.310 -42.598  1.00  9.61
ATOM      7  CA  ALA A   3      29.599  52.628 -42.417  1.00  2.76
ATOM      8  C   ALA A   3      29.149  54.015 -41.825  1.00  0.00
ATOM      9  O   ALA A   3      28.776  54.906 -42.573  1.00  0.24
ATOM     10  CB  ALA A   3      28.955  51.482 -41.644  1.00 24.61

==> 3bzd.joao <==
ATOM      1  N   ALA A   2      33.763  52.112 -44.090  1.00 50.23
ATOM      2  CA  ALA A   2      33.403  52.722 -42.739  1.00 15.51
ATOM      3  C   ALA A   2      31.960  53.249 -42.614  1.00  1.71
ATOM      4  O   ALA A   2      31.694  54.512 -42.582  1.00  0.71
ATOM      5  CB  ALA A   2      34.416  53.856 -42.420  1.00 53.27
ATOM      6  N   ALA A   3      31.006  52.310 -42.598  1.00  9.61
ATOM      7  CA  ALA A   3      29.599  52.628 -42.417  1.00  2.83
ATOM      8  C   ALA A   3      29.149  54.015 -41.825  1.00  0.01
ATOM      9  O   ALA A   3      28.776  54.906 -42.573  1.00  0.25
ATOM     10  CB  ALA A   3      28.955  51.482 -41.644  1.00 25.53
JoaoRodrigues commented 9 years ago

p.s. could be useful to print the selected atomic radius to the occupancy column of each atom. Helps spot bad assignments.

mittinatten commented 9 years ago

Hmm, I can't reproduce the problem with my OONS-configuration. Ran diff on pdb files from using the internal OONS categories and the ones in my config-file.

Will add the radii to the occupancy field, that makes sense.

mittinatten commented 9 years ago

Latest version of dev now writes radii as requested.

mittinatten commented 9 years ago

Now also in the master branch.

JoaoRodrigues commented 9 years ago

Perfect, had a problem in the config file.

Another small issue. There is an dependency between the order of the definitions in the config file and the attribution of indexes to classes. If you put an apolar type first, you will get the summed areas printed in the 'Polar: ' entry of the output. I tracked it down to the definition of FREESASA_POLAR=0 in freesasa.h. Maybe this should also become more generic, in case somebody wants to add extra types.

You should consider getting a DOI for this, or even a small application note, in order to get some citations. NACCESS is a very old but widely used program. This is essentially a much better, dependency-free version that will be used, if it's a bit publicised. Great job, really, I was already considering writing my own implementation when I found this.

mittinatten commented 9 years ago

Ah, good, I hadn't caught that error. Will make it use the string provided in the config. Should be allowed to use arbitrary number of classes with arbitrary names.

I have been planning to publish this somehow, but still want to iron out some things and think through the API once more before I make it official. Having somebody look at it really helps, so thank you!

This library also has it's roots in frustration with naccess :)

mittinatten commented 9 years ago

Works for arbitrary class-names now.

JoaoRodrigues commented 9 years ago

Thanks. I cleaned and updated my fork, added a 'share' folder with the naccess parameters. I won't issue a pull request, but if you want to take them, go ahead. A quick comparison using chain A of that PDB file in the tests folder gives a 0.99 R^2 correlation with NACCESS values (link to xlsx file). There are a few differences in some atom types (CE, NZ, NH2), but this might dilute with larger sampling.

Again, thanks and keep up the good work. I'll keep an eye. If you need some more testing, let me know, I think we will use this quite often from here on.

mittinatten commented 9 years ago

It's probably hard to get identical results since the algorithms are approximate, but 0.99 is already pretty good! Did you use Lee & Richards with the same settings as NACCESS (I think the default delta is 0.05 Å there)? Seems like the largest errors are for uncommon atom types, so you're probably right that it's a statistical thing.

It is definitely useful to have the NACCESS parameters, I will add them to the repo.

If you have a look at doc/manual.pdf, the appendix has a precision analysis and comparison between the two algorithms. For really high precision (slow) they give practically identical results. The plots are for an earlier version of the program - they give a correct general idea, but I have made some optimizations that improve the scaling with number of atoms that I haven't made new plots for yet.

Using L&R with FreeSASA is about five times slower than NACCESS on my computer (not sure why), but using S&R-settings that give a similar degree of accuracy is about the same speed, so I recommend using S&R.

mittinatten commented 9 years ago

I have looked deeper into the comparison with NACCESS, by inspecting the source. The z-parameter in NACCESS is not the slice width, but is multiplied by the diameter of the atom (including probe). So for the standard z=0.05 the slice width will vary between 0.05 * 2 * (1.87+1.4) = 0.327 Å (largest carbon) and 0.0 * 2 * (1.4+1.4) = 0.28 Å (oxygen). This probably explains why the correlation is only 0.99 for atomic surface areas between FreeSASA and NACCESS. FreeSASA operates with a fixed slice width (which means slightly fewer slices for smaller atoms) and NACCESS with a fixed number of slices per atom.

Also, now that I understand how to compare the two, the two programs finish in approximately the same time on my computer when run with comparable precision (-d 0.3 in FreeSASA and -z 0.05 in NACCESS). This was a relief - I could not understand why FreeSASA would be so much slower.

JoaoRodrigues commented 9 years ago

Great! I had blamed it on the implementation (different language, different optimisations), but the differences were indeed specific for some atom types. That settles it then!

mittinatten commented 9 years ago

Btw, if you prefer working in Python the bindings are getting quite mature now. Enable them by configuring with --enable-python-binding. I have to admit I have very little Python experience, but they compile without hassle (using Cython) and pass all my tests. The function testCalc() in bindings/pyhton/test.py should illustrate the the simplest use cases. If this is useful for you I would be happy for any feedback.

JoaoRodrigues commented 9 years ago

I saw them, will use them if we change our code as well. As of now, it's a simple unique call so, quite an overkill to use the bindings. But I'm happy to go over the code as soon as I have some time.

mittinatten commented 8 years ago

I have extended the file naccess.config to include nucleic acids, and exported the definitions to C code (using the script script/config2c.pl) so that

 $ freesasa -c share/naccess.config 1abc.pdb

will give the same output as

$ freesasa --radii=naccess 1abc.pdb

This means you don't have to keep track of the config-files, and additionally allowed me to add RSA output as an option. The command:

$ freesasa --radii=naccess --rsa-file=abc.rsa 1abc.pdb

will give an RSA-file very similar to that of NACCESS (run with the -b flag). (I did it this way because I need a way of making sure the radii correspond to the reference values used to calculate the REL columns in the RSA file).

JoaoRodrigues commented 8 years ago

@mittinatten That makes things much easier alright, thanks!