Relative Accessibilities

JoaoRodrigues commented 7 years ago

Hi Simon,

Great to see you keep working on FreeSASA!

I was trying to calculate relative solvent accessibilities (per-residue) and found some trouble.

It took me a while to find the relative accessibilities scales used to normalize the per-residue SASA values in the code. Maybe making them more explicit (in the config file?) would be more obvious? Like at the header of each entry (total, backbone, side-chain, etc).

ARG 123.45 67.89 01.23
ARG CG C_ALI
ARG CD C_ALI
ARG NE N_AMD
ARG CZ C_CAR
ARG NH1 N_AMD
ARG NH2 N_AMD

What exactly are the normalization values? I checked classifier_naccess.c and for example, for Arginine, it shows:
```
Total = 232.08
Main Chain = 45.58
Side Shain = 186.50
```

If I look at the reference values in NACCESS (which are what SASAs are normalized for) it shows:

Total = 238.76
Main Chain = 37.51
Side Chain = 201.25

What are the reference values then? In NACCESS, the author calculated them using a specific set of Ala-X-Ala peptides in extended conformation. I got different values using extended conformations generated with Pymol, so YMMV ... But if you provide NACCESS values, maybe sticking to those in the code would be better.

Cheers!

mittinatten commented 7 years ago

Hi Joao,

I agree that allowing the user to provide reference values in the config-file would be a useful addition, just never got around to implementing it. So your request is a perfect occasion have a look at it again.
The extended Ala-X-Ala conformations used to calculate maximum SASA can be found in scripts/rsa/. The configurations were generated using Profasi (http://cbbp.thep.lu.se/activities/profasi/). The files classifier_*.c were generated using scripts/config2c.pl by using classifier configuration files and the AXA input. The reason I did it this way is I didn't have access to the exact conformations used for NACCESS, and I wanted to use the same configurations for OONS, ProtOr, and NACCESS radii. But, now that I had a closer look at the Ala-X-Ala-files, I see some are reasonable and some are quite bad. Something went a bit too fast here (embarassing that this slipped through). So thanks for noticing the discrepancies here, this needs to be fixed as quickly as possible.

mittinatten commented 7 years ago

So, to begin with, a quick fix in eb2e6d3. I generated tripeptides in Pymol and recalculated reference values from those. Probably it's better to have standard values for NACCESS, will think about what would be the best syntax for specifying that in the config file.

JoaoRodrigues commented 7 years ago

I was about to send a pull request with the original NACCESS values. I contacted the author a while ago (when I first tested FreeSASA vs NACCESS actually) and he shared with me the set of extended peptides. I can share these with you if you want to have a set.

mittinatten commented 7 years ago

That would be great, if we could add those to the repository it would be even better. (He seems to be positive to the project in general https://f1000research.com/articles/5-189/v1#referee-response-12524)

JoaoRodrigues commented 7 years ago

Will do, thanks!

As for point 1, that's great by the way!

mittinatten commented 7 years ago

I have merged #28, thanks!

For the few amino acids I sampled the totals are now close (but not identical). I'm not sure what resolution was used to for naccess, but I used n=1000 (corresponds to z=0.001) in the calculation. But, there are differences in the definitions of main-chain/side-chain and polar/apolar, so these values are still different. I use C+N+O+CA as main-chain, and define all carbons as apolar.

I haven't verified that the classifier for naccess gives the same classes for all atoms as in naccess. Perhaps we need more than three types of carbon.

Possibly we should allow the configuration files to define which atoms are backbone and which are not. At the moment this is a global function.

JoaoRodrigues commented 7 years ago

So, when I benchmarked the code a couple of years back, the results were exactly the same. Unless you changed something, they should be the same. NACCESS uses z=0.05. I can have a look at reproducing the numbers exactly later in the month, don't worry about it!

Thanks again!

mittinatten commented 7 years ago

Yes, I also had very exact benchmark results when I wrote the paper. I would want to use high resolution for the default configuration, but it makes sense to make NACCESS a special case.

If using z=0.05 doesn't help, I'll have a look at the polar/apolar definitions, perhaps they don't match the original completely.

mittinatten commented 7 years ago

Ok, I've investigated further. I used ARG.pdb as an example. I get very close to identical RSA files in the ABS columns using z=0.01 in NACCESS and n=100 in FreeSASA, or z=0.05 and n=20. If a calculation is the same as the reference all REL columns should be 100.0. That's what I get with FreeSASA at n=1000 (which is verified in the test-suite).

NACCESS z=0.01:

REM RES _ NUM      All-atoms   Total-Side   Main-Chain    Non-polar    All polar
REM                ABS   REL    ABS   REL    ABS   REL    ABS   REL    ABS   REL
RES ALA S   1   164.19 152.1  83.16 119.8  81.03 210.2  85.28 119.5  78.91 215.7
RES ARG S   2   237.97  99.7 200.35  99.6  37.61 100.3  77.06  99.0 160.91 100.0
RES ALA S   3   159.00 147.3  71.06 102.4  87.94 228.2  83.76 117.3  75.24 205.7

FreeSASA n=100:

REM RES _ NUM      All-atoms   Total-Side   Main-Chain    Non-polar    All polar
REM                ABS   REL    ABS   REL    ABS   REL    ABS   REL    ABS   REL
RES ALA S   1   164.19 152.2  68.20 106.7  95.99 218.5  85.28 119.8  78.91 215.0
RES ARG S   2   237.97  99.8 196.20  99.8  41.76 100.1  77.06  99.8 160.91  99.9
RES ALA S   3   159.00 147.4  62.41  97.6  96.59 219.8  83.76 117.7  75.24 205.0

NACCESS z=0.05:

REM RES _ NUM      All-atoms   Total-Side   Main-Chain    Non-polar    All polar
REM                ABS   REL    ABS   REL    ABS   REL    ABS   REL    ABS   REL
RES ALA S   1   163.75 151.7  82.84 119.3  80.91 209.9  84.99 119.1  78.76 215.3
RES ARG S   2   239.14 100.2 201.08  99.9  38.07 101.5  77.20  99.2 161.95 100.6
RES ALA S   3   158.42 146.8  70.35 101.4  88.07 228.5  83.07 116.4  75.35 206.0

FreeSASA n=20:

REM RES _ NUM      All-atoms   Total-Side   Main-Chain    Non-polar    All polar
REM                ABS   REL    ABS   REL    ABS   REL    ABS   REL    ABS   REL
RES ALA S   1   163.75 151.8  68.03 106.4  95.72 217.8  84.99 119.4  78.76 214.5
RES ARG S   2   239.14 100.3 197.29 100.3  41.85 100.3  77.20 100.0 161.95 100.5
RES ALA S   3   158.42 146.8  62.31  97.4  96.11 218.7  83.07 116.7  75.35 205.3

NACCESS is off by a few tenths of a percent in both cases, so either the reference was calculated using some other z, or there could be some round off error involved, or something has changed in NACCESS between when the reference values were calculated and the version I'm using (2.1.1).

I tried z=0.005 and z=0.001 and got similar results: REL values were off by 0.1-0.2 percent.

Although being identical to legacy has its benefits, so does internal consistency, so I'm not sure what's best here.

JoaoRodrigues commented 7 years ago

I quickly checked the naccess.config file, I think you count CB as Main-Chain. Could you double check that? Otherwise, it's perfect (except for rounding errors here and there).

mittinatten commented 7 years ago

Main-chain is defined statically by the function freesasa_atom_is_backbone() https://github.com/mittinatten/freesasa/blob/master/src/classifier.c#L915

So it can't be defined in config-files at the moment. As you can ses from the code main-chain is C, N, O, CA and OXT. I think NACCESS excludes CA from backbone. Can't check right now, but that should be obvious if you run it for the Gly-tripeptide.

JoaoRodrigues commented 7 years ago

You are right, from the NACCESS README.

By default, alpha carbon atoms are considered to be part of the amino acid side chain so that glycines possess a relative side chain accessibility. To switch this off, use the -b option. WARNING: Please note that when using this option, the sidechain and mainchain %accessibilities for all residues will be wrong.

Interestingly, I also ran NACCESS on the tripeptides and I get slightly different relative accessibilities compared to the standard.data file but I cannot figure out where the little discrepancy comes from. Nitpicking anyway, I'd just recalculate standard.data based on your code and the original peptides, to be consistent.

mittinatten commented 7 years ago

Could possibly provide something similar to the -b option, although that would require some rewiring. Or, just make a note somewhere that FreeSASA with naccess-configuration corresponds to the naccess -b option, and that reference SASAs used to calculate REL were generated using the -b definition for main-chain/side-chain.

I checked all 20 reference PDBs and all five ABS columns are identical between FreeSASA and naccess with option -b. So the backbone definitions are under control, and the polar/apolar definitions are the same.

So aef643bbbaab12c2 already has reference values calculated from the new PDBs, and should be ready to use.

mittinatten commented 6 years ago

Closing this issue, the remaining tasks have been moved to #31. Will save them for a later release (2.0.3?).

mittinatten / freesasa

Relative Accessibilities #27