New Protein Descriptor: Symmetric extractDC

The current extractDC is not symmetric: it counts "XY" and "YX" as separate dipeptides, which generates 400 keys (20 x 20). This has some drawbacks:

Proteins smaller than 400 AA: cannot contain all keys (a protein of n AA has only n - 1 dipeptides);
Proteins between 400 - 1000 AA: a significant number of keys will still have counts of 0 or 1;

It may be wise to implement a symmetric descriptor, where "XY" == "YX":

Statistical power: is likely to increase (as most counts will increase);
I feel that there are no functional differences between "XY" and "YX" at the protein level;

The symmetric variant would have 210 keys (20 * 21 / 2) instead of 400 keys, e.g. "AA", "AC", "AD", ..., "XY", with the "X" letter sorted before the "Y" letter. The proportions could be normalized by dividing by (2*n-2), where n = number of AA in the protein.
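
To make the proposal concrete, here is a minimal sketch in Python (not protr code, and not a proposed implementation for the package). The function name symmetric_dc and the constants are hypothetical. It also assumes one particular reading of the (2*n-2) normalization: each adjacent pair is tallied in both orientations ("XY" and "YX"), so the total tally is 2*(n-1) and the proportions sum to 1. Other readings are possible.

```python
from collections import Counter
from itertools import combinations_with_replacement

# The 20 standard amino acids in alphabetical order (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# All unordered pairs with the first letter sorted before (or equal to)
# the second: 20 * 21 / 2 = 210 keys, "AA", "AC", ..., "YY".
SYMMETRIC_KEYS = ["".join(p) for p in combinations_with_replacement(AMINO_ACIDS, 2)]


def symmetric_dc(sequence):
    """Hypothetical symmetric dipeptide composition ("XY" == "YX").

    Assumption: every adjacent pair is counted in both orientations,
    so the tallies add up to 2*n - 2 and are divided by that total.
    """
    seq = sequence.upper()
    unknown = set(seq) - set(AMINO_ACIDS)
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")
    n = len(seq)
    if n < 2:
        raise ValueError("need at least 2 residues to form a dipeptide")

    counts = Counter()
    for a, b in zip(seq, seq[1:]):
        # Canonical key: letters in alphabetical order.
        key = a + b if a <= b else b + a
        # +2 because the pair is tallied once as "XY" and once as "YX".
        counts[key] += 2

    denominator = 2 * n - 2
    return {key: counts[key] / denominator for key in SYMMETRIC_KEYS}
```

For the toy sequence "ACCA" the adjacent pairs are AC, CC and CA, so this sketch assigns 2/3 to "AC" and 1/3 to "CC", and 0 to the remaining 208 keys.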

It would be interesting to compare this descriptor against the current extractDC on real-life protein data sets.

nanxstats / protr

New Protein Descriptor: Symmetric extractDC #44