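The current extractDC is not symmetric: it treats "XY" and "YX" as different dipeptides, which generates 400 keys (20 x 20). This has some drawbacks:
Proteins shorter than 400 AA cannot contain all keys, since a protein with n residues has only n-1 overlapping dipeptides;
Proteins between 400 and 1000 AA: a significant number of keys will still have counts of 0 or 1.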
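To make the sparsity argument concrete, here is a minimal standalone Python sketch. It is not the existing extractDC code; the function names `ordered_dipeptide_counts` and `sparsity` and the toy sequence are made up for illustration. It counts ordered dipeptides over all 400 keys and reports how many keys end up with a count of 0 or 1.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def ordered_dipeptide_counts(seq):
    """Count ordered dipeptides ("XY" != "YX") over all 400 keys (hypothetical helper)."""
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # pairs with non-standard letters are skipped
            counts[pair] += 1
    return counts

def sparsity(counts):
    """Return (number of keys with count 0, number of keys with count 1)."""
    tally = Counter(counts.values())
    return tally[0], tally[1]

# Toy sequence: 40 residues, hence 39 overlapping dipeptides.
seq = AMINO_ACIDS * 2
counts = ordered_dipeptide_counts(seq)
zeros, ones = sparsity(counts)
print(len(counts), sum(counts.values()), zeros, ones)
```

A sequence of n residues can fill at most n-1 of the 400 keys, so for short proteins most of the vector is necessarily zero.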
It may be wise to implement a symmetric descriptor, where "XY" == "YX":
Statistical power is likely to increase, since pooling "XY" with "YX" will raise most counts;
I feel that there are no functional differences between "XY" and "YX" at the protein level.
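The symmetric variant would have 210 keys instead of 400 (20 keys of the form "XX" plus 190 unordered pairs), e.g. "AA", "AC", "AD", ..., where each key "XY" is written with the letter "X" not after the letter "Y" in alphabetical order. The proportions could be normalized by dividing by (2*n-2), i.e. twice the n-1 adjacent pairs, where n = number of AA in the protein.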
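A minimal Python sketch of the proposed descriptor follows. It is an illustration under stated assumptions, not an implementation of extractDC: the name `symmetric_dc` is hypothetical, and the (2*n-2) denominator is read here as "each adjacent pair is counted once per reading direction", so every pair adds 2 to its key. Under that reading the result equals counting each pair once and dividing by n-1.

```python
from itertools import combinations_with_replacement

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, alphabetical

def symmetric_dc(seq):
    """Symmetric dipeptide composition with "XY" == "YX" (hypothetical helper).

    Returns a dict with 210 keys; each key has its two letters in
    alphabetical order.
    """
    if len(seq) < 2:
        raise ValueError("need at least 2 residues to form a dipeptide")
    keys = [a + b for a, b in combinations_with_replacement(AMINO_ACIDS, 2)]
    counts = dict.fromkeys(keys, 0)
    for i in range(len(seq) - 1):
        key = "".join(sorted(seq[i:i + 2]))  # "CA" and "AC" both map to "AC"
        if key in counts:  # pairs with non-standard letters are skipped
            counts[key] += 2  # assumption: one count per reading direction
    total = 2 * len(seq) - 2  # the (2*n-2) denominator from the proposal
    return {k: v / total for k, v in counts.items()}

d = symmetric_dc("ACCA")  # pairs: AC, CC, CA
print(len(d), d["AC"], d["CC"])
```

If the sequence contains only standard residues, the 210 proportions sum to 1. Sequences with other letters lose the skipped pairs, so the sum falls below 1 unless the denominator is adjusted.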
It would be interesting to compare this descriptor against the current extractDC on real-life protein data sets.
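One simple way to start such a comparison, sketched in Python, is to measure for each protein the fraction of keys whose raw count is 0 or 1 under both schemes. The function name `low_count_fraction` and the `threshold` parameter are hypothetical, and the snippet makes no claim about what real data sets would show.

```python
from collections import Counter

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues

def low_count_fraction(seq, symmetric, threshold=1):
    """Fraction of descriptor keys whose raw count is <= threshold.

    symmetric=False uses the 400 ordered keys, symmetric=True the 210
    unordered keys ("XY" == "YX").
    """
    n_keys = 210 if symmetric else 400
    observed = Counter()
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if not set(pair) <= AMINO_ACIDS:
            continue  # skip pairs with non-standard letters
        observed["".join(sorted(pair)) if symmetric else pair] += 1
    well_covered = sum(1 for c in observed.values() if c > threshold)
    return (n_keys - well_covered) / n_keys

seq = "ACCA"  # toy input; real sequences would come from a data set
ordered_frac = low_count_fraction(seq, symmetric=False)
symmetric_frac = low_count_fraction(seq, symmetric=True)
print(ordered_frac, symmetric_frac)
```

Running this over every sequence in a data set gives two numbers per protein that can be compared directly, for example binned by protein length.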
New Protein Descriptor: Symmetric extractDC