girke-lab / chemmine-ei

GNU General Public License v3.0

add FPset folding function #49

Closed khoran closed 10 years ago

khoran commented 11 years ago

Should the user be able to say how many bits are used for a particular fingerprint, or should we always return all of them? Or should we set a maximum size ourselves? The OpenBabel function has an option that takes the number of bits and folds the fingerprint down to that size.

I'm not sure whether it would be an advantage to do the folding with OpenBabel or downstream on the FPset generated with as(fpma, "FPset"). I remember we discussed before whether we should have a generic fingerprint folding function in ChemmineR, which would be easy to implement. Perhaps we could keep track of the fingerprint type and folding stage in the FPset object? That way one could avoid mixing different types of fingerprints, which would give nonsense results. My suggestion would be to use the OpenBabel fingerprints as they are and then add a folding function. The latter should go on our to-do list anyway.
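To make the idea of a generic downstream folding step concrete, here is a minimal sketch in Python (not ChemmineR code). It assumes the common convention of folding a fingerprint in half by OR-ing the upper half onto the lower half; the function name `fold_once` and the list-of-0/1 representation are hypothetical and chosen only for illustration.

```python
def fold_once(bits):
    """Fold a fingerprint in half by OR-ing the upper half onto the lower half.

    `bits` is a list of 0/1 integers (an assumed representation, not the
    FPset format). The length must be even so the halves line up.
    """
    if len(bits) % 2 != 0:
        raise ValueError("fingerprint length must be even to fold")
    half = len(bits) // 2
    # A bit is set in the result if it is set in either half.
    return [a | b for a, b in zip(bits[:half], bits[half:])]


fp = [1, 0, 0, 0, 0, 1, 0, 1]
print(fold_once(fp))  # [1, 1, 0, 1]
```

Because folding only ever merges bits, it loses information but never turns a set bit into an unset one, which is why comparing fingerprints folded to different levels would not be meaningful.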

khoran commented 10 years ago

For fingerprints we can stick to a constant length for all entries in an FPset. The print function for FPset/FP already returns the number of bits, but I believe this is computed on the fly with dim(). To address the folding level, I suggest adding a character vector slot to the FPset/FP classes. The constructors of the object should populate this "FP type" vector for tracking purposes; it could also be used to issue warnings when the user appends fingerprints that don't have exactly the same values stored in the "FP type" vector. This vector could contain:

Fingerprint_Type = some_default_value_that_can_be_set_by_user
Folding_level = default_is_0
Other = not_sure_if_anything_else_is_needed?

Fingerprint_Type could be populated automatically by the different fingerprint functions in ChemmineR: for instance, "APfp" for atom pair fingerprints, and "MACCS" and others for those computed by OpenBabel. For fingerprints imported from external files, the user would be asked to provide this value.
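The tracking idea above could look roughly like the following Python sketch (an illustration only, not the ChemmineR S4 classes). The class name `FPSet`, its fields `fp_type` and `folding_level`, and the `append` method are all hypothetical stand-ins for the proposed "FP type" slot; the only behavior taken from the discussion is that appending fingerprints with different metadata triggers a warning.

```python
import warnings
from dataclasses import dataclass


@dataclass
class FPSet:
    """Toy fingerprint set carrying the proposed tracking metadata."""
    rows: list                  # one list of 0/1 bits per compound
    fp_type: str = "unknown"    # e.g. "APfp"; user-supplied for imported files
    folding_level: int = 0      # 0 means the fingerprint has not been folded

    def append(self, other):
        """Return a combined set, warning if the metadata does not match."""
        mine = (self.fp_type, self.folding_level)
        theirs = (other.fp_type, other.folding_level)
        if mine != theirs:
            warnings.warn(
                f"appending fingerprints with different metadata: "
                f"{mine} vs {theirs}"
            )
        return FPSet(self.rows + other.rows, self.fp_type, self.folding_level)


a = FPSet([[1, 0, 1, 0]], fp_type="APfp")
b = FPSet([[0, 1, 1, 0]], fp_type="MACCS")
combined = a.append(b)  # emits a UserWarning about mismatched metadata
```

Whether a mismatch should be a warning or a hard error is a design choice left open in the thread; the sketch uses a warning because that is what the comment proposes.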

Thomas

On Tue, Dec 03, 2013 at 11:47:03PM +0000, Kevin Horan wrote:

Thomas, I was looking into this problem again. I can create a folding function for FP and FPset easily. Do you want to say that all FPs in an FPset must have the same number of bits? If not, we would have to abandon the current matrix representation and store a list of FPs. Probably better not to do that. It would also mean a restriction on what FPs can be added to an FPset (i.e., they must have the same number of bits). Do you think it would be important to keep track of how many times an FP/FPset has been folded, or is it just the current number of bits that matters? Here are some proposed names and use cases; let me know what you want to add/change.

  1. fold an FPset n times: foldFP(fpset,n)
  2. fold an FPset down to m bits: foldFP(fpset, bits=m) # throw an exception if m is not possible
  3. fold an FP n times or down to m bits: foldFP(fp,n) ; foldFP(fp,bits=m)
  4. ask how many bits an FP/FPset has: numBits(fpset)
  5. ask how many times an FP/FPset has been folded (needed?): numFolds(fpset)
  6. add FP to FPset: c(fpset,fp) #throw exception if number of bits is different
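The first use cases in the list above can be sketched as follows. This is a hedged Python illustration, not the eventual ChemmineR implementation: `fold_fp` and `num_bits` are hypothetical analogues of the proposed `foldFP` and `numBits`, the fingerprint set is modeled as a plain list of equal-length 0/1 rows, and folding is assumed to halve the width by OR-ing the upper half onto the lower half.

```python
def num_bits(rows):
    """Number of bits per fingerprint (0 for an empty set)."""
    return len(rows[0]) if rows else 0


def fold_fp(rows, n=None, bits=None):
    """Fold every row n times, or down to exactly `bits` bits.

    Raises ValueError if neither or both arguments are given, if rows
    differ in length, or if the requested size cannot be reached by
    repeated halving.
    """
    if (n is None) == (bits is None):
        raise ValueError("specify exactly one of n or bits")
    width = num_bits(rows)
    if any(len(r) != width for r in rows):
        raise ValueError("all fingerprints must have the same number of bits")
    if bits is not None:
        # Work out how many halvings reach the target width, if any do.
        n, w = 0, width
        while w > bits and w % 2 == 0:
            w //= 2
            n += 1
        if w != bits:
            raise ValueError(f"cannot fold {width} bits down to {bits}")
    for _ in range(n):
        if width % 2 != 0:
            raise ValueError(f"cannot fold an odd width of {width} bits")
        half = width // 2
        rows = [[a | b for a, b in zip(r[:half], r[half:])] for r in rows]
        width = half
    return rows


fpset = [[1, 0, 0, 0, 0, 1, 0, 1],
         [0, 0, 1, 0, 0, 0, 1, 0]]
print(fold_fp(fpset, n=1))     # [[1, 1, 0, 1], [0, 0, 1, 0]]
print(fold_fp(fpset, bits=2))  # [[1, 1], [1, 0]]
```

Note that in this sketch the number of folds is not stored; if `numFolds` is wanted, it would have to be recorded alongside the data, since it cannot be recovered from the current width alone without knowing the original width.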