pablocarb / Latent

ML tools for synbio
MIT License

Sequence encoding #1

Open pablocarb opened 6 years ago

pablocarb commented 6 years ago

It seems that the encoding in the original version depends on the training set. If so, we need a routine that can map any new sequence into the same feature space. I made some quick changes in the function "load_processed_biosensor_pablo", but these need to be improved.

The best solution would be to make the sequence encoding independent of the occurrences of each amino acid in the training set. Ideally, we want something like what is done in sklearn (see, for instance, how the PCA transformation is fitted in sklearn.decomposition.PCA and then applied to new data), so that we can transform any arbitrary input set. Just a simple solution, nothing too complicated. We don't have that issue for chemicals because we are using folded fingerprints.
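For illustration, here is a rough sketch of what I have in mind (class and variable names are placeholders, not the current repo code): a fixed 20+1 amino-acid vocabulary with a transform() method in the spirit of sklearn, so the index map no longer depends on which residues happen to appear in the training set.

from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
UNKNOWN = "X"                          # catch-all for anything non-standard


class FixedSequenceEncoder:
    def __init__(self, max_len: int):
        self.max_len = max_len
        # Index 0 is reserved for padding; 1..21 cover the fixed vocabulary.
        self.d_to_index = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS + UNKNOWN)}

    def transform(self, sequences: List[str]) -> List[List[int]]:
        unk = self.d_to_index[UNKNOWN]
        encoded = []
        for seq in sequences:
            idx = [self.d_to_index.get(aa, unk) for aa in seq[: self.max_len]]
            idx += [0] * (self.max_len - len(idx))  # zero-pad shorter sequences
            encoded.append(idx)
        return encoded

With something like this no fitting is needed at all: transform() produces the same encoding for any input set.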

In my version of the code I am passing the previously stored d_to_index mapping of the model when creating the test set, so that the sequences are hopefully encoded consistently, but this still needs to be verified:

Test_seq, Test_chemical = bn.load_processed_biosensor_pablo(cvtest, d_to_index)
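A quick sanity check along these lines (purely illustrative; test_sequences stands for the raw test-set strings, which are not shown here) would be to confirm that every residue in the test set is already a key of the stored d_to_index:

# Hypothetical check: any residue missing from d_to_index would end up
# encoded inconsistently with the training set.
missing = {aa for seq in test_sequences for aa in seq if aa not in d_to_index}
if missing:
    print("Out-of-vocabulary residues in the test set:", sorted(missing))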
dr413677671 commented 5 years ago

Sorry about that, I only found out about this issue recently. I think it might not be a problem because the vocabulary contains 21 amino acids: the 20 standard ones plus X (I am not sure what it stands for). Each sequence is then embedded and padded to the same length of 1408, with 0s automatically appended when a sequence is shorter than 1408.
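Roughly like this (illustrative only, not the exact repo code; I am assuming Keras-style padding here):

from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab = "ACDEFGHIKLMNPQRSTVWYX"  # 20 standard amino acids + X
d_to_index = {aa: i + 1 for i, aa in enumerate(vocab)}  # 0 reserved for padding

seqs = ["MKV", "MKVLAX"]
encoded = [[d_to_index[aa] for aa in s] for s in seqs]
padded = pad_sequences(encoded, maxlen=1408, padding="post", value=0)
print(padded.shape)  # (2, 1408); shorter sequences are filled with 0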