pablocarb / Latent

ML tools for synbio
MIT License

Sequence encoding #1

Open pablocarb opened 6 years ago

pablocarb commented 6 years ago

It seems that the encoding in the original version depends on the training set. If so, we need a routine that can map any new sequence into the same feature space. I made some quick changes in the function "load_processed_biosensor_pablo", but these need to be improved.

The best solution would be to make the sequence encoding independent of the occurrences of each amino acid in the training set. Ideally, we want something like what is done in sklearn (see, for instance, how the PCA transformation is fitted in sklearn.decomposition.PCA and then applied to new data), so that we can transform any arbitrary input set. Just a simple solution, nothing too complicated. We don't have that issue for chemicals because we are using folded fingerprints.
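For illustration, here is a rough sketch of what I have in mind (class and variable names are placeholders, not the current repo code): a fixed 20+1 amino-acid vocabulary with a transform() method in the spirit of sklearn, so the index map no longer depends on which residues happen to appear in the training set.

from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
UNKNOWN = "X"                          # catch-all for anything non-standard


class FixedSequenceEncoder:
    def __init__(self, max_len: int):
        self.max_len = max_len
        # Index 0 is reserved for padding; 1..21 cover the fixed vocabulary.
        self.d_to_index = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS + UNKNOWN)}

    def transform(self, sequences: List[str]) -> List[List[int]]:
        unk = self.d_to_index[UNKNOWN]
        encoded = []
        for seq in sequences:
            idx = [self.d_to_index.get(aa, unk) for aa in seq[: self.max_len]]
            idx += [0] * (self.max_len - len(idx))  # zero-pad shorter sequences
            encoded.append(idx)
        return encoded

With something like this no fitting is needed at all: transform() produces the same encoding for any input set.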

In my version of the code I am passing the previously stored d_to_index mapping of the model when creating the test set, so that the sequences are hopefully encoded consistently, but this still needs to be verified:

Test_seq, Test_chemical = bn.load_processed_biosensor_pablo(cvtest, d_to_index)
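A quick sanity check along these lines (purely illustrative; test_sequences stands for the raw test-set strings, which are not shown here) would be to confirm that every residue in the test set is already a key of the stored d_to_index:

# Hypothetical check: any residue missing from d_to_index would end up
# encoded inconsistently with the training set.
missing = {aa for seq in test_sequences for aa in seq if aa not in d_to_index}
if missing:
    print("Out-of-vocabulary residues in the test set:", sorted(missing))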
dr413677671 commented 5 years ago

Sorry about that, I only found out about this issue recently. I think it might not be a problem because the vocabulary contains 21 amino acids: the 20 standard ones plus X (I am not sure what it stands for). Each sequence is then embedded and padded to the same length of 1408, with 0s automatically appended when a sequence is shorter than 1408.
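Roughly like this (illustrative only, not the exact repo code; I am assuming Keras-style padding here):

from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab = "ACDEFGHIKLMNPQRSTVWYX"  # 20 standard amino acids + X
d_to_index = {aa: i + 1 for i, aa in enumerate(vocab)}  # 0 reserved for padding

seqs = ["MKV", "MKVLAX"]
encoded = [[d_to_index[aa] for aa in s] for s in seqs]
padded = pad_sequences(encoded, maxlen=1408, padding="post", value=0)
print(padded.shape)  # (2, 1408); shorter sequences are filled with 0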