Open ashenflower opened 3 years ago
I have the same question, did you get any reply?
No, unfortunately... I guess you should sum the embeddings of all the k-mers in a read to get its final embedding.
Did it work to sum the embeddings of different k-mers?
Sorry for the delay! I didn't try it any further, but I think it makes sense; it would be similar to working with word2vec.
I think you need to do something like average pooling over the k-mer vectors of a sequence, maybe like this:
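import numpy as np
from dna2vec.multi_k_model import MultiKModel

filepath = 'pretrained/dna2vec-20161219-0153-k3to8-100d-10c-29320Mbp-sliding-Xat.w2v'
mk_model = MultiKModel(filepath)

def get_kmer(seq, k):
    # assumed extraction: overlapping k-mers from a sliding window with stride 1
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

seq = "AGCTACG......"  # placeholder; substitute a real A/C/G/T sequence
# look up the embedding of every 3-mer in the sequence
vecs = [np.array(mk_model.vector(kmer)) for kmer in get_kmer(seq, 3)]
# average-pool into one 100-dim vector, whatever the sequence length
vec_pool = np.mean(vecs, axis=0)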
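To compare the two pooling ideas from this thread (summing vs. averaging) without loading the pretrained model, here is a minimal self-contained sketch. The names `get_kmers`, `pool_kmer_vectors`, and `toy_lookup` are hypothetical and not part of dna2vec; the random 4-dim table only stands in for real embeddings, and skipping k-mers that are missing from the table is an assumption, not something dna2vec does for you.

```python
import numpy as np

def get_kmers(seq, k):
    """Return all overlapping k-mers of seq (sliding window, stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def pool_kmer_vectors(seq, k, lookup, dim, mode="mean"):
    """Pool the k-mer vectors of seq into one fixed-size vector.

    lookup is any dict-like mapping k-mer -> vector (a hypothetical
    stand-in for a pretrained model). k-mers missing from lookup are
    skipped; if none are found, a zero vector of length dim is returned.
    """
    vecs = [np.asarray(lookup[kmer]) for kmer in get_kmers(seq, k) if kmer in lookup]
    if not vecs:
        return np.zeros(dim)
    stacked = np.stack(vecs)
    return stacked.sum(axis=0) if mode == "sum" else stacked.mean(axis=0)

# Toy embedding table: a random 4-dim vector for every 3-mer over A/C/G/T.
rng = np.random.default_rng(0)
bases = "ACGT"
toy_lookup = {a + b + c: rng.normal(size=4) for a in bases for b in bases for c in bases}

read = "AGCTACG"
mean_vec = pool_kmer_vectors(read, 3, toy_lookup, dim=4, mode="mean")
sum_vec = pool_kmer_vectors(read, 3, toy_lookup, dim=4, mode="sum")
```

The sum grows with the number of k-mers in the read, while the mean does not, so the mean is usually easier to compare across reads of different lengths. Both discard the order of the k-mers.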
Is there an existing function to encode longer sequences (such as sequencing reads) using their k-mer embeddings?