pnpnpn / dna2vec

dna2vec: Consistent vector representations of variable-length k-mers
MIT License

Encoding longer sequences #21

Open ashenflower opened 3 years ago

ashenflower commented 3 years ago

Is there an existing function to encode longer sequences (such as sequencing reads) from their k-mer embeddings?

luciabarb commented 3 years ago

I have the same question, did you get any reply?

ashenflower commented 3 years ago

> I have the same question, did you get any reply?

No, unfortunately. I guess you could sum the embeddings of all the k-mers of a read to get its final embedding.

EspinosaLeal commented 2 years ago

Did it work to sum the embeddings of different k-mers?

ashenflower commented 2 years ago

> Did it work to sum the embeddings of different k-mers?

I'm sorry for the delay! I didn't pursue it any further, but I think it makes sense; it would be similar to working with word2vec.
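To make the summing idea concrete, here is a minimal sketch that does not use dna2vec at all. `TOY_EMBEDDINGS` is a made-up lookup table with invented 2-dimensional values, and `kmers`, `sum_pool` and `mean_pool` are hypothetical helper names, so treat it as an illustration of the idea and not as anything the library provides.

```python
# Toy 2-dimensional embeddings for a few 3-mers (made-up values, not from dna2vec).
TOY_EMBEDDINGS = {
    "AGC": [1.0, 0.0],
    "GCT": [0.0, 1.0],
    "CTA": [1.0, 1.0],
}

def kmers(seq, k):
    """Return all overlapping k-mers of seq (sliding window, stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def sum_pool(seq, k, table):
    """Element-wise sum of the k-mer vectors of seq."""
    vectors = [table[kmer] for kmer in kmers(seq, k)]
    return [sum(column) for column in zip(*vectors)]

def mean_pool(seq, k, table):
    """Element-wise mean of the k-mer vectors of seq."""
    n = len(kmers(seq, k))
    return [total / n for total in sum_pool(seq, k, table)]

read = "AGCTA"  # k-mers: AGC, GCT, CTA
print(sum_pool(read, 3, TOY_EMBEDDINGS))   # [2.0, 2.0]
print(mean_pool(read, 3, TOY_EMBEDDINGS))  # each component is 2/3
```

The sum grows with the number of k-mers, so two reads of different lengths get vectors of different magnitude even if their composition is similar. Dividing by the k-mer count removes that length dependence, which is probably what you want when comparing reads.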

eternal-bug commented 1 year ago

I think you need to do something like average pooling over a sequence's k-mer vectors, maybe like this:

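import numpy as np
from dna2vec.multi_k_model import MultiKModel

# Load the pretrained multi-k model (k = 3 to 8, 100 dimensions)
filepath = 'pretrained/dna2vec-20161219-0153-k3to8-100d-10c-29320Mbp-sliding-Xat.w2v'
mk_model = MultiKModel(filepath)

def get_kmer(seq, k):
    # Return every overlapping k-mer of length k (sliding window, stride 1)
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

seq = "AGCTACG......"  # placeholder: substitute a real read

# Look up the embedding of each 3-mer in the sequence
vecs = [np.array(mk_model.vector(kmer)) for kmer in get_kmer(seq, 3)]
# Average over k-mers to get one 100-dim vector regardless of sequence length
vec_pool = np.mean(vecs, axis=0)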
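
One practical wrinkle with real reads is that some k-mers may not be in the model's vocabulary, for example when they contain an ambiguous base such as N. Here is a hedged sketch of average pooling that skips those k-mers. `mean_pool_read` is a hypothetical helper, and `lookup` stands for any callable that maps a k-mer to a vector and raises `KeyError` for unknown k-mers. I am assuming `mk_model.vector` behaves that way but have not verified it, so the demo uses a toy dictionary with made-up values instead.

```python
import numpy as np

def mean_pool_read(seq, k, lookup):
    """Average the vectors of all k-mers of seq that lookup can resolve.

    lookup is any callable mapping a k-mer string to a vector and raising
    KeyError for unknown k-mers (assumed behaviour, not verified for dna2vec).
    Returns None when no k-mer could be embedded.
    """
    vecs = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        try:
            vecs.append(np.asarray(lookup(kmer), dtype=float))
        except KeyError:
            continue  # skip k-mers the lookup does not know, e.g. ones containing N
    if not vecs:
        return None
    return np.mean(vecs, axis=0)

# Toy stand-in for the model: made-up 2-d vectors for two 3-mers only.
toy = {"AGC": [1.0, 3.0], "CTA": [3.0, 1.0]}
vec = mean_pool_read("AGCNCTA", 3, toy.__getitem__)
print(vec)  # mean of the AGC and CTA vectors; the three k-mers containing N are skipped
```

Returning `None` for reads that are shorter than k, or that have no known k-mers, leaves it to the caller to decide whether to drop such reads or substitute a zero vector.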