pnpnpn / dna2vec

dna2vec: Consistent vector representations of variable-length k-mers
MIT License

Pre-image / component mapping? #3

Closed stanleyjs closed 7 years ago

stanleyjs commented 7 years ago

Hi, can you please make it explicit how to obtain a pre-image from a mapped vector? Additionally, can you explain how the components v_j of the vectors in V are related to the sequence components s_i in the sequence space S?

Best wishes

pnpnpn commented 7 years ago

When you say "pre-image", I assume you mean the k-mer string. The short answer is to use the similar_by_vector function: https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.similar_by_vector . You can call it via mk_model.model(k_len_of_interest).similar_by_vector(your_vector).

Let me know whether that is what you are looking for. I will add more details in the README for that.
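
To make that concrete, here is a minimal sketch of the round trip. The import and the pretrained file path are assumptions based on the repo layout, not verified here, and the path is purely illustrative:

from dna2vec.multi_k_model import MultiKModel

# illustrative path to the pretrained dna2vec embedding file (assumption)
mk_model = MultiKModel('pretrained/dna2vec-pretrained.w2v')

vec = mk_model.vector('CTAA')                      # 100-d embedding of a 4-mer
# map a vector back to its nearest k-mer(s) -- the "pre-image"
print(mk_model.model(4).similar_by_vector(vec, topn=3))
# the top hit should be ('CTAA', ~1.0)

Note that the k passed to model(k) has to match the length of the k-mer you are querying for, since each k gets its own vocabulary.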

stanleyjs commented 7 years ago

Yes, that is what I meant. Thank you! I hadn't seen the gensim page, this will be very helpful.

Additionally, I noticed that the pre-trained data set was missing some k-mers I wanted to vectorize. Is this just a function of the size of the training data? Ideally I would like to train on my target genome and have every possible k-mer appear in the dna2vec keys.

stanleyjs commented 7 years ago

Also, to come back to the component mappings: is there some s_i -> {v_j, ..., v_k} correspondence such that weighted Minkowski distances (for example) could be used in downstream analysis? Or does each component of the original sequence s_1, ..., s_k contribute to every component of the vectorized sequence v_1, ..., v_100?

pnpnpn commented 7 years ago

The pre-trained data set should cover all k-mers for 3 <= k <= 8. What do you get when you try to run the following?

>>> [len(mk_model.model(k).vocab) for k in range(3,9)]
[64, 256, 1024, 4096, 16384, 65536]
>>> [4**k for k in range(3,9)]
[64, 256, 1024, 4096, 16384, 65536]

stanleyjs commented 7 years ago

This is odd: my result is the same as yours, yet when I ran my code yesterday I received a KeyError for the 4-mer 'CTAA'.

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-118-955472f6af60> in <module>()
----> 1 np.array(list(vectors))

/home/jay/dna2vec/dna2vec/multi_k_model.py in vector(self, vocab)
     32 
     33     def vector(self, vocab):
---> 34         return self.data[len(vocab)].model[vocab]
     35 
     36     def unitvec(self, vec):

/home/jay/.local/lib/python3.5/site-packages/gensim/models/word2vec.py in __getitem__(self, words)
   1504             return self.syn0[self.vocab[words].index]
   1505 
-> 1506         return vstack([self.syn0[self.vocab[word].index] for word in words])
   1507 
   1508     def __contains__(self, word):

/home/jay/.local/lib/python3.5/site-packages/gensim/models/word2vec.py in <listcomp>(.0)
   1504             return self.syn0[self.vocab[words].index]
   1505 
-> 1506         return vstack([self.syn0[self.vocab[word].index] for word in words])
   1507 
   1508     def __contains__(self, word):

KeyError: 'CTAA'

Today, when I tried mk_model.vector('CTAA'), I did get a vector. Strange.

stanleyjs commented 7 years ago

It seemed to be related to something I was missing about generator comprehensions. I was calling

vectors = list(map(dna_2_vec.vector, (ele for ele in (lst for lst in map(lambda x: list(chunkstring(x, 4)), targets)))))

where I'm essentially breaking each longer sequence into 4-mers, then trying to yield each of those sets, and then pass each element of the set to model.vector(x).
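
A likely explanation (an assumption based on the traceback above, not verified against the original data): the nested comprehension (ele for ele in (lst for lst in ...)) yields the inner lists themselves rather than their elements, so dna_2_vec.vector receives a whole list of 4-mers; len(vocab) then selects a model by the number of chunks instead of by k, and gensim looks the 4-mers up in the wrong vocabulary. A minimal sketch of the difference, using a toy stand-in for the chunked targets:

chunks = [['CTAA', 'GGCT'], ['ATCG', 'TTAA']]   # toy stand-in for the chunked targets

nested = (ele for ele in (lst for lst in chunks))
print(next(nested))        # ['CTAA', 'GGCT'] -- a whole list, not a single 4-mer

flat = (ele for lst in chunks for ele in lst)   # flattening comprehension
print(next(flat))          # 'CTAA' -- one k-mer at a time, as intended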

Doing something like this works:

import numpy as np

# break each target sequence into 4-mers
vectors = map(lambda x: list(chunkstring(x, 4)), targets)
output = []
for lst in vectors:
    sublst = np.array([])
    # embed each 4-mer and concatenate the 100-d vectors
    for ele in lst:
        sublst = np.append(sublst, dna_2_vec.vector(ele))
    # flatten into a single fixed-length vector (here 5 chunks x 100 dims)
    sublst = np.reshape(np.array(sublst), (500,))
    output.append(sublst)
output_vectors = np.array(output)
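
For reference, a more compact equivalent (a sketch that assumes the same chunkstring, dna_2_vec, and targets as above, and that every target splits into the same number of 4-mers):

# one row per target: the concatenated embeddings of its 4-mers
output_vectors = np.array([
    np.concatenate([dna_2_vec.vector(ele) for ele in chunkstring(seq, 4)])
    for seq in targets
])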

Feel free to mark this issue resolved.