stanleyjs closed this issue 7 years ago
When you say "pre-image", I assume you mean the k-mer string. The short answer is to use the similar_by_vector function: https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.similar_by_vector . You can call it via mk_model.model(k_len_of_interest).similar_by_vector(your_vector).
Let me know whether that is what you are looking for. I will add more details about it to the README.
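For intuition, similar_by_vector essentially ranks the vocabulary by cosine similarity to the query vector. Here is a minimal sketch of that ranking with a toy, made-up vocabulary (these are not real dna2vec vectors, and the function below is a stand-in, not gensim's implementation):

```python
import numpy as np

# Toy k-mer vectors; in practice these would come from the trained model.
vocab = {
    "CTAA": np.array([0.9, 0.1, 0.0]),
    "CTAG": np.array([0.8, 0.2, 0.1]),
    "GGGG": np.array([0.0, 1.0, 0.0]),
}

def similar_by_vector(query, vocab, topn=2):
    """Return the topn (k-mer, cosine similarity) pairs closest to query."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = [(kmer, cos(query, vec)) for kmer, vec in vocab.items()]
    return sorted(scored, key=lambda t: -t[1])[:topn]

print(similar_by_vector(np.array([0.9, 0.1, 0.0]), vocab))
```

So recovering a "pre-image" of a mapped vector amounts to a nearest-neighbor search over the k-mer vocabulary; the top hit is the most plausible pre-image, not necessarily an exact one.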
Yes, that is what I meant. Thank you! I hadn't seen the gensim page; this will be very helpful.
Additionally, I noticed that the pre-trained data set was missing some k-mers I wanted to vectorize. Is this just a function of the size of the training data? Ideally I would like to train on my target genome and have every possible k-mer among the dna2vec keys.
Also, coming back to the component mappings: is there some s_i -> {v_j, ..., v_k} correspondence such that weighted Minkowski distances (for example) could be used in downstream analysis? Or does each component of the original sequence s_1, ..., s_k contribute to every component of the vectorized sequence v_1, ..., v_100?
The pre-trained data set should cover all k-mers for 3 <= k <= 8. What do you get when you try to run the following?
>>> [len(mk_model.model(k).vocab) for k in range(3,9)]
[64, 256, 1024, 4096, 16384, 65536]
>>> [4**k for k in range(3,9)]
[64, 256, 1024, 4096, 16384, 65536]
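Since there are 4^k possible k-mers over the alphabet {A, C, G, T}, you can enumerate them all and diff against a vocabulary to find anything missing. A small sketch (the commented-out mk_model lookup is hypothetical, following the names used in this thread):

```python
from itertools import product

def all_kmers(k, alphabet="ACGT"):
    """Enumerate every k-mer over the given alphabet (4**k strings for DNA)."""
    return [''.join(p) for p in product(alphabet, repeat=k)]

# Sanity check: counts match the 4**k values printed above.
assert len(all_kmers(3)) == 64
assert len(all_kmers(4)) == 256

# Hypothetical coverage check against a model vocabulary:
# missing = [kmer for kmer in all_kmers(4) if kmer not in mk_model.model(4).vocab]
```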
This is odd: my result is the same as yours, yet when I ran my code yesterday I got a KeyError for the 4-mer 'CTAA'.
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-118-955472f6af60> in <module>()
----> 1 np.array(list(vectors))
/home/jay/dna2vec/dna2vec/multi_k_model.py in vector(self, vocab)
32
33 def vector(self, vocab):
---> 34 return self.data[len(vocab)].model[vocab]
35
36 def unitvec(self, vec):
/home/jay/.local/lib/python3.5/site-packages/gensim/models/word2vec.py in __getitem__(self, words)
1504 return self.syn0[self.vocab[words].index]
1505
-> 1506 return vstack([self.syn0[self.vocab[word].index] for word in words])
1507
1508 def __contains__(self, word):
/home/jay/.local/lib/python3.5/site-packages/gensim/models/word2vec.py in <listcomp>(.0)
1504 return self.syn0[self.vocab[words].index]
1505
-> 1506 return vstack([self.syn0[self.vocab[word].index] for word in words])
1507
1508 def __contains__(self, word):
KeyError: 'CTAA'
Today, when I tried mk_model.vector('CTAA'), I did get a vector. Strange.
It seemed to be related to something I was missing about generator comprehensions:
vectors = list(map(dna_2_vec.vector, (ele for ele in (lst for lst in map(lambda x: list(chunkstring(x,4)), targets)))))
where I'm essentially breaking a list of longer sequences into 4-mers, then yielding each of those sets, and then each element of the set, to model.vector(x).
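One likely explanation (my reading of the traceback, with chunkstring stubbed in as a hypothetical helper): the nested generator expression never actually flattens anything. Iterating `ele for ele in (lst for lst in ...)` still yields whole lists of 4-mers, so vector() receives a list, and self.data[len(vocab)] then selects a model by the list's length rather than by k. A sequence that splits into exactly three 4-mers would select the 3-mer model, where 'CTAA' is of course missing, which matches the KeyError above. A minimal demonstration plus a genuinely flattening comprehension:

```python
# Hypothetical helper matching the usage in this thread: split a string
# into fixed-size chunks.
def chunkstring(s, n):
    return (s[i:i + n] for i in range(0, len(s), n))

targets = ["CTAAGGGGCTAA"]  # one 12-base sequence -> three 4-mers

# The nested generator expression does NOT flatten: each `ele` is still
# a whole list of 4-mers, not an individual 4-mer string.
items = list(ele for ele in (lst for lst in map(lambda x: list(chunkstring(x, 4)), targets)))
assert items == [["CTAA", "GGGG", "CTAA"]]  # lists, not strings

# A nested comprehension with two `for` clauses flattens properly,
# yielding the individual 4-mer strings:
flat = [kmer for lst in map(lambda x: list(chunkstring(x, 4)), targets) for kmer in lst]
assert flat == ["CTAA", "GGGG", "CTAA"]
```

With `flat`, each item passed to vector() is a single 4-mer string, so len(vocab) is 4 and the 4-mer model is selected as intended.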
Doing something like this works:
vectors = map(lambda x: list(chunkstring(x, 4)), targets)
output = []
for lst in vectors:
    sublst = np.array([])
    for ele in lst:
        sublst = np.append(sublst, dna_2_vec.vector(ele))
    sublst = np.reshape(np.array(sublst), (500,))
    output.append(sublst)
output_vectors = np.array(output)
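The append-in-a-loop pattern can be tightened up with np.concatenate, since np.append copies the growing array on every call. A sketch under the sizes implied above (100-dim vectors, five 4-mers per 20-base sequence, hence 500 entries per row); chunkstring and vector are stand-ins here, not the real dna2vec functions:

```python
import numpy as np

def chunkstring(s, n):
    # hypothetical helper from the thread: fixed-size chunks of a string
    return [s[i:i + n] for i in range(0, len(s), n)]

def vector(kmer):
    # stub standing in for dna_2_vec.vector: a deterministic 100-dim vector
    return np.full(100, sum(map(ord, kmer)) / 100.0)

targets = ["ACGTACGTACGTACGTACGT", "CTAAGGGGCTAACTAAGGGG"]  # 20 bases each

# Each row: the five 4-mer vectors of one sequence, concatenated -> (500,)
output_vectors = np.array([
    np.concatenate([vector(kmer) for kmer in chunkstring(seq, 4)])
    for seq in targets
])
print(output_vectors.shape)  # (2, 500)
```

This builds each 500-dim row in one concatenation instead of repeatedly reallocating, and drops the explicit reshape.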
Feel free to mark this issue resolved.
Hi, can you please make it explicit how to obtain a pre-image from a mapped vector? Additionally, can you explain how the components v_j of the vectors in V relate to the sequence components s_i in the sequence space S?
Best wishes