dna2vec: Consistent vector representations of variable-length k-mers
Pre-image / component mapping? #3

Closed: stanleyjs closed 7 years ago

stanleyjs commented 7 years ago

Hi, can you please make it explicit how to obtain a pre-image from a mapped vector? Additionally, can you explain how the components v_j of the vectors in V are related to the sequence components s_i in the sequence space S?

pnpnpn commented 7 years ago

When you say "pre-image", I assume you mean the k-mer string. The short answer is to use gensim's similar_by_vector function. You can call it with mk_model.model(k_len_of_interest).similar_by_vector(your_vector).

Let me know whether that is what you are looking for. I will add more details in the README for that.
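
As a rough illustration of what a lookup like similar_by_vector does, here is a minimal sketch that ranks k-mers by cosine similarity to a query vector. It does not use dna2vec or gensim: the `cosine` and `nearest_kmers` helpers and the 3-dimensional toy embeddings are made up for this example, and the real model works on much higher-dimensional vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_kmers(query, embeddings, topn=3):
    """Return the topn (kmer, similarity) pairs closest to query.

    Hypothetical helper: `embeddings` is a plain dict mapping
    k-mer strings to vectors, standing in for a trained model.
    """
    scored = [(kmer, cosine(query, vec)) for kmer, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Toy embeddings, invented for illustration only.
toy_embeddings = {
    "AAA": [1.0, 0.0, 0.0],
    "AAC": [0.9, 0.1, 0.0],
    "GGT": [0.0, 0.0, 1.0],
}

# The k-mer whose vector points most nearly in the query's direction.
best_kmer, best_score = nearest_kmers([0.0, 0.1, 0.9], toy_embeddings, topn=1)[0]
print(best_kmer)
```

Note that a nearest-neighbour lookup returns the closest k-mers in the vocabulary, so an arbitrary vector always maps to some k-mer even if it was never the image of one.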

stanleyjs commented 7 years ago

Yes, that is what I meant. Thank you! I hadn't seen the gensim page, this will be very helpful.

Additionally, I noticed that the pre-trained data set was missing some k-mers I wanted to vectorize. Is this just a function of the size of the training data? Ideally I would like to train on my target genome and have every possible k-mer in the dna2vec keys.

stanleyjs commented 7 years ago

Also, to come back to the component mappings: is there some s_i -> {v_j, ..., v_k} correspondence such that weighted Minkowski distances (for example) could be used in downstream analysis? Or does each component of the original sequence s_1, ..., s_k contribute to every component of the vectorized sequence v_1, ..., v_100?

pnpnpn commented 7 years ago

The pre-trained data set should cover all k-mers for 3 <= k <= 8. What do you get when you try to run the following?

>>> [len(mk_model.model(k).vocab) for k in range(3,9)]
[64, 256, 1024, 4096, 16384, 65536]
>>> [4**k for k in range(3,9)]
[64, 256, 1024, 4096, 16384, 65536]
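
The size check above confirms the counts match 4**k. To see which specific k-mers (if any) are absent, one option is to enumerate every possible k-mer and test membership. The sketch below is self-contained: `all_kmers` and `missing_kmers` are hypothetical helpers, and `toy_vocab` is a plain set standing in for a model vocabulary, with one 3-mer removed on purpose.

```python
from itertools import product

def all_kmers(k, alphabet="ACGT"):
    """Yield every possible k-mer over the alphabet (4**k strings for DNA)."""
    return ("".join(letters) for letters in product(alphabet, repeat=k))

def missing_kmers(vocab, k):
    """List the k-mers of length k that are not present in vocab.

    `vocab` can be any container that supports `in`, such as a set or dict.
    """
    return [kmer for kmer in all_kmers(k) if kmer not in vocab]

# Toy vocabulary: all 3-mers except one, removed deliberately.
toy_vocab = set(all_kmers(3)) - {"CTA"}

print(len(toy_vocab), missing_kmers(toy_vocab, 3))
```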

stanleyjs commented 7 years ago

This is odd. My result is the same as yours, yet when I ran my code yesterday I received a key error for the 4-tuple 'CTAA'.

Today when I tried mk_model.vector('CTAA') I did get a vector. Strange.

stanleyjs commented 7 years ago

It seems to be related to something I'm missing about generator expressions. I was running

vectors = list(map(dna_2_vec.vector, (ele for ele in (lst for lst in map(lambda x: list(chunkstring(x,4)), targets)))))

where I'm essentially breaking up a list of longer sequences into 4-mers, then yielding each of those lists, and then passing each element of each list to model.vector(x).

Doing something like this works:

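import numpy as np

# chunkstring, targets, and dna_2_vec are defined elsewhere
vectors = map(lambda x: list(chunkstring(x, 4)), targets)
output = []
for lst in vectors:
    sublst = np.array([])
    for ele in lst:
        # concatenate the vector for this 4-mer onto the running array
        sublst = np.append(sublst, dna_2_vec.vector(ele))
    # assumes five 4-mers of 100 dimensions each per target
    sublst = np.reshape(np.array(sublst), (500,))
    output.append(sublst)
output_vectors = np.array(output)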
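
A note on the one-liner: wrapping one generator in another, as in (ele for ele in (lst for lst in ...)), does not flatten anything. The outer generator just re-yields each inner list, so the lookup function receives a whole list instead of a single 4-mer string, which may be what triggered the key error. Flattening needs two for clauses in the same expression. The sketch below shows the difference with a hypothetical `chunkstring` stand-in (the original helper is not shown in this thread) and made-up targets.

```python
def chunkstring(seq, k):
    """Hypothetical stand-in: split seq into consecutive chunks of length k."""
    return (seq[i:i + k] for i in range(0, len(seq), k))

# Made-up example sequences.
targets = ["CTAAGGTC", "ACGTACGT"]

# Nested generators: each item is still a list of chunks, not a chunk.
nested = list(
    ele for ele in (lst for lst in map(lambda x: list(chunkstring(x, 4)), targets))
)

# Two for clauses in one comprehension: each item is a single 4-mer string.
flat = [ele for seq in targets for ele in chunkstring(seq, 4)]

print(nested)
print(flat)
```

With the flat form, each element can be passed to the vector lookup directly, for example via map or a list comprehension.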

