pnpnpn / dna2vec

dna2vec: Consistent vector representations of variable-length k-mers
MIT License

Incorrect embedding dimension after training #26

Open BinchaoPeng opened 2 years ago

BinchaoPeng commented 2 years ago

I want to use dna2vec on the E. coli genome. With 2<=k<=8 I got an embedding matrix of shape (86479,100); with 3<=k<=8 I got (86614,100). The correct shape for 3<=k<=8 should be (87360,100), since $87360=4^3+4^4+4^5+4^6+4^7+4^8$, and for 2<=k<=8 it should be (87376,100), since $87360+16=4^2+4^3+4^4+4^5+4^6+4^7+4^8$. So I don't know why I got two different results, neither of which is complete. I also checked every k from 2 to 8 and found that the shape is correct for k from 2 to 7. However, for k=8 the shape is (64450,100) rather than (65536,100), and $65536-64450 \neq 87360-86614$. This is horrible! Nothing matches.
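For reference, the expected vocabulary sizes quoted above can be checked with a few lines of Python. This is only an arithmetic sketch; the helper name `full_vocab_size` is made up for illustration and is not part of dna2vec.

```python
def full_vocab_size(k_low, k_high, alphabet_size=4):
    """Number of distinct k-mers over the alphabet for k_low <= k <= k_high."""
    return sum(alphabet_size ** k for k in range(k_low, k_high + 1))


# Expected sizes if every possible k-mer were present in the embedding.
expected_2_8 = full_vocab_size(2, 8)  # 87376
expected_3_8 = full_vocab_size(3, 8)  # 87360
expected_8 = full_vocab_size(8, 8)    # 65536

# Number of k-mers missing from each reported embedding matrix.
missing_2_8 = expected_2_8 - 86479  # 897
missing_3_8 = expected_3_8 - 86614  # 746
missing_8 = expected_8 - 64450      # 1086

print(expected_2_8, expected_3_8, expected_8)
print(missing_2_8, missing_3_8, missing_8)
```

The three runs are missing 897, 746 and 1086 k-mers respectively, so the shortfall differs from run to run rather than being a fixed offset.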

BinchaoPeng commented 2 years ago

@pnpnpn could you please take some time to help me? This is important for me, thanks! The E. coli genome can be downloaded from https://regulondb.ccg.unam.mx/menu/download/datasets/files/Gene_sequence.txt. The config:

inputs: inputs/E_coli_K12/*.txt
k-low: 2
k-high: 8
vec-dim: 100
epoch: 10
context: 5
out-dir: results/E_coli/

BinchaoPeng commented 2 years ago

Today I compared the k-mers present in the embedding with the complete k-mer composition for 3<=k<=8. I found two kinds of differences:

  1. some k-mers occur only at low frequency;
  2. some k-mers do not occur at all.

So I'd like to know whether there is a better way to handle this situation when training the embedding, since some k-mers end up missing from it. Thanks! @pnpnpn @aldro61 @alevenberg
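A comparison like the one described above can be sketched as follows. This is a minimal illustration, not dna2vec code: `all_kmers` and `missing_kmers` are hypothetical helpers, and `toy_vocab` stands in for the list of k-mers actually stored in a trained embedding.

```python
from itertools import product


def all_kmers(k_low, k_high, alphabet="ACGT"):
    """Every possible k-mer over the alphabet for k_low <= k <= k_high."""
    return {
        "".join(letters)
        for k in range(k_low, k_high + 1)
        for letters in product(alphabet, repeat=k)
    }


def missing_kmers(vocab, k_low, k_high):
    """K-mers of the complete composition that are absent from vocab."""
    return all_kmers(k_low, k_high) - set(vocab)


# Toy stand-in for the k-mers found in an embedding (assumption: k = 2 only).
toy_vocab = ["AA", "AC", "AG", "AT", "CA"]
missing = missing_kmers(toy_vocab, 2, 2)
print(len(missing))  # 11 of the 16 possible 2-mers are absent
```

Grouping the missing k-mers by length then shows which values of k are affected.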

BinchaoPeng commented 2 years ago

I found that this seems to be related to the min_count parameter. But why don't the first dimensions of the embedding matrices obtained with 2<=k<=8 and 3<=k<=8 differ by 16?
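To illustrate how such a threshold shrinks the vocabulary, here is a small sketch. It assumes min_count behaves like the gensim Word2Vec parameter of the same name, which drops tokens whose total count is below the threshold. `surviving_vocab` and the toy sentences are invented for illustration.

```python
from collections import Counter


def surviving_vocab(sentences, min_count):
    """Tokens whose total count across all sentences is at least min_count."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, n in counts.items() if n >= min_count}


# Toy "sentences" of k-mers: ACG occurs 3 times, TT twice, GGA once.
toy_sentences = [["ACG", "TT", "ACG"], ["TT", "GGA", "ACG"]]

print(sorted(surviving_vocab(toy_sentences, 1)))  # ['ACG', 'GGA', 'TT']
print(sorted(surviving_vocab(toy_sentences, 2)))  # ['ACG', 'TT']
```

Under this assumption, rare k-mers are removed even though they were sampled, which would account for some of the missing rows but not for k-mers that were never sampled at all.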

eternal-bug commented 1 year ago

Maybe your problem is related to the code below. If you read the source, you will find that k-mers are not extracted exhaustively from the start to the end of a sequence. Instead, the sequence is cut into chunks of random length:

generators.py

    @staticmethod
    def random_chunks(rng, li, min_chunk, max_chunk):
        """
        Both min_chunk and max_chunk are inclusive
        """
        it = iter(li)
        while True:
            head_it = islice(it, rng.randint(min_chunk, max_chunk + 1))
            nxt = ''.join(head_it)

            # throw out chunks that are not within the kmer range
            if len(nxt) >= min_chunk:
                yield nxt
            else:
                break

Because the human genome is relatively large, random sampling is still likely to produce every possible k-mer. The E. coli genome is much smaller, so far fewer chunks are sampled and some k-mers may never be drawn. That would explain the missing k-mers you described earlier.
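The effect can be reproduced on a toy sequence. The sketch below is a standalone re-implementation of the chunking idea, not the dna2vec code itself: it uses the standard-library `random.Random` (whose `randint` includes both endpoints) instead of the `rng` object used in generators.py, and the 20000-base random "genome" is an assumption chosen only for illustration.

```python
import random
from itertools import islice


def random_chunks(rng, seq, min_chunk, max_chunk):
    """Cut seq into consecutive, non-overlapping chunks of random length
    in [min_chunk, max_chunk]; a too-short trailing chunk is dropped."""
    it = iter(seq)
    while True:
        # random.Random.randint includes both endpoints, so no "+ 1" here.
        chunk = "".join(islice(it, rng.randint(min_chunk, max_chunk)))
        if len(chunk) < min_chunk:
            break
        yield chunk


def distinct_kmers_seen(seq, k_low, k_high, passes, seed=0):
    """Distinct chunks observed over several random chunking passes."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(passes):
        seen.update(random_chunks(rng, seq, k_low, k_high))
    return seen


# Toy random "genome" (assumption: uniform bases, far shorter than E. coli).
genome_rng = random.Random(42)
toy_genome = "".join(genome_rng.choice("ACGT") for _ in range(20000))

seen = distinct_kmers_seen(toy_genome, 3, 8, passes=1)
n_8mers = sum(1 for kmer in seen if len(kmer) == 8)
print(n_8mers, "of", 4 ** 8, "possible 8-mers observed")
```

Each pass yields only a few thousand chunks spread over all lengths from 3 to 8, so most of the 65536 possible 8-mers cannot appear. Widening the range to 2<=k<=8 spreads the same sequence over one more chunk length and changes which chunks are drawn, which is one plausible reason the two vocabularies do not differ by exactly 16.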