Open BinchaoPeng opened 2 years ago
@pnpnpn could you please take some time to help me? This is important to me, thanks! The E. coli genome can be downloaded from https://regulondb.ccg.unam.mx/menu/download/datasets/files/Gene_sequence.txt. The config:
```yaml
inputs: inputs/E_coli_K12/*.txt
k-low: 2
k-high: 8
vec-dim: 100
epoch: 10
context: 5
out-dir: results/E_coli/
```
Today I compared the k-mers present in the embedding vectors with the complete k-mer composition for 3<=k<=8.
I found two places where they differ:
So I'd like to know whether there is a better way to handle this situation when I compute the embedding. After all, some k-mers are missing from the embedding. Thanks! @pnpnpn @aldro61 @alevenberg
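The comparison described above can be sketched as a set difference between every possible k-mer and the k-mers that actually received a vector. This is only an illustration: `missing_kmers` and the `vocab` argument are hypothetical names, and `vocab` is assumed to be a plain collection of k-mer strings exported from the trained model.

```python
from itertools import product

def missing_kmers(vocab, k_low, k_high, alphabet="ACGT"):
    """Return all k-mers (k_low <= k <= k_high) that are absent from vocab."""
    vocab = set(vocab)
    missing = []
    for k in range(k_low, k_high + 1):
        # Enumerate all len(alphabet)**k possible k-mers of this length.
        for letters in product(alphabet, repeat=k):
            kmer = "".join(letters)
            if kmer not in vocab:
                missing.append(kmer)
    return missing

# Toy vocabulary: only two of the sixteen possible 2-mers are present.
toy_vocab = {"AA", "AC"}
absent = missing_kmers(toy_vocab, 2, 2)
print(len(absent))
```

Grouping the result by length (for example with `collections.Counter(map(len, absent))`) shows which k values account for the gap.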
I found that this seems to be related to the parameter min_count, but then why doesn't the first dimension of the embedding matrix differ by 16 between 2<=k<=8 and 3<=k<=8?
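To make the suspected `min_count` effect concrete, here is a minimal sketch of a word2vec-style frequency cutoff, written with plain Python rather than the library's own code. `apply_min_count` is a hypothetical helper, and it is an assumption that the training step drops tokens seen fewer than `min_count` times in this way.

```python
from collections import Counter

def apply_min_count(tokens, min_count):
    """Keep only tokens that occur at least min_count times."""
    counts = Counter(tokens)
    return {tok for tok, n in counts.items() if n >= min_count}

# Toy corpus of sampled k-mers: the long k-mer appears only once.
tokens = ["ACG", "ACG", "ACG", "TTGACCA", "GGC", "GGC"]

kept_all = apply_min_count(tokens, 1)   # nothing is filtered
kept_some = apply_min_count(tokens, 2)  # the singleton is dropped
print(sorted(kept_all))
print(sorted(kept_some))
```

If a cutoff like this is active, the vocabulary size depends on how often each k-mer was sampled, not only on how many k-mers exist. Adding k=2 to the range changes how the samples are spread across lengths, so the two runs need not differ by exactly 16.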
Maybe your problem is related to this part of the code. If you read the source, you will find that k-mers are not extracted exhaustively from the start to the end of a sequence; instead, the extraction is randomized:
```python
# Excerpt; relies on `from itertools import islice`.
@staticmethod
def random_chunks(rng, li, min_chunk, max_chunk):
    """
    Both min_chunk and max_chunk are inclusive
    """
    it = iter(li)
    while True:
        head_it = islice(it, rng.randint(min_chunk, max_chunk + 1))
        nxt = ''.join(head_it)
        # throw out chunks that are not within the kmer range
        if len(nxt) >= min_chunk:
            yield nxt
        else:
            break
```
Because the human genome is relatively large, random sampling is likely to cover every possible k-mer. The E. coli genome is much smaller, so far fewer chunks are sampled, and it is possible that some k-mers were never drawn. That would explain the missing k-mers in the counts you mentioned earlier.
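The effect can be illustrated with a small simulation. The sketch below re-implements the chunking idea with Python's standard `random` module (whose `randint` is inclusive on both ends, so no `+ 1` is needed) and compares it with an exhaustive sliding window. The synthetic sequence, the seed, and the helper `sliding_kmers` are assumptions made for illustration and are not part of dna2vec.

```python
import random
from itertools import islice

def random_chunks(rng, seq, min_chunk, max_chunk):
    """Cut seq into consecutive, non-overlapping chunks of random length."""
    it = iter(seq)
    while True:
        chunk = "".join(islice(it, rng.randint(min_chunk, max_chunk)))
        if len(chunk) >= min_chunk:
            yield chunk
        else:
            break  # sequence exhausted (or leftover tail too short)

def sliding_kmers(seq, k_low, k_high):
    """Every k-mer that occurs anywhere in seq, for k_low <= k <= k_high."""
    return {seq[i:i + k]
            for k in range(k_low, k_high + 1)
            for i in range(len(seq) - k + 1)}

rng = random.Random(0)
seq = "".join(rng.choice("ACGT") for _ in range(2000))  # synthetic toy "genome"

sampled = set(random_chunks(rng, seq, 3, 8))
exhaustive = sliding_kmers(seq, 3, 8)
print(len(sampled), len(exhaustive))
```

A single pass of random chunking can only see a subset of the k-mers that a sliding window finds, so on a short sequence many k-mers, especially long ones, are never observed.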
I want to use dna2vec on the E. coli genome. When I set `2<=k<=8`, I got `(86479,100)`; when I set `3<=k<=8`, I got `(86614,100)`. For `3<=k<=8` the correct dimension should be `(87360,100)`, given that $87360+16=4^2+4^3+4^4+4^5+4^6+4^7+4^8$. So I don't know why I got two different results. I also checked every k from 2 to 8 individually and found that the dimension is correct for k from 2 to 7. However, for `k=8`, the dimension is `(64450,100)` rather than `(65536,100)`, and $65536-64450 \neq 87360-86614$. This is horrible! Nothing matches up.
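As a quick check of the arithmetic above, the expected number of distinct k-mers over a four-letter alphabet can be computed directly. `expected_kmer_count` is a hypothetical helper written for this check; the observed row counts are the ones reported in this post.

```python
def expected_kmer_count(k_low, k_high, alphabet_size=4):
    """Number of distinct k-mers for k_low <= k <= k_high."""
    return sum(alphabet_size ** k for k in range(k_low, k_high + 1))

full_2_8 = expected_kmer_count(2, 8)  # 87376 = 87360 + 16
full_3_8 = expected_kmer_count(3, 8)  # 87360
full_8 = expected_kmer_count(8, 8)    # 65536

# Shortfall of each reported run relative to the full composition.
print(full_2_8 - 86479)  # 897 rows missing for 2<=k<=8
print(full_3_8 - 86614)  # 746 rows missing for 3<=k<=8
print(full_8 - 64450)    # 1086 rows missing for k=8 alone
```

The three shortfalls differ from run to run, which is the kind of variation a randomized sampling step would produce.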