dna2vec against large dataset

pnpnpn / dna2vec

dna2vec: Consistent vector representations of variable-length k-mers



dna2vec against large dataset #9

Closed dmachi closed 5 years ago

dmachi commented 6 years ago

We are trying to run dna2vec against a large database (the NCBI nt dataset), which has ~47M sequences in it. Do you know of any issues with doing this, aside from it taking a really long time?

The PROGRESS messages report that we are on sentence 105M, yet we are still on epoch #1. Since 105M is more than twice our ~47M sequences, I would expect us to be on epoch 3 by now.

Do you have any thoughts on why this might be the case?

pnpnpn commented 6 years ago

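The progress messages are generated by gensim, and the sentence count they report is different from your actual number of sequences. The dna2vec algorithm does internal splits, so the number of sentences gensim sees does not match the number of input sequences. See: https://github.com/pnpnpn/dna2vec/blob/master/dna2vec/generators.py#L119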
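To illustrate why the counts diverge, here is a minimal sketch, not the actual dna2vec code. It assumes each input sequence is cut into consecutive fragments of random length and that every fragment is handed to the trainer as one sentence. The names `split_into_fragments` and `sentences`, and the 100 to 200 base fragment lengths, are made up for this example.

```python
import random


def split_into_fragments(seq, min_len, max_len, rng):
    """Hypothetical splitter: cut seq into consecutive fragments.

    Each fragment length is drawn uniformly from [min_len, max_len];
    the final fragment may be shorter than min_len.
    """
    fragments = []
    pos = 0
    while pos < len(seq):
        step = rng.randint(min_len, max_len)
        fragments.append(seq[pos:pos + step])
        pos += step
    return fragments


def sentences(seqs, min_len, max_len, seed=0):
    """Yield one 'sentence' per fragment, as a trainer would consume them."""
    rng = random.Random(seed)
    for seq in seqs:
        for fragment in split_into_fragments(seq, min_len, max_len, rng):
            yield fragment


# Three toy input sequences of 1000 bases each.
seqs = ["ACGT" * 250 for _ in range(3)]

# Count the sentences produced in a single pass (one epoch) over the input.
n_sentences = sum(1 for _ in sentences(seqs, min_len=100, max_len=200))

print(f"input sequences: {len(seqs)}")
print(f"sentences in one epoch: {n_sentences}")
```

In this toy setup every 1000-base sequence yields between 5 and 10 sentences, so a single epoch already reports several times more sentences than there are input sequences. The same effect would explain seeing sentence 105M while still in epoch #1 of a ~47M sequence dataset.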