dmachi closed 5 years ago
The PROGRESS message is generated by gensim, and the sentence count it reports is not the same as your actual number of sequences. dna2vec splits sequences internally, so one input sequence can yield several gensim sentences. See: https://github.com/pnpnpn/dna2vec/blob/master/dna2vec/generators.py#L119
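To make the mismatch concrete, here is a minimal sketch of how internal splitting inflates the sentence count. It is not dna2vec's code: the helper `split_into_sentences`, the rule of splitting at runs of non-ACGT characters, and the fixed disjoint k-mer length are all assumptions made for illustration. The linked generators.py is the authority on what dna2vec actually does.

```python
import re

def split_into_sentences(sequences, k=3):
    """Hypothetical helper, not part of dna2vec or gensim.

    Assumption: each sequence is cut at runs of non-ACGT characters
    (for example N), and every resulting fragment becomes one
    "sentence" of disjoint k-mers handed to the trainer.
    """
    sentences = []
    for seq in sequences:
        for fragment in re.split(r"[^ACGTacgt]+", seq):
            if len(fragment) < k:
                continue  # too short to hold a single k-mer
            # Disjoint k-mers: step by k, drop any trailing remainder.
            kmers = [fragment[i:i + k] for i in range(0, len(fragment) - k + 1, k)]
            sentences.append(kmers)
    return sentences

sequences = ["ACGTACNNNGGTACC", "TTGACANACGTNNCCGGA"]
sentences = split_into_sentences(sequences)

# Two input sequences produce five sentences under these assumptions,
# so a sentence counter runs ahead of the sequence count.
print(len(sequences), len(sentences))
```

Under a scheme like this, dividing the reported sentence number by the number of input sequences overestimates the epoch, which would be consistent with seeing sentence 105m while still in epoch 1.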
We are trying to run dna2vec against a large database (the NCBI nt dataset), which has ~47m sequences in it. Do you know of any issues with doing this (aside from it taking a really long time)?
The PROGRESS messages report that we are on sentence 105m, but we are still on epoch #1. With ~47m sequences, I would expect sentence 105m to fall in epoch 3 (105m / 47m ≈ 2.2).
Do you have any thoughts on why this might be the case?