Also, I'll update DatasetReader to vary the order in which files are processed.
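A minimal sketch of one way to do that, assuming DatasetReader keeps a list of filenames (everything here other than the class name is an assumption about the internals):

```python
import random

class DatasetReader:
    """Sketch only: shuffle the file processing order each epoch.
    Attribute and method names here are assumptions, not the real API."""

    def __init__(self, filenames, seed=None):
        self.filenames = list(filenames)
        self.rng = random.Random(seed)

    def files_for_epoch(self):
        # Return a fresh shuffled copy so each epoch sees a different order,
        # without mutating the canonical file list.
        order = self.filenames[:]
        self.rng.shuffle(order)
        return order
```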
After varying both K and the stride, it seems the inability to learn is a function of both K being large and the stride being too large. I'll produce a figure with the results of my experiments.
Also, a replication in the TF implementation of w2v would help convince me that the present implementation is working.
It looks like we can learn if informative k-mers are close enough to each other, so keep the stride low, and maybe keep K low so as not to drop too many useful tokens.
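For illustration, a small sketch of how K and the stride interact when tokenizing a probe sequence into k-mers (the function and variable names are made up for this example, not taken from the codebase):

```python
def kmer_tokens(seq, k, stride):
    """Slide a window of length k over seq, advancing by stride.
    With stride == 1 every overlapping k-mer is kept; with a larger stride,
    intermediate k-mers are skipped, so informative k-mers end up farther
    apart in the token stream (or are dropped entirely)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

# Example: a 12-bp probe.
seq = "ACGTACGTACGT"
print(kmer_tokens(seq, k=6, stride=1))  # 7 overlapping 6-mers
print(kmer_tokens(seq, k=6, stride=4))  # only 2 tokens survive
```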
I'm seeing a disturbing pattern: performance does not improve from one epoch to the next.
Different macro-batches show better or worse performance, but the pattern is very consistent with which files make up each macro-batch.
Below is a trace of the output from a sample embedding, trained on the positive probes that pass Han's QC stages. It's clear that across epochs, the macro-batches which cover the same files in the data yield an average loss that does not decrease in an epoch-dependent manner:
This could be due to lots of things, but most likely it's a failure of the DatasetReader to present the data in the format that the model (either CBOW or skip-gram) actually expects. This is a problem. I'll try testing with a gensim implementation.
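Sketch of what that gensim cross-check could look like (the toy sentences and parameter values below are placeholders; the real input would be the k-mer token lists the DatasetReader produces):

```python
from gensim.models import Word2Vec

# Toy sentences standing in for the real k-mer token lists; in practice
# these would come from the tokenized probe sequences.
sentences = [
    ["ACGTAC", "CGTACG", "GTACGT"],
    ["TTGACA", "TGACAG", "GACAGT"],
]

# gensim >= 4.0 API; parameter values are placeholders, not tuned.
model = Word2Vec(
    sentences,
    vector_size=32,   # embedding dimension
    window=5,         # context window in tokens
    min_count=1,      # keep rare k-mers
    sg=1,             # 1 = skip-gram, 0 = CBOW
    epochs=5,
)

# If gensim learns sensible embeddings from the same token lists, the fault
# more likely lies in how the DatasetReader batches (context, target) pairs
# than in the model itself.
print(model.wv.most_similar("ACGTAC", topn=2))
```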