Did some preliminary refactoring. The major differences are the use of new data structures and the removal of random file testing. I replaced the random testing infrastructure with a function that yields random entries (still in file order) from the large file as OffersCorpusEntry objects. It's kind of hard to test a random process, since I don't know in advance which entries will be tested...
The random sample was there simply to avoid biasing toward a single site's entries, which are grouped together in the corpus. You could accomplish the same thing by jumping around in the corpus in a deterministic way.
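For what it's worth, here is a rough sketch of what "jumping around deterministically" could look like: take every Nth line, so the sample spreads across sites but is identical on every run. The name `strided_sample` and its `stride`/`offset` parameters are made up for illustration and are not in the repo; turning each line into an OffersCorpusEntry is left out since that depends on the existing parsing code.

```python
from itertools import islice
from typing import Iterable, Iterator


def strided_sample(lines: Iterable[str], stride: int, offset: int = 0) -> Iterator[str]:
    """Return an iterator over every `stride`-th line, starting at `offset`.

    Hypothetical helper: the output keeps file order and is fully
    reproducible, so a test can assert on exactly which entries come back.
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    if offset < 0:
        raise ValueError("offset must be >= 0")
    # islice reads lazily, so the large file is never loaded into memory.
    return islice(lines, offset, None, stride)


# Toy corpus: entries are grouped by site, four offers per site.
corpus = [f"site{i // 4}-offer{i % 4}" for i in range(12)]
sample = list(strided_sample(corpus, stride=5))
# sample == ["site0-offer0", "site1-offer1", "site2-offer2"]
```

Picking a stride larger than the typical per-site group size should keep any one site from dominating the sample, though that depends on how the corpus is actually laid out.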