Closed: lzamparo closed this issue 7 years ago.
Via the commit last night:
Still need to figure out: (1) how to pass a parsing function to w2v without breaking the interface, and (2) tests.
For (1), I can try: `DatasetReader.parse` calls the appropriately constructed `SequenceParser.parse` method, and the `SequenceParser` is built by the driver script and passed to the `DatasetReader` constructor. <<--- vastly preferred

For (2), I can re-run the test suite for `DatasetReader`, but also make sure that a handful of hand-crafted SELEX probes get parsed properly.
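The parser-injection option could look something like the following sketch. The class names, the `k`-mer tokenization, and the constructor signature are illustrative assumptions, not the project's actual API:

```python
class SelexParser:
    """Hypothetical parser: turns one line of a SELEX probe file
    into overlapping k-mer tokens."""

    def __init__(self, k=8):
        self.k = k  # k-mer width

    def parse(self, line):
        seq = line.strip().upper()
        # slide a window of width k along the probe sequence
        return [seq[i:i + self.k] for i in range(len(seq) - self.k + 1)]


class DatasetReader:
    """Sketch of a reader that delegates all tokenization to an
    injected parser, leaving the w2v interface untouched."""

    def __init__(self, parser):
        self.parser = parser  # built by the driver script

    def parse(self, line):
        # DatasetReader.parse simply forwards to the injected parser
        return self.parser.parse(line)


reader = DatasetReader(SelexParser(k=4))
tokens = reader.parse("ACGTAC\n")  # -> ["ACGT", "CGTA", "GTAC"]
```

The driver stays in control of how probes are tokenized, and the reader never needs to know which parser it was handed.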
Checked in code to implement (2), now need to focus on aliasing the tokens:
For token aliasing, I should extend the `UnigramDictionary` and `TokenMap` so that instead of mapping tokens to IDs, they map tokens to tuples `(token, RC(token))`, and then tuples to IDs. This means extending both the `UnigramDictionary` class and the `TokenMap` class:

- `SeqTokenMap` extends `TokenMap`, and maps tokens to tuples, and tuples to IDs.
- `SeqUnigramDictionary` extends `UnigramDictionary`, but has a `SeqTokenMap` instead of a `TokenMap` member.

This will be more expensive storage-wise, but only slightly, since the number of tuples will be roughly half the number of tokens in the dictionary.
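A minimal sketch of the tuple-aliasing scheme, assuming DNA tokens over ACGT; the `SeqTokenMap` here is a standalone illustration rather than a true subclass of the repo's `TokenMap`:

```python
_COMP = str.maketrans("ACGT", "TGCA")

def rc(token):
    """Reverse complement of a DNA token."""
    return token.translate(_COMP)[::-1]


class SeqTokenMap:
    """Maps tokens to canonical (token, RC(token)) tuples, and
    tuples to IDs, so a k-mer and its RC share one ID."""

    def __init__(self):
        self.tuple_to_id = {}
        self.id_to_tuple = []

    def _canonical(self, token):
        # order the pair so token and rc(token) yield the same tuple
        pair = (token, rc(token))
        return pair if pair[0] <= pair[1] else (pair[1], pair[0])

    def add(self, token):
        key = self._canonical(token)
        if key not in self.tuple_to_id:
            self.tuple_to_id[key] = len(self.id_to_tuple)
            self.id_to_tuple.append(key)
        return self.tuple_to_id[key]

    def get_id(self, token):
        return self.tuple_to_id[self._canonical(token)]


tmap = SeqTokenMap()
id_fwd = tmap.add("AACC")
id_rev = tmap.get_id("GGTT")  # same ID: GGTT is the RC of AACC
```

Ordering the pair canonically is what guarantees a token and its reverse complement resolve to the same ID, which is why the tuple count comes out to roughly half the token count.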
Rewrote a lot of the `CounterSampler`, `UnigramDictionary` and `TokenMap` code so that there now exist classes specifically for sequence data.
Need to run end-to-end testing to see if models still work, but for now it looks as if indexing tokens by themselves works; the key is separating the process of sampling within sentences from the process of sampling from the unigram dictionary.
Specifically, need to test that the correct form of `SequenceParser.parse()` is getting propagated to the `dataset_reader.generate_dataset_worker` processes.
Tests are in; it looks like everything is working, but macrobatches are produced much more slowly in the `rc_token` branch. Profiling now to see where all the time gets spent.
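For reference, output in this shape can be produced by running the macrobatch generator under `cProfile` and sorting by internal time; the `generate_macrobatches` stand-in below is hypothetical, not the project's actual entry point:

```python
import cProfile
import io
import pstats


def generate_macrobatches():
    # stand-in for the real dataset_reader work being profiled
    return sum(i * i for i in range(100000))


profiler = cProfile.Profile()
profiler.enable()
generate_macrobatches()
profiler.disable()

# render the stats the same way the dumps below are formatted:
# sorted by internal time, restricted to the top 40 entries
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("tottime").print_stats(40)
report = stream.getvalue()
```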
Profiling results are in: the bottleneck is re-creating a list of token positions from which to sample each time. Here are the profiling results before:
```
Ordered by: internal time, call count
List reduced from 74 to 40 due to restriction <40>

   ncalls            tottime  percall   cumtime  percall  filename:lineno(function)
  2160405           2385.657    0.001  2385.657    0.001  unigram_dictionary.py:223(get_token)
  2160405             19.890    0.000  2440.572    0.001  unigram_dictionary.py:314(sample)
  2448459             14.016    0.000    14.016    0.000  embedding_utils.py:53(ensure_str)
  2160426/2160405      9.229    0.000    11.182    0.000  counter_sampler.py:292(sample)
   144027              6.856    0.000  2447.428    0.017  dataset_reader.py:685()
  2448459              6.251    0.000    20.267    0.000  token_map.py:109(get_id)
  2448459              5.802    0.000    26.069    0.000  unigram_dictionary.py:206(get_id)
   144028              1.988    0.000  2457.896    0.017  dataset_reader.py:634(generate_examples)
   144027              1.517    0.000     2.961    0.000  dataset_reader.py:696(do_discard)
```
Fixed this by keeping a fixed list of tokens in the `UnigramDictionary` from which to sample positionally; that list must be kept up to date as tokens are added. The profiling results after:
```
   ncalls            tottime  percall   cumtime  percall  filename:lineno(function)
  2160405             12.373    0.000    38.797    0.000  unigram_dictionary.py:335(sample)
  2448459              9.449    0.000     9.449    0.000  embedding_utils.py:53(ensure_str)
  2160426/2160405      7.455    0.000     8.915    0.000  counter_sampler.py:292(sample)
   144027              5.807    0.000    44.603    0.000  dataset_reader.py:685()
  2448459              4.521    0.000    13.969    0.000  token_map.py:109(get_id)
  2448459              3.369    0.000    17.339    0.000  unigram_dictionary.py:221(get_id)
  2160405              2.203    0.000     2.203    0.000  unigram_dictionary.py:238(get_token)
   144028              1.553    0.000    52.914    0.000  dataset_reader.py:634(generate_examples)
  6913359              0.952    0.000     0.952    0.000  {built-in method builtins.hasattr}
```
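The fix amounts to trading a per-call list rebuild for a cached list that is only ever appended to on insertion. A rough sketch, with illustrative names rather than the actual `UnigramDictionary` internals:

```python
import random


class UnigramDictionarySketch:
    """Keeps a flat token list in sync with the token-to-ID map, so
    positional sampling never has to rebuild a list of positions."""

    def __init__(self):
        self.token_to_id = {}
        self.tokens = []  # updated on every add(), never reconstructed

    def add(self, token):
        if token not in self.token_to_id:
            self.token_to_id[token] = len(self.tokens)
            self.tokens.append(token)

    def get_token(self, idx):
        # O(1) lookup against the cached list; rebuilding this list on
        # every call is what dominated the "before" profile
        return self.tokens[idx]

    def sample(self, rng=random):
        return self.tokens[rng.randrange(len(self.tokens))]


d = UnigramDictionarySketch()
for t in ("AAAA", "CCCC", "GGGG"):
    d.add(t)
```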
All unit tests pass; going to try an end-to-end test.
End-to-end test passed, merging.
Not sure of the easiest way to represent this, but probably a very small class with primary & RC representations, which is used to key the unigram dict, and also to index the embedding & decoding matrices.
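That small class might be sketched as a hashable pair stored in canonical order, so it can key the unigram dict directly; this is a speculative sketch, not committed code:

```python
class SeqToken:
    """A token together with its reverse complement, stored in a
    canonical order so a k-mer and its RC hash to the same key."""

    __slots__ = ("primary", "rc")
    _COMP = str.maketrans("ACGT", "TGCA")

    def __init__(self, token):
        rc = token.translate(self._COMP)[::-1]
        # canonical order: equality no longer depends on which strand
        # the token was read from
        self.primary, self.rc = sorted((token, rc))

    def __eq__(self, other):
        if not isinstance(other, SeqToken):
            return NotImplemented
        return (self.primary, self.rc) == (other.primary, other.rc)

    def __hash__(self):
        return hash((self.primary, self.rc))


fwd = SeqToken("AACC")
rev = SeqToken("GGTT")  # RC of AACC: compares and hashes equal to fwd
```

Because the two orientations compare and hash equal, either one can be used to look up the shared row in the embedding and decoding matrices.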