Closed: lzamparo closed this issue 7 years ago.
Via the commit last night:
Still need to figure out: (1) how to pass a parsing function to w2v without breaking the interface, and (2) tests.
For (1), I can try: `DatasetReader.parse` calls the appropriately constructed `SequenceParser.parse` method, and the `SequenceParser` is built by the driver script and passed to the `DatasetReader` constructor. <<--- vastly preferred

For (2), I can re-run the test suite for `DatasetReader`, but also make sure that a handful of hand-crafted SELEX probes get parsed properly.
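The parser-injection option could look something like the following sketch. The class names, the `k`-mer tokenization, and the constructor signature are illustrative assumptions, not the project's actual API:

```python
class SelexParser:
    """Hypothetical parser: turns one line of a SELEX probe file
    into overlapping k-mer tokens."""

    def __init__(self, k=8):
        self.k = k  # k-mer width

    def parse(self, line):
        seq = line.strip().upper()
        # slide a window of width k along the probe sequence
        return [seq[i:i + self.k] for i in range(len(seq) - self.k + 1)]


class DatasetReader:
    """Sketch of a reader that delegates all tokenization to an
    injected parser, leaving the w2v interface untouched."""

    def __init__(self, parser):
        self.parser = parser  # built by the driver script

    def parse(self, line):
        # DatasetReader.parse simply forwards to the injected parser
        return self.parser.parse(line)


reader = DatasetReader(SelexParser(k=4))
tokens = reader.parse("ACGTAC\n")  # -> ["ACGT", "CGTA", "GTAC"]
```

The driver stays in control of how probes are tokenized, and the reader never needs to know which parser it was handed.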
Checked in code to implement (2), now need to focus on aliasing the tokens:
For token aliasing, I should extend the `UnigramDictionary` and `TokenMap` so that instead of mapping tokens to IDs, they map tokens to tuples `(token, RC(token))`, and then tuples to IDs. This means extending both the `UnigramDictionary` class and the `TokenMap` class:

- `SeqTokenMap` extends `TokenMap`, and maps tokens to tuples, and tuples to IDs.
- `SeqUnigramDictionary` extends `UnigramDictionary`, but has a `SeqTokenMap` instead of a `TokenMap` member.

This will be more expensive storage-wise, but only slightly, since the number of tuples will be roughly half the number of tokens in the dictionary.
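A minimal sketch of the tuple-aliasing scheme, assuming DNA tokens over ACGT; the `SeqTokenMap` here is a standalone illustration rather than a true subclass of the repo's `TokenMap`:

```python
_COMP = str.maketrans("ACGT", "TGCA")

def rc(token):
    """Reverse complement of a DNA token."""
    return token.translate(_COMP)[::-1]


class SeqTokenMap:
    """Maps tokens to canonical (token, RC(token)) tuples, and
    tuples to IDs, so a k-mer and its RC share one ID."""

    def __init__(self):
        self.tuple_to_id = {}
        self.id_to_tuple = []

    def _canonical(self, token):
        # order the pair so token and rc(token) yield the same tuple
        pair = (token, rc(token))
        return pair if pair[0] <= pair[1] else (pair[1], pair[0])

    def add(self, token):
        key = self._canonical(token)
        if key not in self.tuple_to_id:
            self.tuple_to_id[key] = len(self.id_to_tuple)
            self.id_to_tuple.append(key)
        return self.tuple_to_id[key]

    def get_id(self, token):
        return self.tuple_to_id[self._canonical(token)]


tmap = SeqTokenMap()
id_fwd = tmap.add("AACC")
id_rev = tmap.get_id("GGTT")  # same ID: GGTT is the RC of AACC
```

Ordering the pair canonically is what guarantees a token and its reverse complement resolve to the same ID, which is why the tuple count comes out to roughly half the token count.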
Rewrote a lot of the `CounterSampler`, `UnigramDictionary` and `TokenMap` code so that there now exist classes specifically for sequence data.
Need to run end-to-end testing to see if models still work, but for now it looks as if indexing tokens by themselves works; the key is separating the process of sampling within sentences from the process of sampling from the unigram dictionary.
Specifically, need to test that the correct form of `SequenceParser.parse()` is getting propagated to the `dataset_reader.generate_dataset_worker` processes.
Tests are in; it looks like everything is working, but macrobatches are produced much more slowly in the `rc_token` branch. Profiling now to see where all the time gets spent.
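For reference, output in this shape can be produced by running the macrobatch generator under `cProfile` and sorting by internal time; the `generate_macrobatches` stand-in below is hypothetical, not the project's actual entry point:

```python
import cProfile
import io
import pstats


def generate_macrobatches():
    # stand-in for the real dataset_reader work being profiled
    return sum(i * i for i in range(100000))


profiler = cProfile.Profile()
profiler.enable()
generate_macrobatches()
profiler.disable()

# render the stats the same way the dumps below are formatted:
# sorted by internal time, restricted to the top 40 entries
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("tottime").print_stats(40)
report = stream.getvalue()
```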
Profiling results are in: the bottleneck is re-creating a list of token positions from which to sample each time. Here are the profiling results before:
```
Ordered by: internal time, call count
List reduced from 74 to 40 due to restriction <40>

   ncalls            tottime  percall   cumtime  percall  filename:lineno(function)
  2160405           2385.657    0.001  2385.657    0.001  unigram_dictionary.py:223(get_token)
  2160405             19.890    0.000  2440.572    0.001  unigram_dictionary.py:314(sample)
  2448459             14.016    0.000    14.016    0.000  embedding_utils.py:53(ensure_str)
  2160426/2160405      9.229    0.000    11.182    0.000  counter_sampler.py:292(sample)
   144027              6.856    0.000  2447.428    0.017  dataset_reader.py:685()
  2448459              6.251    0.000    20.267    0.000  token_map.py:109(get_id)
  2448459              5.802    0.000    26.069    0.000  unigram_dictionary.py:206(get_id)
   144028              1.988    0.000  2457.896    0.017  dataset_reader.py:634(generate_examples)
   144027              1.517    0.000     2.961    0.000  dataset_reader.py:696(do_discard)
```
Fixed this by keeping a fixed list of tokens in the `UnigramDictionary` from which to sample positionally; that list must be kept up to date as tokens are added. The profiling results after:
```
   ncalls            tottime  percall   cumtime  percall  filename:lineno(function)
  2160405             12.373    0.000    38.797    0.000  unigram_dictionary.py:335(sample)
  2448459              9.449    0.000     9.449    0.000  embedding_utils.py:53(ensure_str)
  2160426/2160405      7.455    0.000     8.915    0.000  counter_sampler.py:292(sample)
   144027              5.807    0.000    44.603    0.000  dataset_reader.py:685()
  2448459              4.521    0.000    13.969    0.000  token_map.py:109(get_id)
  2448459              3.369    0.000    17.339    0.000  unigram_dictionary.py:221(get_id)
  2160405              2.203    0.000     2.203    0.000  unigram_dictionary.py:238(get_token)
   144028              1.553    0.000    52.914    0.000  dataset_reader.py:634(generate_examples)
  6913359              0.952    0.000     0.952    0.000  {built-in method builtins.hasattr}
```
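The fix amounts to trading a per-call list rebuild for a cached list that is only ever appended to on insertion. A rough sketch, with illustrative names rather than the actual `UnigramDictionary` internals:

```python
import random


class UnigramDictionarySketch:
    """Keeps a flat token list in sync with the token-to-ID map, so
    positional sampling never has to rebuild a list of positions."""

    def __init__(self):
        self.token_to_id = {}
        self.tokens = []  # updated on every add(), never reconstructed

    def add(self, token):
        if token not in self.token_to_id:
            self.token_to_id[token] = len(self.tokens)
            self.tokens.append(token)

    def get_token(self, idx):
        # O(1) lookup against the cached list; rebuilding this list on
        # every call is what dominated the "before" profile
        return self.tokens[idx]

    def sample(self, rng=random):
        return self.tokens[rng.randrange(len(self.tokens))]


d = UnigramDictionarySketch()
for t in ("AAAA", "CCCC", "GGGG"):
    d.add(t)
```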
All unit tests pass; going to try an end-to-end test.
End-to-end test passed, merging.
Not sure of the easiest way to represent this, but probably a very small class with primary & RC representations, which is used to key the unigram dict, and also to index the embedding & decoding matrices.
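That small class might be sketched as a hashable pair stored in canonical order, so it can key the unigram dict directly; this is a speculative sketch, not committed code:

```python
class SeqToken:
    """A token together with its reverse complement, stored in a
    canonical order so a k-mer and its RC hash to the same key."""

    __slots__ = ("primary", "rc")
    _COMP = str.maketrans("ACGT", "TGCA")

    def __init__(self, token):
        rc = token.translate(self._COMP)[::-1]
        # canonical order: equality no longer depends on which strand
        # the token was read from
        self.primary, self.rc = sorted((token, rc))

    def __eq__(self, other):
        if not isinstance(other, SeqToken):
            return NotImplemented
        return (self.primary, self.rc) == (other.primary, other.rc)

    def __hash__(self):
        return hash((self.primary, self.rc))


fwd = SeqToken("AACC")
rev = SeqToken("GGTT")  # RC of AACC: compares and hashes equal to fwd
```

Because the two orientations compare and hash equal, either one can be used to look up the shared row in the embedding and decoding matrices.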