uclnlp / jack

Jack the Reader
MIT License

Calculating precise Test Coverage #262

Closed dirkweissenborn closed 6 years ago

riedelcastro commented 6 years ago

This issue is a little underspecified. Hard to say when we have achieved it. @ninjin do you have a sense of where coverage is missing right now?

pminervini commented 6 years ago

Taking a look at this right now... (it's also a chance to address #257 and #265)

TimDettmers commented 6 years ago

There are software modules which automagically determine test coverage. I think we have the most important parts covered with the overfit and smalldata tests. These tests are sensitive to any change that alters the model result and will give a good indication that problems exist. The location of the problems may not be obvious, but it gives enough safety not to regress on model performance.

The question would thus turn into: Do we have overfit/smalldata tests for all models?

I think this should be the main aim.
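
For illustration, an overfit test is roughly of this shape (a minimal sketch; `build_reader` and `tiny_dataset` are hypothetical placeholders, not actual jack APIs):

```python
import numpy as np

def test_reader_overfits_tiny_dataset():
    # build_reader and tiny_dataset are hypothetical placeholders, not jack APIs.
    reader = build_reader({"epochs": 50, "batch_size": 4})
    metrics = reader.train(tiny_dataset)
    # A model with enough capacity should (almost) memorise a handful of examples,
    # so a drop in this metric signals a regression somewhere in the model code.
    assert np.allclose(metrics["exact_match"], 1.0, atol=0.05)
```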

ghost commented 6 years ago

Codecov would allow us to track things automagically, and it's free for one private repo. Setting it up is easy.
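
The test run only needs to produce a coverage report that the Codecov uploader in CI can pick up; a minimal sketch, assuming the pytest-cov plugin is installed:

```python
import pytest

# Run the test suite with coverage enabled for the jack package and
# write an XML report (coverage.xml) for the Codecov uploader to consume.
pytest.main(["--cov=jack", "--cov-report=xml", "tests/"])
```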

pminervini commented 6 years ago

How do we work on maximising coverage? Is it by adding test cases?

ghost commented 6 years ago

@pminervini The easiest way is to look at the Codecov.io report, find functions without coverage, and then write tests for them.

riedelcastro commented 6 years ago

Are we confident that the reports are accurate?

I might be reading this incorrectly, but looking at, say

https://codecov.io/gh/uclmr/jack/src/master/jack/core/shared_resources.py

it seems that store is not called (seeing as its body is red), even though it's clearly called here:

https://github.com/uclmr/jack/blob/master/tests/jack/test_core.py

Also, how does the fact that some of our regression tests call train_reader via an OS process (and not directly through the API) affect the coverage calculation? Does the coverage tool figure this out?

TimDettmers commented 6 years ago

The reports are not accurate because the integration tests are run through process invocations rather than Python code. It might be a good investment to fix this so we can gauge our test coverage. I might have a look at this later.
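
As an alternative to rewriting the invocations, coverage.py can also measure subprocesses; a sketch of that recipe (assuming a `.coveragerc` with `parallel = True` and the `COVERAGE_PROCESS_START` environment variable pointing at it):

```python
# sitecustomize.py -- placed on the Python path of the test environment.
# With COVERAGE_PROCESS_START pointing at a .coveragerc that sets
# "parallel = True", every Python subprocess starts its own coverage
# measurement; `coverage combine` then merges the per-process data files.
import coverage

coverage.process_startup()
```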

riedelcastro commented 6 years ago

Yeah. But https://github.com/uclmr/jack/blob/master/tests/jack/test_core.py runs through normal pytest, and it still looks like it's not captured.

riedelcastro commented 6 years ago

@TimDettmers I agree this would be a good investment. Not just for testing, but also because it would force us to decouple the training logic from the "calling a command line script" logic. There were several situations where I wanted to do what train_reader does, but programmatically through the API. There is reader.train(), but it does substantially less.

TimDettmers commented 6 years ago

I wrote code for it which works, but things get difficult when you run successive TensorFlow graphs in one process (I remembered that this was the issue in the first place; that is why I fell back on spawning new processes). I run into errors: either missing variables after I reset the graph, or wrong scores due to variables that were not deleted. I tried scoping (even scoping before the imports) and I tried resetting the graph, but I still run into these errors.

@dirkweissenborn maybe you know more about TensorFlow and can fix this? I am at my wits' end. The branch is e3ce9dd3615332646c3b826f30cac0fd063db942. Can you have a look?
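
One possible direction (a sketch only, not the code on that branch) would be to isolate each run in its own graph and session instead of resetting the shared default graph:

```python
import tensorflow as tf

def run_isolated(build_and_train, config):
    # Give each integration test its own graph and session (TF 1.x style)
    # so that variables from a previous run cannot leak into the next one.
    # build_and_train is a hypothetical placeholder for the test's setup code.
    graph = tf.Graph()
    with graph.as_default(), tf.Session(graph=graph) as sess:
        return build_and_train(config, session=sess)
```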

dirkweissenborn commented 6 years ago

@TimDettmers can you post the errors you encounter?

TimDettmers commented 6 years ago

The errors are (1) without tf.reset_default_graph()

E           AssertionError: Metric value different from expected results!
E           assert False
E            +  where False = <function allclose at 0x7fa7e9e8a840>([0.0], [0.059039999], atol=0.05)
E            +    where <function allclose at 0x7fa7e9e8a840> = np.allclose

and (2) with tf.reset_default_graph()

       InvalidArgumentError (see above for traceback): indices[0] = -8388608 is not in [0, 28404)
E            [[Node: fastqa_reader_False/fast_qa/Gather_3 = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, validate_indices=true, _device="/job:localhost/replica:0/task:0/cpu:0"](fastqa_reader_False/fast_qa/Reshape_2, fastqa_reader_False/fast_qa/add_4)]]

The second one seems to be a vocab error, but I looked at it and could not figure it out.

dirkweissenborn commented 6 years ago

@TimDettmers

In the test, the vocab is built with

vocab = Vocab(emb=emb, init_from_embeddings=False)

but init_from_embeddings needs to be True for FastQA-like readers. In bin/jack-train.py it is:

vocab = Vocab(emb=emb, init_from_embeddings=vocab_from_embeddings)

where vocab_from_embeddings is a configuration arg.
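
So in the test it would need to be something like (assuming emb is loaded the same way as in bin/jack-train.py):

```python
# Corresponding fix on the test side: initialise the vocab from the embeddings.
vocab = Vocab(emb=emb, init_from_embeddings=True)
```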

TimDettmers commented 6 years ago

I tried this just now with some other parameters from fastqa.yaml, but that did not seem to work out. Still getting index errors for the offset, it seems:

      File "/home/tim/git/jack/jack/core/model_module.py", line 166, in setup
E           self.shared_resources, *[self._tensors[port] for port in self.input_ports])
E         File "/home/tim/git/jack/jack/readers/extractive_qa/fastqa.py", line 156, in create_output
E           beam_size=shared_vocab_config.config.get("beam_size", 1))
E         File "/home/tim/git/jack/jack/readers/extractive_qa/fastqa.py", line 226, in fastqa_answer_layer
E           u_s = tf.gather(support_states_flat, start_pointer + offsets)
E         File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 2409, in gather
E           validate_indices=validate_indices, name=name)
E         File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 1219, in gather
E           validate_indices=validate_indices, name=name)
E         File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
E           op_def=op_def)
E         File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
E           original_op=self._default_original_op, op_def=op_def)
E         File "/home/tim/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
E           self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access
E       
E       InvalidArgumentError (see above for traceback): indices[0] = -2147483648 is not in [0, 28404)
E            [[Node: fastqa_reader_False/fast_qa/Gather_3 = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, validate_indices=true, _device="/job:localhost/replica:0/task:0/cpu:0"](fastqa_reader_False/fast_qa/Reshape_2, fastqa_reader_False/fast_qa/add_4)]]

For BiDAF I get errors because it basically trains too slowly. I have no idea where that is coming from, but it seems to be a config issue, since running it normally from the command line works just fine. Any ideas?

undwie commented 6 years ago

I think I know the reason for my problem: currently our wercker script makes three separate calls to pytest. Each call overwrites the coverage information, so only the data from the last call survives; coverage is never aggregated. I am trying to fix this now by running all tests in one call.
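
If a single call turns out to be impractical, another sketch would be to let later calls append to the existing coverage data via pytest-cov's --cov-append flag (the suite paths below are placeholders for the three suites the wercker script currently runs):

```python
import pytest

# First call writes .coverage; subsequent calls with --cov-append add to it
# instead of overwriting, and the last one emits the XML report for Codecov.
pytest.main(["--cov=jack", "tests/unit"])
pytest.main(["--cov=jack", "--cov-append", "tests/integration"])
pytest.main(["--cov=jack", "--cov-append", "--cov-report=xml", "tests/regression"])
```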

TimDettmers commented 6 years ago

@undwie nice catch! @dirkweissenborn do you have any insight into my error above? With that, we could get an even more accurate estimate of coverage.

TimDettmers commented 6 years ago

I will try to fix the errors and, failing that, close the issue.

TimDettmers commented 6 years ago

The bug still persists and nobody knows how to fix it. Since this approach has failed and no progress can be made, I am closing this issue.