microsoft / macaw

An Extensible Conversational Information Seeking Platform
MIT License

Tokenizer issues / errors and live_main.py capabilities #3

Closed elmeyer closed 4 years ago

elmeyer commented 4 years ago

I am trying to get Macaw to work, expecting results similar to Figure 1 (b) in the paper. I am currently working from a clean ubuntu:bionic Docker image, since it provides Python 3.6 by default and lets me install Java 8 (for Stanford CoreNLP). Long story short, I am able to run `python3 live_main.py` and reach the `ENTER COMMAND:` prompt with stdio as the interface.

Firstly, the default simple tokenizer, as set in `drqa_mrc.py`, causes the following error (though it does not, of course, affect the retrieval of a list of URLs from Bing):

Macaw Logger - 2020-02-26 15:54:58,964 - INFO - New query: who is barack obama
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/macaw-0.1-py3.6.egg/macaw/core/input_handler/actions.py", line 104, in run_action
    return_dict[action] = func_timeout(params['timeout'], action_func, args=[conv_list, params])
  File "/usr/local/lib/python3.6/dist-packages/func_timeout/dafunc.py", line 108, in func_timeout
    raise_exception(exception)
  File "/usr/local/lib/python3.6/dist-packages/func_timeout/py3_raise.py", line 7, in raise_exception
    raise exception[0] from None
  File "/usr/local/lib/python3.6/dist-packages/macaw-0.1-py3.6.egg/macaw/core/input_handler/actions.py", line 81, in run
    return params['actions']['qa'].get_results(conv_list, doc)
  File "/usr/local/lib/python3.6/dist-packages/macaw-0.1-py3.6.egg/macaw/core/mrc/drqa_mrc.py", line 76, in get_results
    predictions = self.predictor.predict(doc, q, None, self.params['qa_results_requested'])
  File "/root/DrQA/drqa/reader/predictor.py", line 88, in predict
    results = self.predict_batch([(document, question, candidates,)], top_n)
  File "/root/DrQA/drqa/reader/predictor.py", line 128, in predict_batch
    batch_exs = batchify([vectorize(e, self.model) for e in examples])
  File "/root/DrQA/drqa/reader/predictor.py", line 128, in <listcomp>
    batch_exs = batchify([vectorize(e, self.model) for e in examples])
  File "/root/DrQA/drqa/reader/vector.py", line 33, in vectorize
    q_lemma = {w for w in ex['qlemma']} if args.use_lemma else None
TypeError: 'NoneType' object is not iterable
THE RESPONSE STARTS
----------------------------------------------------------------------
#get_doc https://www.biography.com/us-president/barack-obama  |  Barack Obama - U.S. Presidency, Education & Family - Biography
#get_doc https://en.wikipedia.org/wiki/Barack_Obama  |  Barack Obama - Wikipedia
#get_doc https://www.britannica.com/biography/Barack-Obama  |  Barack Obama | Biography, Presidency, & Facts | Britannica
----------------------------------------------------------------------
THE RESPONSE STARTS

... which makes sense, given the earlier warning from DrQA stating:

WARNING:drqa.tokenizers.simple_tokenizer:SimpleTokenizer only tokenizes! Skipping annotators: {'pos', 'ner', 'lemma'}
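The failure mode seems consistent with that warning: presumably, because `SimpleTokenizer` skips the `lemma` annotator, `ex['qlemma']` ends up as `None`, which DrQA's `vectorize` then tries to iterate when the loaded model was trained with `use_lemma`. A minimal sketch of that situation (the variable names here are illustrative, not DrQA's actual code):

```python
# Minimal reproduction of the failure mode: iterating None in a set
# comprehension, as vectorize() does with ex['qlemma'] when the
# tokenizer skipped the 'lemma' annotator.
qlemma = None  # SimpleTokenizer produced no lemma annotations

try:
    q_lemma = {w for w in qlemma}
except TypeError as e:
    print(e)  # 'NoneType' object is not iterable
```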

Switching to the corenlp tokenizer and re-running `python3 setup.py install` for the change to take effect results in the following output for the same query:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/macaw-0.1-py3.6.egg/macaw/core/input_handler/actions.py", line 104, in run_action
    return_dict[action] = func_timeout(params['timeout'], action_func, args=[conv_list, params])
  File "/usr/local/lib/python3.6/dist-packages/func_timeout/dafunc.py", line 108, in func_timeout
    raise_exception(exception)
  File "/usr/local/lib/python3.6/dist-packages/func_timeout/py3_raise.py", line 7, in raise_exception
    raise exception[0] from None
  File "/usr/local/lib/python3.6/dist-packages/macaw-0.1-py3.6.egg/macaw/core/input_handler/actions.py", line 81, in run
    return params['actions']['qa'].get_results(conv_list, doc)
  File "/usr/local/lib/python3.6/dist-packages/macaw-0.1-py3.6.egg/macaw/core/mrc/drqa_mrc.py", line 76, in get_results
    predictions = self.predictor.predict(doc, q, None, self.params['qa_results_requested'])
  File "/root/DrQA/drqa/reader/predictor.py", line 88, in predict
    results = self.predict_batch([(document, question, candidates,)], top_n)
  File "/root/DrQA/drqa/reader/predictor.py", line 107, in predict_batch
    q_tokens = list(map(self.tokenizer.tokenize, questions))
  File "/root/DrQA/drqa/tokenizers/corenlp_tokenizer.py", line 96, in tokenize
    self.corenlp.expect_exact('NLP>', searchwindowsize=100)
  File "/usr/local/lib/python3.6/dist-packages/pexpect/spawnbase.py", line 390, in expect_exact
    return exp.expect_loop(timeout)
  File "/usr/local/lib/python3.6/dist-packages/pexpect/expect.py", line 99, in expect_loop
    incoming = spawn.read_nonblocking(spawn.maxread, timeout)
  File "/usr/local/lib/python3.6/dist-packages/pexpect/pty_spawn.py", line 437, in read_nonblocking
    if not self.isalive():
  File "/usr/local/lib/python3.6/dist-packages/pexpect/pty_spawn.py", line 662, in isalive
    alive = ptyproc.isalive()
  File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.6/dist-packages/pexpect/pty_spawn.py", line 23, in _wrap_ptyprocess_err
    raise ExceptionPexpect(*e.args)
pexpect.exceptions.ExceptionPexpect: isalive() encountered condition where "terminated" is 0, but there was no child process. Did someone else call waitpid() on our process?

If it is any help, I am able to use the DrQA interactive demo with the Stanford CoreNLP tokenizer without errors.

Aside from these errors, perhaps I have misunderstood the capabilities or scope of the `live_main.py` demo? Would it be capable of an interaction similar to the one shown in Figure 1 (b) in the paper?

Thanks in advance for all assistance!

hamed-zamani commented 4 years ago

Hello,

Macaw runs multiple actions concurrently, each with a timeout. You can increase that timeout value in the param dict in `live_main.py`. The first error you see occurs because DrQA could not generate its response within the specified interval, so its process was killed. The user still sees a response, however: the result list of three documents printed after "THE RESPONSE STARTS". All in all, this is not an error; it is just an exception raised by the DrQA model because its process was killed on timeout.
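For reference, raising the timeout might look like the sketch below. Only the `'timeout'` key is confirmed by the traceback (`func_timeout(params['timeout'], ...)`); the other entries and values are illustrative placeholders, not Macaw's actual defaults:

```python
# Hypothetical excerpt of the param dict in live_main.py.
# 'timeout' is the only key confirmed by the traceback; the rest
# of this dict is a placeholder for whatever else live_main.py sets.
params = {
    'mode': 'live',
    'timeout': 60,  # seconds each action may run before its process is killed
}
```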

About the second error: I don't recommend changing the tokenizer to CoreNLP. It has some problems with Python multiprocessing, as you can see in your traceback. The simple tokenizer should be sufficient in most cases.

Finally, Figure 1 was produced by Macaw, but some of the models used to produce those responses are not part of this open-source project. Therefore, Figure 1 can potentially be replicated with Macaw, but the currently released models may not reproduce exactly the same responses.

I hope this helps. Please let me know if you have more questions.

Best, Hamed Zamani