Hi @anuragkumar95!
You can run the BERT-LSTM model as described in the Schema2QA (CIKM) paper with the following command:
genienlp train \
--data datadir --train_tasks_names almond --save model --no_commit --skip_cache --exist_ok \
--train_iterations 80000 --log_every 100 --save_every 1000 --val_every 1000 --preserve_case \
--dimension 768 --transformer_hidden 768 --trainable_decoder_embeddings 50 --encoder_embeddings=bert-base-uncased \
--decoder_embeddings= --seq2seq_encoder=Identity --rnn_layers 1 --transformer_heads 12 --transformer_layers 0 \
--rnn_zero_state=average --train_encoder_embeddings --transformer_lr_multiply 0.1 --train_batch_tokens 9000 \
--append_question_to_context_too --val_batch_size 256
You should format the data in a directory called `datadir/almond`, which should contain two files, `train.tsv` and `eval.tsv`. Each file should be tab-separated, with three columns: ID, sentence, and target code. The sentence and the target code should be tokenized ahead of time, with each token separated by a space. The sentence, and all parts of the code enclosed in double quotes (a lone `"` token), will be further tokenized by BERT subword tokenization.
(More details at https://wiki.almond.stanford.edu/nlp/dataset)
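For illustration, a line of `train.tsv` might look like the following, where `<TAB>` stands for an actual tab character; the sentence and target program here are made up for the example, not taken from a real dataset:

id1 <TAB> show me restaurants with " good food " <TAB> now => @com.yelp.restaurant filter param:reviews =~ " good food " => notify

The lone `"` tokens delimit the free-text span that BERT subword tokenization will further split.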
Thanks for the detailed explanation. Could you, though, explain a bit more about the flag `--train_tasks_names almond`? What does this flag do? Does it mean the training data has to follow the Almond format?
Yes. You can choose any of the other tasks from decaNLP instead, in which case the data should be in those other formats. In fact, for most other tasks the data will be downloaded automatically if missing. The "context" portion of the data will be mapped to the encoder of the BERT-LSTM model and the "question" portion will be ignored. This is appropriate for translation tasks due to how decaNLP encodes translation, but not for contextual question answering tasks.
We recommend the Almond task when doing semantic parsing though, because it handles the tokenization of sentence and code properly, while other tasks will use a generic natural language tokenizer.
Hi @gcampax. I am running into the following error when running the command:
Traceback (most recent call last):
File "/home/anurag/.local/bin/genienlp", line 11, in <module>
sys.exit(main())
File "/home/anurag/.local/lib/python3.6/site-packages/genienlp/__main__.py", line 54, in main
subcommands[argv.subcommand][2](argv)
File "/home/anurag/.local/lib/python3.6/site-packages/genienlp/train.py", line 540, in main
prepare_data(args, logger)
File "/home/anurag/.local/lib/python3.6/site-packages/genienlp/train.py", line 87, in prepare_data
split = task.get_splits(args.data, lower=args.lower, **kwargs)
File "/home/anurag/.local/lib/python3.6/site-packages/genienlp/tasks/almond/__init__.py", line 145, in get_splits
return AlmondDataset.return_splits(path=os.path.join(root, 'almond'), make_example=self._make_example, **kwargs)
File "/home/anurag/.local/lib/python3.6/site-packages/genienlp/tasks/almond/__init__.py", line 95, in return_splits
train_data = None if train is None else cls(os.path.join(path, train + '.tsv'), **kwargs)
File "/home/anurag/.local/lib/python3.6/site-packages/genienlp/tasks/almond/__init__.py", line 72, in __init__
examples.append(make_example(parts, dir_name))
File "/home/anurag/.local/lib/python3.6/site-packages/genienlp/tasks/almond/__init__.py", line 208, in _make_example
tokenize=self.tokenize, lower=False)
File "/home/anurag/.local/lib/python3.6/site-packages/genienlp/data_utils/example.py", line 57, in from_raw
words, mask = tokenize(arg.rstrip('\n'), field_name=argname)
File "/home/anurag/.local/lib/python3.6/site-packages/genienlp/tasks/almond/__init__.py", line 185, in tokenize
mask = [not is_entity(token) and not is_device(token) for token in tokens]
File "/home/anurag/.local/lib/python3.6/site-packages/genienlp/tasks/almond/__init__.py", line 185, in <listcomp>
mask = [not is_entity(token) and not is_device(token) for token in tokens]
File "/home/anurag/.local/lib/python3.6/site-packages/genienlp/tasks/almond/__init__.py", line 112, in is_entity
return token[0].isupper()
IndexError: string index out of range
I made sure the train and eval files have ID, sentence, and target code separated by tabs, with the sentences themselves tokenized with single spaces between tokens. Any suggestions for fixing the above error?
That error indicates an empty token. This is typically caused by a double space, or a space at the beginning or end of the sentence or program.
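If it helps, here is a minimal sketch for locating the offending lines; the file path and the three-column layout are assumptions based on the format described above:

# A minimal check for empty tokens caused by double spaces or
# leading/trailing whitespace; the path is an assumption.
with open('datadir/almond/train.tsv') as f:
    for n, line in enumerate(f, 1):
        parts = line.rstrip('\n').split('\t')
        if len(parts) != 3:
            print(f'line {n}: expected 3 tab-separated fields, got {len(parts)}')
            continue
        _id, sentence, code = parts
        for name, field in (('sentence', sentence), ('code', code)):
            if field != field.strip() or '  ' in field:
                print(f'line {n}: stray whitespace in the {name} field')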
How would I be able to evaluate the saved models? I tried running predict:
genienlp predict --tasks almond --data "path/to/datadir" --saved_models "model/path" --path "path/to/model/config/file" --eval_dir "random/path" --evaluate valid --overwrite
It runs; however, it does not create any files with predictions. I would like to run the best saved model, predict over a testing/validation dataset, and create such a file so that I can evaluate the model's performance over the custom dataset.
NOTE: The command above does not work without the --overwrite flag. Is it necessary? What does it overwrite?
How should I go about it?
`genienlp predict` will create a file called `valid/almond.tsv` inside the directory you specify as `eval_dir`. The format is one line per example, with ID and output, separated by a tab.
You should pass a model directory as `--path`, and the command will load the best model automatically (stored as `best.pth`). The input data follows the same format as the training data.
The `--overwrite` flag tells it to overwrite an existing validation directory.
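For example, with two examples in the eval set, `valid/almond.tsv` would contain two lines of the form below, where `<TAB>` is a tab character and the predicted code is a placeholder:

id1 <TAB> <predicted code for example id1>
id2 <TAB> <predicted code for example id2>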
It does that, but at the moment almond.tsv contains only around 10 outputs, whereas the eval file is 100k lines long. Any possible bugs there?
You might be specifying the wrong data directory? The test files in this repo are about 10 examples...
OK. Also, how should I interpret the decaScore or Almond score given at the end? What would be a good score range? Is it similar to the percentage of accurate sentences?
For the almond task, the decascore is exact match accuracy, and it's a percentage (0-100).
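If you want to recompute that number by hand over a custom dataset, here is a minimal sketch; the file paths are assumptions based on the commands and formats above:

# Load the gold programs from the eval data (ID, sentence, target code).
gold = {}
with open('datadir/almond/eval.tsv') as f:
    for line in f:
        ex_id, _sentence, target = line.rstrip('\n').split('\t')
        gold[ex_id] = target

# Compare against the predictions written by genienlp predict (ID, output).
correct = total = 0
with open('eval_dir/valid/almond.tsv') as f:
    for line in f:
        ex_id, prediction = line.rstrip('\n').split('\t')
        total += 1
        correct += prediction == gold[ex_id]

# Exact match accuracy as a percentage (0-100), like the decascore.
print(f'exact match: {100 * correct / total:.2f}%')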
@gcampax How would I use genienlp to predict over sentences in real time? Is there an API that I can use? I don't want it to load the model every time I run predict. I would like to run prediction over multiple single sentences instead of a list of sentences in a file.
You should use `genienlp server` instead. It communicates over TCP or over standard input/output. The protocol is JSON-based, exchanging newline-separated records. Each request record contains a request `id`, a `task` name, a `context` and a `question`. The response record will contain the `id` and the `answer`.
You can also do batch prediction by passing `id`, `task` and `instances`. The latter is an array of objects with `context` and `question`. If you do batch prediction, you get a record back with `id` and `instances`, one object with `answer` per input instance.
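For example, a single-instance exchange consists of one request record and one response record, each on its own line (the answer shown here is a placeholder, not real model output):

{"id": "1", "task": "almond", "context": "revenue of customer Amazon during 2019", "question": "translate from english to thingtalk"}
{"id": "1", "answer": "<predicted thingtalk code>"}

A batch exchange follows the same pattern, with one answer per input instance:

{"id": "2", "task": "almond", "instances": [{"context": "...", "question": "..."}, {"context": "...", "question": "..."}]}
{"id": "2", "instances": [{"answer": "..."}, {"answer": "..."}]}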
@gcampax I am running into some weird issues. I am trying to predict with the trained model using two commands:
genienlp predict --tasks almond --data datadir --path model --eval_dir result --evaluate test --skip_cache --overwrite
and
genienlp server --path model --stdin
I get pretty accurate results when I run predict, but the same input gives very different, inaccurate predictions when I try to predict through the server API. Is there any reason behind this?
I have trained the BERT-LSTM model with the option "include question in the context". Could this be causing the difference?
You probably need to ensure you pass the right `task` in the server API, and omit the question to use the default question, which is "translate from english to thingtalk".
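As a sketch, driving the stdin mode from Python might look like this; the model path is an assumption, and this assumes the server writes only JSON records to standard output:

import json
import subprocess

# Start the server once; it keeps the model loaded between requests.
proc = subprocess.Popen(['genienlp', 'server', '--path', 'model', '--stdin'],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                        universal_newlines=True)

# "question" is omitted, so the server falls back to the default question.
request = {'id': '1', 'task': 'almond',
           'context': 'revenue of customer Amazon during 2019'}
proc.stdin.write(json.dumps(request) + '\n')
proc.stdin.flush()

response = json.loads(proc.stdout.readline())
print(response['answer'])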
Solved my problem. Thanks!
@gcampax The server module sets up a server listening on TCP port 8401. But what host should I connect to in order to send the request? How can I expose this server to a hostname?
The server command listens on all network interfaces, so you should connect to any routable IP address of the machine (or container) running the command. This is no different than any other process exposing an open port.
The specific IP address or hostname depends on how your network is configured (routing, firewall, DNS). You should contact your sysadmin or someone who knows your infrastructure if you're not familiar with these, especially if you're working in a cloud or university managed setup (as those can get hairy).
@gcampax Let's say I am hosting the model on a server with the IP 127.0.0.1. I write a client method:
async def send_input(self, message, loop):
    reader, writer = await asyncio.open_connection(host='127.0.0.1',
                                                   port=8401,
                                                   loop=loop)
    writer.write(message.encode())
    response = await reader.read(1000)
    print(response.decode())

def main():
    message = '{"id":"1","task":"almond","context":"revenue of customer Amazon during 2019","question":"translate from english to thingtalk"}'
    loop = asyncio.get_event_loop()
    loop.run_until_complete(send_input(message, loop))
    loop.close()
This should run and let me send a message to port 8401. Instead, I get the following error.
client.py:14: DeprecationWarning: The loop argument is deprecated since Python 3.8, and scheduled for removal in Python 3.10.
reader, writer = await asyncio.open_connection(host='127.0.0.1',
Traceback (most recent call last):
File "client.py", line 30, in <module>
main()
File "client.py", line 26, in main
loop.run_until_complete(client.send_input(message, loop))
File "/home/anurag/anaconda3/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
return future.result()
File "client.py", line 14, in send_input
reader, writer = await asyncio.open_connection(host='127.0.0.1',
File "/home/anurag/anaconda3/lib/python3.8/asyncio/streams.py", line 52, in open_connection
transport, _ = await loop.create_connection(
File "/home/anurag/anaconda3/lib/python3.8/asyncio/base_events.py", line 1025, in create_connection
raise exceptions[0]
File "/home/anurag/anaconda3/lib/python3.8/asyncio/base_events.py", line 1010, in create_connection
sock = await self._connect_sock(
File "/home/anurag/anaconda3/lib/python3.8/asyncio/base_events.py", line 924, in _connect_sock
await self.sock_connect(sock, address)
File "/home/anurag/anaconda3/lib/python3.8/asyncio/selector_events.py", line 494, in sock_connect
return await fut
File "/home/anurag/anaconda3/lib/python3.8/asyncio/selector_events.py", line 526, in _sock_connect_cb
raise OSError(err, f'Connect call failed {address}')
OSError: [Errno 113] Connect call failed ('127.0.0.1', 8401)
Any ideas on solving this?
I can't reproduce the connect error: if `genienlp server` is running, the port is open. You should check with `netstat -ntlp` (e.g. `netstat -ntlp | grep 8401`) whether the process is actually listening.
As for your test case, there were a couple of fixes needed:
- the message must be terminated with a `\n`, or the server will wait until more data is sent
- use `reader.readline()` instead of `reader.read(1000)`, otherwise the call will block forever waiting for 1000 bytes of data
This one should work:
import asyncio

async def send_input(message):
    reader, writer = await asyncio.open_connection(host='127.0.0.1', port=8401)
    writer.write(message.encode() + b'\n')
    response = await reader.readline()
    print(response.decode())

def main():
    message = '{"id":"1","task":"almond","context":"revenue of customer Amazon during 2019","question":"translate from english to thingtalk"}'
    loop = asyncio.get_event_loop()
    loop.run_until_complete(send_input(message))
    loop.close()

main()
How can I run the BERT-LSTM model with an output language that does not follow the VAPL described in the "Almond" paper?