stanford-oval / genienlp

GenieNLP: A versatile codebase for any NLP task

How can I run the BERT-LSTM decoder model? #45

Closed anuragkumar95 closed 3 years ago

anuragkumar95 commented 4 years ago

How can I run the BERT-LSTM model with an output language that differs from the VAPL described in the "Almond" paper?

gcampax commented 4 years ago

Hi @anuragkumar95!

You can run the BERT-LSTM model as described in the Schema2QA (CIKM) paper with the following command:

genienlp train \
--data datadir --train_tasks almond --save model --no_commit --skip_cache --exist_ok \
--train_iterations 80000 --log_every 100 --save_every 1000 --val_every 1000 --preserve_case \
--dimension 768 --transformer_hidden 768 --trainable_decoder_embeddings 50 --encoder_embeddings=bert-base-uncased \
--decoder_embeddings= --seq2seq_encoder=Identity --rnn_layers 1 --transformer_heads 12 --transformer_layers 0 \
--rnn_zero_state=average --train_encoder_embeddings --transformer_lr_multiply 0.1 --train_batch_tokens 9000 \
--append_question_to_context_too --val_batch_size 256

You should format the data in a directory called datadir/almond, which should contain two files, train.tsv and eval.tsv. Each file should be tab-separated, with three columns: ID, sentence, and target code. The sentence and the target code should be tokenized ahead of time, with tokens separated by single spaces. The sentence, and any parts of the code enclosed in double quotes (a lone " token), will be further tokenized by BERT subword tokenization (more details at https://wiki.almond.stanford.edu/nlp/dataset).
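
For illustration, here is a minimal sketch of generating such a file; the IDs, sentences, and target programs below are made-up placeholders, not real ThingTalk:

# Minimal sketch: write a toy datadir/almond/train.tsv in the expected format
# (ID <tab> tokenized sentence <tab> tokenized target code, one example per line).
# The examples are made-up placeholders; the second one shows the lone " token
# used to delimit quoted spans in both the sentence and the code.
import os

examples = [
    ('S1', 'get a cat picture', 'now => @com.thecatapi.get => notify'),
    ('S2', 'show me " funny " gifs', 'now => @com.giphy.get param:query = " funny " => notify'),
]

os.makedirs('datadir/almond', exist_ok=True)
with open('datadir/almond/train.tsv', 'w') as f:
    for ex_id, sentence, code in examples:
        f.write('\t'.join((ex_id, sentence, code)) + '\n')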

anuragkumar95 commented 4 years ago

Thanks for the detailed explanation. Could you, though, explain a bit more about the flag --train_tasks almond? What does it do? Does it mean the training data has to follow the almond format?

gcampax commented 4 years ago

Yes. You can choose any of the other tasks from decaNLP instead, in which case the data should be in those other formats. In fact, for most other tasks the data will be downloaded automatically if missing. The "context" portion of the data will be mapped to the encoder of the BERT-LSTM model and the "question" portion will be ignored. This is appropriate for translation tasks due to how decaNLP encodes translation, but not for contextual question answering tasks.

We recommend the Almond task for semantic parsing, though, because it handles the tokenization of sentences and code properly, while the other tasks use a generic natural-language tokenizer.

anuragkumar95 commented 4 years ago

Hi @gcampax. I am running into the following error upon running the command:

Traceback (most recent call last):
  File "/home/anurag/.local/bin/genienlp", line 11, in <module>
    sys.exit(main())
  File "/home/anurag/.local/lib/python3.6/site-packages/genienlp/__main__.py", line 54, in main
    subcommands[argv.subcommand][2](argv)
  File "/home/anurag/.local/lib/python3.6/site-packages/genienlp/train.py", line 540, in main
    prepare_data(args, logger)
  File "/home/anurag/.local/lib/python3.6/site-packages/genienlp/train.py", line 87, in prepare_data
    split = task.get_splits(args.data, lower=args.lower, **kwargs)
  File "/home/anurag/.local/lib/python3.6/site-packages/genienlp/tasks/almond/__init__.py", line 145, in get_splits
    return AlmondDataset.return_splits(path=os.path.join(root, 'almond'), make_example=self._make_example, **kwargs)
  File "/home/anurag/.local/lib/python3.6/site-packages/genienlp/tasks/almond/__init__.py", line 95, in return_splits
    train_data = None if train is None else cls(os.path.join(path, train + '.tsv'), **kwargs)
  File "/home/anurag/.local/lib/python3.6/site-packages/genienlp/tasks/almond/__init__.py", line 72, in __init__
    examples.append(make_example(parts, dir_name))
  File "/home/anurag/.local/lib/python3.6/site-packages/genienlp/tasks/almond/__init__.py", line 208, in _make_example
    tokenize=self.tokenize, lower=False)
  File "/home/anurag/.local/lib/python3.6/site-packages/genienlp/data_utils/example.py", line 57, in from_raw
    words, mask = tokenize(arg.rstrip('\n'), field_name=argname)
  File "/home/anurag/.local/lib/python3.6/site-packages/genienlp/tasks/almond/__init__.py", line 185, in tokenize
    mask = [not is_entity(token) and not is_device(token) for token in tokens]
  File "/home/anurag/.local/lib/python3.6/site-packages/genienlp/tasks/almond/__init__.py", line 185, in <listcomp>
    mask = [not is_entity(token) and not is_device(token) for token in tokens]
  File "/home/anurag/.local/lib/python3.6/site-packages/genienlp/tasks/almond/__init__.py", line 112, in is_entity
    return token[0].isupper()
IndexError: string index out of range

I made sure the train and eval files have ID, sentence, and target code separated by tabs, with the sentences themselves tokenized by spaces. Any suggestions for fixing the above error?

gcampax commented 4 years ago

That error indicates an empty token. This is typically caused by a double space, or a space at the beginning or end of the sentence or program.
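
If it helps, here is a quick (hypothetical) sanity check over a TSV file, flagging rows that would produce an empty token:

# Hypothetical sanity check: flag rows of a train/eval TSV whose sentence or
# code field would produce an empty token when split on spaces.
import sys

with open(sys.argv[1]) as f:
    for lineno, line in enumerate(f, 1):
        parts = line.rstrip('\n').split('\t')
        if len(parts) != 3:
            print(f'line {lineno}: expected 3 tab-separated fields, got {len(parts)}')
            continue
        for name, field in zip(('sentence', 'code'), parts[1:]):
            if field != field.strip() or '  ' in field:
                print(f'line {lineno}: leading/trailing or double space in {name}')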

anuragkumar95 commented 4 years ago

How would I be able to evaluate the saved models? I tried running predict,

genienlp predict  --tasks almond --data "path/to/datadir" --saved_models "model/path" --path "path/to/model/config/file" --eval_dir "random/path" --evaluate valid --overwrite

It runs; however, it does not create any prediction files. I would like to load the best saved model and predict over a test/validation dataset, writing the predictions to a file so that I can evaluate the model's performance on the custom dataset.

NOTE: The command above does not work without the --overwrite flag. Is it necessary? What does it overwrite?

How should I go about it?

gcampax commented 4 years ago

genienlp predict will create a file called valid/almond.tsv inside the directory you specify as eval_dir. The format is one line per example, with ID and output, separated by a tab. You should pass a model directory as --path, and the command will load the best model automatically (stored as best.pth). The input data follows the same format as the training data. The overwrite flag tells it to overwrite an existing validation directory.
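
For example, here is a small (hypothetical) script to pair those predictions with the gold programs from the eval file; the directory names are assumptions based on the commands above:

# Hypothetical post-processing: pair predictions ("ID <tab> output") written by
# genienlp predict with the gold programs from the eval data, matched by ID.
def read_predictions(path):
    with open(path) as f:
        return dict(line.rstrip('\n').split('\t', 1) for line in f)

pred = read_predictions('eval_dir/valid/almond.tsv')  # id -> predicted program

with open('datadir/almond/eval.tsv') as f:
    for line in f:
        ex_id, sentence, gold_code = line.rstrip('\n').split('\t')
        print(ex_id, 'MATCH' if pred.get(ex_id) == gold_code else 'MISMATCH')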

anuragkumar95 commented 4 years ago

It does, but at the moment almond.tsv contains only around 10 outputs, whereas the eval file is 100k lines long. Could there be a bug?

gcampax commented 4 years ago

Might you be specifying the wrong data directory? The test files in this repo contain about 10 examples...

anuragkumar95 commented 4 years ago

OK. Also, how should I interpret the decaScore or almond score given at the end? What would be a good score range? Is it similar to the percentage of exactly correct sentences?

gcampax commented 4 years ago

For the almond task, the decaScore is exact match accuracy, expressed as a percentage (0-100).
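
For example, if 820 of 1,000 eval examples produce exactly the gold program, token for token, the reported score would be 82.0.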

anuragkumar95 commented 4 years ago

@gcampax How would I use genienlp to predict over sentences in real time? Is there an API that I can use? I don't want it to load the model every time I run predict. I would like to run the predict function over multiple individual sentences instead of a list of sentences in a file.

gcampax commented 4 years ago

You should use genienlp server instead. It communicates over TCP or over standard input/output. The protocol is JSON-based, exchanging newline-separated records. Each request record contains a request id, task name, context and question. The response record will contain the id and the answer.

You can also do batch prediction by passing id, task and instances, where instances is an array of objects with context and question. For batch prediction, you get back a record with id and instances, containing one object with an answer per input instance.
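
To illustrate, the records could look like this; the contexts are made-up placeholders, and the response shapes are sketched from the field names above:

import json

# Single request record (one JSON object per line over TCP or stdin).
single = {'id': '1', 'task': 'almond',
          'context': 'get a cat picture',
          'question': 'translate from english to thingtalk'}
print(json.dumps(single))
# expected response record: {"id": "1", "answer": "<predicted program>"}

# Batch request record: one id/task plus a list of instances.
batch = {'id': '2', 'task': 'almond',
         'instances': [
             {'context': 'get a cat picture',
              'question': 'translate from english to thingtalk'},
             {'context': 'play some music',
              'question': 'translate from english to thingtalk'},
         ]}
print(json.dumps(batch))
# expected response record: {"id": "2", "instances": [{"answer": ...}, {"answer": ...}]}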

anuragkumar95 commented 3 years ago

@gcampax I am running into some weird issues. I am trying to predict with the trained model using two commands:

genienlp predict --tasks almond --data datadir  --path model --eval_dir result --evaluate test --skip_cache --overwrite

and

genienlp server --path model --stdin

I get pretty accurate results when I use predict, but the same inputs give very different, inaccurate predictions when I go through the server API. Is there any reason for this?

I have trained the BERT-LSTM model with the "include question in the context" option. Could this cause the difference?

gcampax commented 3 years ago

You probably need to ensure you pass the right "task" in the server API, and omit the question to use the default question, which is "translate from english to thingtalk".
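
For example, a record like {"id": "1", "task": "almond", "context": "get a cat picture"} (no "question" field; the context is made up here) should fall back to the default question.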

anuragkumar95 commented 3 years ago

Solved my problem. Thanks!

anuragkumar95 commented 3 years ago

@gcampax The server module sets up a server listening on TCP port 8401. But what host should I connect to in order to send the request? How can I expose this server to a hostname?

gcampax commented 3 years ago

The server command listens on all network interfaces, so you should connect to any routable IP address of the machine (or container) running the command. This is no different than any other process exposing an open port.

The specific IP address or hostname depends on how your network is configured (routing, firewall, DNS). You should contact your sysadmin or someone who knows your infrastructure if you're not familiar with these, especially if you're working in a cloud or university managed setup (as those can get hairy).

anuragkumar95 commented 3 years ago

@gcampax Let's say I am hosting the model on a server with the IP 127.0.0.1. I wrote a client method:

async def send_input(self, message, loop):
        reader, writer = await asyncio.open_connection(host='127.0.0.1', 
                                                        port=8401,
                                                        loop=loop)
        writer.write(message.encode())

        response = await reader.read(1000)
        print(response.decode())

def main():
    message = '{"id":"1","task":"almond","context":"revenue of customer Amazon during 2019","question":"translate from english to thingtalk"}'
    loop = asyncio.get_event_loop()
    loop.run_until_complete(send_input(message, loop))
    loop.close()

This should run, and I was hoping it would let me send a message to port 8401. Instead I get the following error:

client.py:14: DeprecationWarning: The loop argument is deprecated since Python 3.8, and scheduled for removal in Python 3.10.
  reader, writer = await asyncio.open_connection(host='127.0.0.1',
Traceback (most recent call last):
  File "client.py", line 30, in <module>
    main()
  File "client.py", line 26, in main
    loop.run_until_complete(client.send_input(message, loop))
  File "/home/anurag/anaconda3/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "client.py", line 14, in send_input
    reader, writer = await asyncio.open_connection(host='127.0.0.1', 
  File "/home/anurag/anaconda3/lib/python3.8/asyncio/streams.py", line 52, in open_connection
    transport, _ = await loop.create_connection(
  File "/home/anurag/anaconda3/lib/python3.8/asyncio/base_events.py", line 1025, in create_connection
    raise exceptions[0]
  File "/home/anurag/anaconda3/lib/python3.8/asyncio/base_events.py", line 1010, in create_connection
    sock = await self._connect_sock(
  File "/home/anurag/anaconda3/lib/python3.8/asyncio/base_events.py", line 924, in _connect_sock
    await self.sock_connect(sock, address)
  File "/home/anurag/anaconda3/lib/python3.8/asyncio/selector_events.py", line 494, in sock_connect
    return await fut
  File "/home/anurag/anaconda3/lib/python3.8/asyncio/selector_events.py", line 526, in _sock_connect_cb
    raise OSError(err, f'Connect call failed {address}')
OSError: [Errno 113] Connect call failed ('127.0.0.1', 8401)

Any ideas on solving this?

gcampax commented 3 years ago

I can't reproduce the connect error: if genienlp server is running, the port is open. You should check with netstat -ntlp whether the process is running and listening.

As for your test case, a couple of fixes were needed:

import asyncio

async def send_input(message):
    reader, writer = await asyncio.open_connection(host='127.0.0.1', port=8401)
    # records are newline-separated: terminate the request with '\n'
    writer.write(message.encode() + b'\n')

    # read one newline-terminated response record instead of a fixed byte count
    response = await reader.readline()
    print(response.decode())

def main():
    message = '{"id":"1","task":"almond","context":"revenue of customer Amazon during 2019","question":"translate from english to thingtalk"}'
    # no `loop` argument: it is deprecated and open_connection does not need it
    loop = asyncio.get_event_loop()
    loop.run_until_complete(send_input(message))
    loop.close()

main()