huggingface / transfer-learning-conv-ai

🦄 State-of-the-Art Conversational AI with Transfer Learning
MIT License

How are the distractors made in the dataset? #77

Open Cakeszy opened 4 years ago

Cakeszy commented 4 years ago

I want to use my own custom dataset with this project, but I don't understand how the distractors were made in the original dataset to get a grasp on how to do this. Are they randomly sampled from other conversations?

DamienLopez1 commented 4 years ago

I suggest looking at example_entry.py. The candidates are all possible replies to the prompt sentence. In train.py:

for j, candidate in enumerate(utterance["candidates"][-num_candidates:]):
    lm_labels = bool(j == num_candidates - 1)
    instance = build_input_from_segments(persona, history, candidate, tokenizer, lm_labels)

The last entry in candidates is taken as the gold reply; everything else in candidates is treated as a distractor.
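To make the selection logic concrete, here is a toy run of the loop above with made-up utterance texts (only the structure matters):

```python
# Toy illustration of the candidate-selection loop (utterance texts are made up).
utterance = {
    "candidates": [
        "i love to go hiking .",        # distractor
        "my favorite color is blue .",  # distractor
        "i work as a teacher .",        # gold reply, always the last entry
    ]
}
num_candidates = 3
for j, candidate in enumerate(utterance["candidates"][-num_candidates:]):
    lm_labels = bool(j == num_candidates - 1)  # True only for the gold reply
    print(j, lm_labels, candidate)
```

Running this prints lm_labels=False for the two distractors and lm_labels=True for the final candidate, which is why the gold reply must come last in the list.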

made-by-chris commented 4 years ago

I suggest looking at example_entry.py. The candidates are all possible replies to the prompt sentence. In train.py:

Could you please share which command you're running to train on example_entry.py?

I'm trying (without modifying example_entry.py) python ./train.py --dataset_path=example_entry.py, but I get errors like:

ERROR:ignite.engine.engine.Engine:Engine run is terminating due to exception: Target -100 is out of bounds..
DamienLopez1 commented 4 years ago

Sorry for the late response.

I did not really use the example_entry.py to run an example. As far as I am aware example_entry.py is just an example of the format in the JSON files.
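For reference, the JSON layout that example_entry.py mirrors looks roughly like the sketch below (field names follow the PersonaChat-style dataset used by this repo; all utterance texts here are made up):

```python
# Hedged sketch of one dataset entry in the PersonaChat-style JSON format.
entry = {
    "personality": [              # persona sentences for the speaker
        "i like to ski .",
        "i have two dogs .",
    ],
    "utterances": [               # one item per turn to be predicted
        {
            "history": ["hello , how are you today ?"],  # dialogue so far
            "candidates": [
                "distractor reply 1",      # wrong replies first
                "distractor reply 2",
                "the actual gold reply",   # the true reply is always LAST
            ],
        }
    ],
}
```

A custom dataset passed via --dataset_path would need to follow the same nesting, with the gold reply as the final element of each candidates list.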

If you want to see how the distractors are being selected, I suggest adding a print statement to this code snippet from train.py:

for j, candidate in enumerate(utterance["candidates"][-num_candidates:]):
    lm_labels = bool(j == num_candidates - 1)
    print(candidate)
    instance = build_input_from_segments(persona, history, candidate, tokenizer, lm_labels)

In my copy of the code it's at line 93.
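On the original question of how distractors could be produced for a custom dataset: one common approach is to sample replies from other conversations (this is a sketch under that assumption, not a description of how the original dataset was built; the "reply" field and add_candidates helper are hypothetical names):

```python
import random

def add_candidates(dialogues, num_distractors=19, seed=0):
    """Hypothetical helper: for each turn, sample distractor replies from the
    rest of the corpus, then append the gold reply LAST, which is the ordering
    train.py expects in the candidates list."""
    rng = random.Random(seed)
    # Pool of every reply in the corpus to draw distractors from.
    all_replies = [u["reply"] for d in dialogues for u in d["utterances"]]
    for d in dialogues:
        for u in d["utterances"]:
            pool = [r for r in all_replies if r != u["reply"]]
            distractors = rng.sample(pool, min(num_distractors, len(pool)))
            u["candidates"] = distractors + [u["reply"]]  # gold reply last
    return dialogues
```

The key invariant is only that the true reply ends up as the final element of candidates; how the distractors themselves are chosen (random sampling, retrieval, etc.) is up to you.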