salesforce / ctrl

Conditional Transformer Language Model for Controllable Generation
https://arxiv.org/abs/1909.05858
BSD 3-Clause "New" or "Revised" License

Using new top-level domain with Links control code #60

Closed (pgrandinetti closed this 4 years ago)

pgrandinetti commented 5 years ago

I want to fine-tune the model with data from a single domain (e.g. my-blog.com/article1, my-blog.com/article2, etc.) to produce text in the style of my-blog.com.

1) Can I use your training_utils/training.py with the "Links" control code? If so, how should I modify make_tf_records.py so that the training dataset includes all the different URLs?

2) What format must the input text_file have? Should I just append the contents from all URLs into the same txt file?

keskarnitish commented 4 years ago
  1. This is fine; you can also use a separate token like Blog and then have the URL as the first part of every new article.
  2. Yeah, concatenating is fine. Prepend the URL as above. What I'd recommend, though, is to create one TFRecord per file; the training script will pick up all TFRecords in the active folder.
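
The preparation step described above could be sketched roughly as follows. This is a minimal sketch and not part of the repository: the helper names (`build_training_text`, `write_article_files`), the `article_NNNN.txt` naming scheme, and the choice of a newline between the URL and the article body are all assumptions made for illustration. It only writes one text file per article with the URL first; each file would then be converted separately with make_tf_records.py.

```python
from pathlib import Path
from typing import Dict, List


def build_training_text(url: str, body: str) -> str:
    """Return the article text with its source URL as the first line.

    Assumption: a single newline separates the URL from the body.
    """
    return f"{url.strip()}\n{body.strip()}\n"


def write_article_files(articles: Dict[str, str], out_dir: Path) -> List[Path]:
    """Write one text file per article (hypothetical naming scheme).

    `articles` maps each article URL to its plain-text content.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, (url, body) in enumerate(articles.items()):
        path = out_dir / f"article_{index:04d}.txt"
        path.write_text(build_training_text(url, body), encoding="utf-8")
        paths.append(path)
    return paths
```

For example, `write_article_files({"my-blog.com/article1": "First post."}, Path("blog_text"))` would create `blog_text/article_0000.txt`, whose first line is the URL. Keeping one file per article matches the suggestion of one TFRecord per file; concatenating the same strings into a single file would also work.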