mikeizbicki / modulus-magnus-linguae


Abstract feedback? #56

Open alysawyer opened 1 year ago

alysawyer commented 1 year ago

Hi! We just remembered that we didn't discuss the abstract in the meeting today. If you have any comments on the abstract that we could implement before 3pm, please let us know!

mikeizbicki commented 1 year ago

Recall that there are 4 questions that need to be answered in the abstract:

  1. What are other people doing?
  2. Why is it bad?
  3. What do we do differently?
  4. Why is it better?

You partially address 3+4, but don't address 1+2 at all. A better example is:

> The best language models are trained on more than 1 trillion tokens of English language text. Most languages, however, do not have such large training datasets available. We investigate an extremely data-limited regime where only 80,000 tokens of text are available in the form of a high-quality Latin textbook. We also introduce a new dataset for evaluating Latin models that contains over 5000 high-quality, human-annotated questions and answers that were originally designed to assess human learning. We find that the small, high-quality textbook data is sufficient to improve the performance of language models on this new dataset.

This still isn't perfect because I haven't tied any of the changes we talked about earlier into this abstract. For example:

  1. If the phrase *textbook training* is something that you decide to keep highlighting as a main contribution of the paper, then it should ideally be introduced explicitly in the abstract.
  2. There is no mention of the data contamination issues.

Feel free to try to add those into the abstract I wrote, or just use it verbatim. (It might also need to be shrunk slightly for space reasons.)

alysawyer commented 1 year ago

Thank you!