Thanks for the opportunity to review this tutorial, which I found topical and well-explained. Please find below my comments and a few suggestions for improvement.
The author addresses a consistent model reader throughout the lesson, somewhere between absolute beginner and advanced beginner, not just in using GPT-2 but also in coding with Python and working at the command line. While this is certainly helpful, I think it is unlikely in reality that absolute beginners would use this tutorial, which requires, for example, being able to create a virtual environment or to manage the different parameters comfortably. My suggestion would be to choose a more intermediate model user, to avoid frustrating beginners and absolute beginners and, in general, to keep the tutorial more honest.
The author does a good job of explaining concepts and terms; however, a few notions are taken for granted (e.g., language models) and are not really explained. Although many people may be more or less familiar with what a language model is, I believe many of the assumptions that go into the making of these models often remain unclear, and are in fact overshadowed by the idea that because these models are truly huge, they must be very representative and therefore very reliable. I appreciated the ethical considerations section, which indeed highlights the limitations of this methodology and these models, including the biases baked into them. I think, however, that the section would greatly benefit from further details about the implications, particularly for historians or researchers working with historical texts or in the humanities. For example, what were the original intentions behind the creation of these models? What are the risks of using them for humanistic enquiry?
The data section could be expanded a bit more, particularly because the author hints at the possibility of using one's own data by following the tutorial. I did use the tutorial with my own data, and here's what I found:
As it stands, the model expects a single .txt file. This means that those whose data are in a different format need to prepare their data to fit it. This may not be a problem for a Python user who has worked with language data before, but that background is somewhat assumed. My suggestion is to include a few lines explaining this clearly, and to point to resources (even within PH) that could help users get their data into the shape the model expects.
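To make this concrete, even a short snippet along these lines (the folder and file names here are placeholders of my own, not anything from the lesson) would show readers how to get from a folder of files to the single input the model expects:

```python
from pathlib import Path

# Merge a folder of plain-text files into the single .txt file the
# tutorial expects as training data. "my_corpus" and "training_data.txt"
# are illustrative names only.
corpus_dir = Path("my_corpus")

with open("training_data.txt", "w", encoding="utf-8") as out:
    for txt_file in sorted(corpus_dir.glob("*.txt")):
        out.write(txt_file.read_text(encoding="utf-8"))
        out.write("\n")
```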
I used both Google Colab and Jupyter, as both options are given to the user. Mounting a drive and accessing data and files in Colab is not particularly intuitive and requires the user to find additional/external resources, including tutorials and forums, to work out how to do it. Perhaps a word of warning could be included here.
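For instance, a short cell like this (the standard Colab drive mount; anything after the mount point is an illustrative path) would save readers a trip to the forums:

```python
# Colab only: make files stored on Google Drive visible to the notebook.
# Running this prompts for authorisation in the browser.
from google.colab import drive
drive.mount('/content/drive')

# After mounting, data is reachable under /content/drive/MyDrive/,
# e.g. '/content/drive/MyDrive/training_data.txt' (placeholder path).
```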
In Jupyter, I found the process of setting up the environment and downloading the packages very lengthy; again, a word of warning might help the user allocate enough time to run the tutorial.
I tried tweaking the parameters to achieve better performance. When changing n_steps and batch_size I ran into an indexing error. This is not mentioned at all in the tutorial, where changing the parameters is presented as unproblematic. Perhaps mentioning that such changes may result in errors, and explaining why and how to resolve them (as is done for the memory problem), would help.
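To illustrate, and purely as a sketch on my part (I am assuming the aitextgen wrapper here, and the parameter names num_steps and batch_size may not match the tutorial's final code), the lesson could flag which values are safe to change and what the typical failure mode looks like:

```python
# Illustrative fine-tuning call; the library (aitextgen) and parameter
# names are my assumptions, not necessarily the tutorial's own code.
from aitextgen import aitextgen

ai = aitextgen(tf_gpt2="124M")  # the small 124M GPT-2 checkpoint

ai.train(
    "training_data.txt",  # placeholder file name
    num_steps=500,         # how long to fine-tune on the data
    batch_size=1,          # larger values train faster but commonly cause GPU out-of-memory errors
)
```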
FYI: in Jupyter, even after installing all the packages successfully, I got the error 'CUDA is not installed'.
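A short diagnostic cell like the following (standard PyTorch calls, nothing from the tutorial itself) would let readers confirm their GPU setup before training, rather than discovering the problem halfway through:

```python
# Check whether PyTorch was installed with CUDA support and can see a GPU.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```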
In terms of payoff, the tutorial uses GPT-2 for language production, but it could be helpful to at least mention the other tasks it is used for (for instance translation; and since GPT-2 is essentially trained on English, this is a major problem in terms of digital language injustice).
Regarding the logical sequence of steps, I think mentioning earlier in the tutorial that GPT-Neo is also an option would benefit the user, perhaps with a reference to the ethical considerations section.
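If it helps, GPT-Neo can usually be swapped in with little more than a model-name change; here is an illustrative sketch using the Hugging Face pipeline API (the prompt is a placeholder, and this is not necessarily the workflow the author has in mind):

```python
# Generate text with GPT-Neo instead of GPT-2, via Hugging Face transformers.
# "EleutherAI/gpt-neo-125M" is the smallest public GPT-Neo checkpoint.
from transformers import pipeline

generator = pipeline("text-generation", model="EleutherAI/gpt-neo-125M")
print(generator("The archive reveals", max_length=50, num_return_sequences=1))
```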
Finally, with reference to the environmental impact of training models, it could be helpful to point the reader to recent work towards reducing it (see for instance https://aclanthology.org/2021.findings-acl.74/), because at the moment the impression is that such a big environmental cost is inevitable.
Thanks again for your important contribution and I hope you'll find my suggestions helpful.
Hey @lorellav, I am just noting here that this review is part of issue #418, so that it gets indexed there. cc @tiagosousagarcia @jrladd @anisa-hawes