Correction and Addition to Dataset
Description
This PR makes corrections and additions to the dataset.
Previously, after tokenization the dataset passed all tokens, including padding, as labels. Padding should not be included in the loss calculation, so this has been corrected.
In instruction tuning, the loss should be computed only on the LLM's responses, so datasets that compute the loss in this manner have been added for the llava, m3it, and Japanese csv datasets.
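For illustration, here is a minimal sketch of the padding fix; the function and variable names are hypothetical, not this repository's actual code:

```python
import torch

IGNORE_INDEX = -100  # the default ignore_index of PyTorch's cross-entropy

def build_labels(input_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    # Labels start as a copy of the input ids; every padding position is
    # replaced with -100 so it contributes nothing to the loss.
    labels = input_ids.clone()
    labels[input_ids == pad_token_id] = IGNORE_INDEX
    return labels
```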
Related Issue(s)
None
Changes
Fix: In the llava, m3it, and Japanese csv datasets, all padding tokens were included in the loss calculation; they are now set to -100 so that they are ignored.
Add: Prepared instruct versions of the llava, m3it, and Japanese csv datasets for instruction tuning, which compute the loss only on the GPT responses (see the sketch after this list).
Note: Setting a label to -100 excludes that token from both the loss and the gradient calculation when using PyTorch's cross-entropy.
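As a reference for reviewers, a minimal sketch of the response-only masking, assuming the index at which the GPT response begins is known after tokenization; all names here are hypothetical, not this repository's actual code:

```python
import torch

IGNORE_INDEX = -100

def mask_for_instruction_tuning(
    input_ids: torch.Tensor,  # 1-D tensor of token ids for one sample
    response_start: int,      # index at which the GPT response begins
    pad_token_id: int,
) -> torch.Tensor:
    labels = input_ids.clone()
    labels[:response_start] = IGNORE_INDEX            # mask prompt/instruction tokens
    labels[input_ids == pad_token_id] = IGNORE_INDEX  # mask padding tokens
    return labels
```

Because PyTorch's cross-entropy uses ignore_index=-100 by default, the masked positions contribute neither to the loss value nor to the gradients:

```python
logits = torch.randn(5, 32000)  # (seq_len, vocab_size); 32000 is an arbitrary vocab size
labels = torch.tensor([-100, -100, 7, 8, -100])
loss = torch.nn.functional.cross_entropy(logits, labels)  # averaged over the 2 unmasked tokens only
```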
Model/Algorithm Performance
None
Dependencies
None
Reviewer Notes
Operation has been verified. Please point out any areas where readability could be improved.
Confirmed
[x] I have updated the documentation accordingly.
[x] I have adhered to the coding standards and guidelines of this project.
[x] I have added comments, especially in hard-to-understand areas.