Correction and Addition to Dataset
Description
This PR makes corrections and additions to the dataset.
Previously, after tokenization the dataset passed all tokens, including padding, as labels. Padding should not be included in the loss calculation, so this has been corrected.
In instruction tuning, the loss should be computed only on the LLM's responses, so datasets that compute the loss in this manner have been added for the llava, m3it, and Japanese csv datasets.
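For illustration, here is a minimal sketch of the padding fix; the function and variable names are hypothetical, not this repository's actual code:

```python
import torch

IGNORE_INDEX = -100  # the default ignore_index of PyTorch's cross-entropy

def build_labels(input_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    # Labels start as a copy of the input ids; every padding position is
    # replaced with -100 so it contributes nothing to the loss.
    labels = input_ids.clone()
    labels[input_ids == pad_token_id] = IGNORE_INDEX
    return labels
```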
Related Issue(s)
None
Changes
Fix: In the llava, m3it, and Japanese csv datasets, all padding tokens were included in the loss calculation; they are now set to -100 so that they are ignored.
Add: Prepared instruct versions of the llava, m3it, and Japanese csv datasets for instruction tuning, which compute the loss only on the GPT responses (see the sketch after this list).
Note: Setting a label to -100 excludes that token from both the loss and the gradient calculation when using PyTorch's cross-entropy.
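As a reference for reviewers, a minimal sketch of the response-only masking, assuming the index at which the GPT response begins is known after tokenization; all names here are hypothetical, not this repository's actual code:

```python
import torch

IGNORE_INDEX = -100

def mask_for_instruction_tuning(
    input_ids: torch.Tensor,  # 1-D tensor of token ids for one sample
    response_start: int,      # index at which the GPT response begins
    pad_token_id: int,
) -> torch.Tensor:
    labels = input_ids.clone()
    labels[:response_start] = IGNORE_INDEX            # mask prompt/instruction tokens
    labels[input_ids == pad_token_id] = IGNORE_INDEX  # mask padding tokens
    return labels
```

Because PyTorch's cross-entropy uses ignore_index=-100 by default, the masked positions contribute neither to the loss value nor to the gradients:

```python
logits = torch.randn(5, 32000)  # (seq_len, vocab_size); 32000 is an arbitrary vocab size
labels = torch.tensor([-100, -100, 7, 8, -100])
loss = torch.nn.functional.cross_entropy(logits, labels)  # averaged over the 2 unmasked tokens only
```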
Model/Algorithm Performance
None
Dependencies
None
Reviewer Notes
Operation has been verified. Please point out any areas where readability could be improved.
Confirmed
[x] I have updated the documentation accordingly.
[x] I have adhered to the coding standards and guidelines of this project.
[x] I have added comments, especially in hard-to-understand areas.