A single training example exceeds the token limit

rogerslh commented 1 month ago

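I have a question. I came across one of your datasets on Hugging Face, and the individual examples are quite long; they look like they exceed the token limit used during training. How is this handled?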

fly-dust commented 1 month ago

Generally, the fine-tuning library automatically truncates whatever exceeds the limit, so that part is simply not trained on.
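As a rough illustration of what that truncation amounts to, here is a minimal sketch in plain Python. The helper `truncate_example` and the toy token IDs are hypothetical and not taken from any particular fine-tuning library; with Hugging Face tokenizers the same effect is usually obtained by passing `truncation=True` together with `max_length`.

```python
def truncate_example(token_ids, max_length):
    """Keep only the first max_length tokens; the rest is dropped and never trained on.

    Hypothetical helper for illustration, not a real library API.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return token_ids[:max_length]


# Toy token IDs standing in for one tokenized training example.
example = list(range(10))
kept = truncate_example(example, max_length=8)
dropped = len(example) - len(kept)
print(f"kept {len(kept)} tokens, dropped {dropped}")
```

Examples that are already within the limit pass through unchanged, so only the tail of over-long examples is lost.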

rogerslh commented 1 month ago

That could cause the training results to fall short of expectations.

fly-dust commented 4 weeks ago

It probably doesn't matter much? For Llama3, the 8192-token context should be more than enough.
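One way to check this for a given dataset is to measure what fraction of examples actually exceed the limit. The sketch below assumes a hypothetical `count_tokens` callable; the whitespace-based counter is only a stand-in, and a real check would use the model's own tokenizer, since token counts differ considerably from word counts.

```python
def fraction_over_limit(texts, count_tokens, max_length=8192):
    """Return the share of texts whose token count exceeds max_length."""
    if not texts:
        return 0.0
    over = sum(1 for text in texts if count_tokens(text) > max_length)
    return over / len(texts)


def whitespace_token_count(text):
    """Stand-in tokenizer (assumption): counts whitespace-separated words."""
    return len(text.split())


# Toy samples: two short ones and one with 9000 words.
samples = ["short reply", "word " * 9000, "another short one"]
ratio = fraction_over_limit(samples, whitespace_token_count)
print(f"{ratio:.1%} of samples exceed the limit")
```

If the resulting fraction is small, truncation touches only a few examples and is unlikely to change the outcome much.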

magpie-align / magpie

A single training example exceeds the token limit #25