[Open] mosalov opened this issue 6 days ago
Thank you for identifying and sharing this, Oleg! This looks reasonable, though we might want to make sure only the tokenizer arg gets passed to `compare`, not any of the other `other_args`. You're welcome to create a PR for this case as well, or we can make these updates.

CC: @nv-hwoo
Thanks, David!

`other_args` include three parameters related to a tokenizer and one for verbosity. What would be the right way for a pull request:

1. pass only the `tokenizer` parameter, or
2. pass all of the tokenizer-related parameters from `other_args`?
Thanks for checking and proposing solutions! We originally had one tokenizer arg, and it looks like it expanded to three over time.

My suggestion would be option 2, so that there are no issues with custom tokenizers. I would recommend creating a separate category and function for the tokenizer args, so you can add those args to both `profile` and `compare`. The verbosity arg can be left in the `other_args` category and function.
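The suggested refactor could be sketched roughly as follows, assuming an argparse-based CLI. The function names and the exact set of tokenizer flags here are illustrative guesses, not genai-perf's actual internals:

```python
import argparse


def add_tokenizer_args(parser: argparse.ArgumentParser) -> None:
    """Register the tokenizer-related arguments on a subcommand parser.

    Keeping them in one helper lets both `profile` and `compare`
    accept the same set of flags.
    """
    group = parser.add_argument_group("tokenizer")
    group.add_argument("--tokenizer", default=None)
    # The two extra flags below are assumptions about what the three
    # tokenizer args might be; the real names may differ.
    group.add_argument("--tokenizer-revision", default=None)
    group.add_argument("--tokenizer-trust-remote-code", action="store_true")


def add_other_args(parser: argparse.ArgumentParser) -> None:
    """Arguments that stay in the generic 'other' category (verbosity only)."""
    parser.add_argument("-v", "--verbose", action="store_true")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="genai-perf")
    subparsers = parser.add_subparsers(dest="subcommand", required=True)

    profile = subparsers.add_parser("profile")
    add_tokenizer_args(profile)  # profile already supported these
    add_other_args(profile)

    compare = subparsers.add_parser("compare")
    add_tokenizer_args(compare)  # the fix: compare now accepts them too
    add_other_args(compare)

    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["compare", "--tokenizer", "meta-llama/Llama-3.2-3B"]
    )
    print(args.tokenizer)
```

With a shared helper, the verbosity flag stays in `add_other_args` while every tokenizer flag travels together, so `compare` cannot accidentally pick up unrelated `other_args`.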
It looks to me that whereas the `profile` command supports the `--tokenizer` argument, the `compare` command does not have such an option. That leads to a case where you can run a series of experiments with a tokenizer of your choice and then, when trying to compare the results, get distorted ranges for input and output sequence lengths.

I experienced this while running Llama3_2-3B with the tokenizer included with the model. All my experiments were run with input and output lengths equal to 256, and when I built plots with the `--generate-plots` argument, all ranges were as expected. But when I tried to `compare` my experiments, the inputs became of size 285 and the outputs of size 347, all due to the default tokenizer being used.

I worked around the issue on my side with the following patch:
which allowed me to pass the tokenizer to the `compare` command and receive properly formatted plots.

I am not 100% sure that this approach is the way to go, so I would like to get confirmation. I also believe it is worth reviewing other uses of the tokenizer in the codebase.