Closed: @samratdeepprasad closed this issue 1 week ago
- Change default packing config to greedy::drop
Sounds good, I will create another GitHub issue to do this 🟡
- Add scripts for validating the input dataset to make sure it is in the correct format. Check both the file type and the file's content to confirm it is in a prompt-and-completion format compatible with the GitHub repo.
Sounds good, I will create a separate issue to track this 🟡
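As a starting point for that issue, here is a minimal sketch of what such a validator could look like. It assumes a `.jsonl` file where each line is a JSON object with `"prompt"` and `"completion"` keys; the function name, key names, and accepted file types are illustrative assumptions, not the repo's actual spec.

```python
import json
from pathlib import Path


def validate_prompt_completion_file(path: str) -> list[str]:
    """Return a list of validation errors (an empty list means the file passed).

    Hypothetical sketch: checks file type, then that each non-empty line is a
    JSON object containing both 'prompt' and 'completion' keys.
    """
    errors = []
    p = Path(path)
    if p.suffix not in {".jsonl", ".json"}:
        errors.append(f"unsupported file type: {p.suffix}")
        return errors
    with p.open() as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines rather than flagging them
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                errors.append(f"line {lineno}: not valid JSON")
                continue
            if not isinstance(record, dict) or not {"prompt", "completion"} <= record.keys():
                errors.append(f"line {lineno}: missing 'prompt'/'completion' keys")
    return errors
```

Returning a list of errors (instead of raising on the first one) lets the caller surface all problems to the user in a single response.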
- Add support for custom tokenizer and checks (vocab size and others) in case the user wants to use one.
We already support custom tokenizers in this repo ✅
- Take in the input hyper-parameters and dataset, and construct a call to generative data prep. Surface information in case Studio needs to make changes to the user input, e.g. if the batch size or RDU count needs to change based on the generated metadata.yaml file.
Sounds good, I can implement a function that takes in training hyper-parameters and outputs data prep hyper-parameters 🟡
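A minimal sketch of what that function could look like, assuming a simple dict-to-dict mapping. The field names (`max_seq_length`, `packing_config`, `shuffle`) and defaults are assumptions for illustration, not the finalized interface.

```python
def training_to_data_prep_args(training_hparams: dict) -> dict:
    """Hypothetical sketch: derive generative data prep hyper-parameters
    from the training hyper-parameters supplied by Studio."""
    return {
        # Sequence length for tokenized articles; assumed default of 2048.
        "max_seq_length": training_hparams.get("max_seq_length", 2048),
        # Greedy packing with dropped overflow, per the proposed default above.
        "packing_config": training_hparams.get("packing_config", "greedy::drop"),
        # Assumed heuristic: shuffle in RAM for small datasets, on disk otherwise.
        "shuffle": "on_RAM" if training_hparams.get("small_dataset", True) else "large_file",
    }
```

Keeping the mapping in one pure function makes it easy to unit-test and to extend when new training hyper-parameters are added.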
- Share the status of the job run before and after the Data Prep step (Uploading Dataset, Validating Data Format, Data Prep Script Running, Checking Hyper-parameter Settings, Saving Dataset, Run Successful, Run Failed).
I believe this logic should be handled by Studio, by calling the data prep API commands in sequence: Step 1, validate the input dataset when the user uploads it; Step 2, pass in the training hyper-parameters to get the generative data prep hyper-parameters; Step 3, call data prep with the data prep hyper-parameters from Step 2. ✅ (No work needed on our end besides the yellow-dot items above.)
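The three steps above could be wired together roughly as follows. All function names here are placeholders with stub bodies, not the repo's real API; the point is only to show the call sequence Studio would drive.

```python
def validate_input_dataset(path: str) -> list[str]:
    # Placeholder for Step 1: a real check would inspect the file type
    # and the prompt/completion record format.
    return [] if path.endswith(".jsonl") else ["unsupported file type"]


def derive_data_prep_hparams(training_hparams: dict) -> dict:
    # Placeholder for Step 2: map training settings to data prep arguments.
    return {"max_seq_length": training_hparams.get("max_seq_length", 2048)}


def run_data_prep(path: str, **prep_args) -> str:
    # Placeholder for Step 3: invoking the generative data prep pipeline.
    return f"prepared {path} with {prep_args}"


def run_data_prep_flow(dataset_path: str, training_hparams: dict) -> str:
    """Hypothetical orchestration of the three steps described above."""
    errors = validate_input_dataset(dataset_path)            # Step 1
    if errors:
        raise ValueError(f"dataset validation failed: {errors}")
    prep_args = derive_data_prep_hparams(training_hparams)   # Step 2
    return run_data_prep(dataset_path, **prep_args)          # Step 3
```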
For Pt 5 I agree. However, before the job run starts we need to update the API call response with the current status, and if an error is found in a step (e.g. incorrect dataset format), return it with a message. Statuses could include validating the dataset format and content, running the data prep script, and, if it failed, why. Once the run has completed successfully, Studio can take over and push the dataset into training.
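One way to model those statuses so the API response can carry both the current step and any error message. The enum names and the response shape are assumptions based on this thread, not a settled design.

```python
from enum import Enum
from typing import Optional


class DataPrepStatus(Enum):
    """Hypothetical statuses, taken from the list proposed above."""
    UPLOADING_DATASET = "Uploading Dataset"
    VALIDATING_DATA_FORMAT = "Validating Data Format"
    DATA_PREP_RUNNING = "Data Prep Script Running"
    CHECKING_HYPERPARAMS = "Checking Hyper-parameter Settings"
    SAVING_DATASET = "Saving Dataset"
    RUN_SUCCESSFUL = "Run Successful"
    RUN_FAILED = "Run Failed"


def status_response(status: DataPrepStatus, error: Optional[str] = None) -> dict:
    """Build an API response payload carrying the current status and,
    if the run failed, a human-readable error message."""
    payload = {"status": status.value}
    if error:
        payload["error"] = error
    return payload
```

A payload like this would let Studio poll for progress and show the user a specific failure reason (e.g. "dataset format incorrect") instead of a generic error.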
I'd also like to call out that these requests are based on my PRD. If other dependencies come out of Studio's exploration later, those might also need to be included; it's best to speak with the Studio team once as well to clarify any dependencies/requirements.
cc @snova-zoltanc @vmly
@samratdeepprasad @vmly
Sounds good, I will move ahead with implementing the three steps from pt 5 above.
Then I think it will be the Studio team's responsibility to define how you want me to output logging data and structure the API calls based on this code. It may be good to call a meeting to discuss the design of this.
Discussed with @samratdeepprasad @vmly and there are 4 key PR updates we have to make for this:
- Update Input Path To Accept Directory
- Argument Parsing
- Dataset Validation
- Logging
I will create a separate issue for each of these specific PRs so we can track them more closely. Here are the links:
- Directory input support: https://github.com/sambanova/generative_data_prep/issues/114
- Argument Parsing: https://github.com/sambanova/generative_data_prep/issues/111
- Dataset Validation: https://github.com/sambanova/generative_data_prep/issues/112
- Logging: https://github.com/sambanova/generative_data_prep/issues/113
If we agree on these four PRs, then I can close this issue and focus on handling each objective independently. @vmly @samratdeepprasad
I am closing this issue because all the tasks have been scoped out; I have added the subtasks under a project.
The following are the requests for the above feature.
Must haves
Additional (good to have): Add support for multi-modal models and embedding models, and make the implemented solution future-proof for upcoming newer types of models.