sambanova / generative_data_prep

Apache License 2.0

Integration of Data Prep Github Repo into Studio #109

Closed by samratdeepprasad 1 week ago

samratdeepprasad commented 3 weeks ago

Following is the request for the above feature.

Must haves

  1. Change default packing config to greedy::drop
  2. Add scripts for validating the input dataset to make sure it is in the correct format. Check both the file type and the file's content to confirm it is in a prompt-and-completion format compatible with the GitHub repo (see the example record after this list).
  3. Add support for a custom tokenizer, plus checks (vocab size and others), in case the user wants to use one.
  4. Take in the input hyper-parameters and dataset, and construct a call to generative data prep. Surface information in case Studio needs to change user input, e.g. if the batch size or RDU count needs to change based on the generated metadata.yaml file.
  5. Share the status of the job run before and after the Data Prep step (Uploading Dataset, Validating Data Format, Data Prep Script Running, Checking Hyper-parameter Settings, Saving Dataset, Run Successful, Run Failed).
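
For reference, a minimal sketch of the prompt-and-completion JSONL format referred to in item 2; the exact field names here are assumed from the repo's input convention:

```python
import json

# Each line of the input file is one JSON object with "prompt" and
# "completion" fields (field names assumed, for illustration only).
records = [
    {"prompt": "What is the capital of France?", "completion": "Paris."},
    {"prompt": "Translate 'hello' to Spanish.", "completion": "Hola."},
]

with open("example_dataset.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```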

Additional (good to have): add support for multi-modal models and embedding models, and make the implemented solution future-proof for upcoming newer types of models.

snova-zoltanc commented 3 weeks ago
  1. Change default packing config to greedy::drop

Sounds good, I will create another GitHub issue to do this 🟡

  2. Add scripts for validating the input dataset to make sure it is in the correct format. Check both the file type and the file's content to confirm it is in a prompt-and-completion format compatible with the GitHub repo.

Sounds good, I will create a separate issue to track this 🟡

  3. Add support for a custom tokenizer and checks (vocab size and others) in case the user wants to use one.

We already support custom tokenizers in this repo ✅
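
For the vocab-size check mentioned in the request, here is a minimal sketch, assuming a Hugging Face tokenizer and model config; the helper name is hypothetical and not part of this repo:

```python
from transformers import AutoConfig, AutoTokenizer

def check_tokenizer_compatibility(tokenizer_path: str, model_path: str) -> None:
    """Hypothetical check: the tokenizer's vocab must fit the model's embedding table."""
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
    config = AutoConfig.from_pretrained(model_path)
    if len(tokenizer) > config.vocab_size:
        raise ValueError(
            f"Tokenizer has {len(tokenizer)} tokens but the model's vocab_size is "
            f"{config.vocab_size}; the custom tokenizer is not compatible."
        )
```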

  4. Take in the input hyper-parameters and dataset, and construct a call to generative data prep. Surface information in case Studio needs to change user input, e.g. if the batch size or RDU count needs to change based on the generated metadata.yaml file.

Sounds good, I can implement a function that takes in training hyper-parameters and outputs data prep hyper-parameters 🟡

  5. Share the status of the job run before and after the Data Prep step (Uploading Dataset, Validating Data Format, Data Prep Script Running, Checking Hyper-parameter Settings, Saving Dataset, Run Successful, Run Failed).

I believe this logic should be handled by Studio by calling the data prep API commands in sequence: Step 1, validate the input dataset when the user uploads it; Step 2, pass in the training hyper-parameters to get the generative data prep hyper-parameters; Step 3, call data prep with the hyper-parameters from Step 2. ✅ (No work needed on our end besides the yellow dots above.)
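
To sketch that flow, a minimal example of how Studio could drive the three steps; every function name here is a hypothetical placeholder passed in as a callable, not an actual generative_data_prep API:

```python
from typing import Callable

# Hypothetical orchestration of the three-step flow described above. The three
# callables stand in for the validation, argument-mapping, and data prep entry
# points; none of these names are real generative_data_prep functions.
def run_data_prep_for_studio(
    dataset_path: str,
    training_hparams: dict,
    validate_input_dataset: Callable[[str], None],
    training_to_data_prep_hparams: Callable[[dict], dict],
    run_data_prep: Callable[..., str],
) -> str:
    # Step 1: validate the uploaded dataset (raises on a bad format).
    validate_input_dataset(dataset_path)
    # Step 2: map the training hyper-parameters to data prep hyper-parameters.
    data_prep_hparams = training_to_data_prep_hparams(training_hparams)
    # Step 3: run data prep with the derived hyper-parameters; return the output path.
    return run_data_prep(dataset_path, **data_prep_hparams)
```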

samratdeepprasad commented 2 weeks ago

For Pt 5 I agree. However, before the job run starts we need to update the API call response with the current status, and if an error is found in a step (for example, an incorrect dataset format), include a message explaining it. Statuses could include validating the dataset format and content, running the data prep script, and, if it failed, why it failed. Once data prep has run successfully, Studio can take over and push the dataset into training.
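
A minimal sketch of how those statuses could be enumerated for the API response; the names and the enum itself are hypothetical, not an agreed interface:

```python
from enum import Enum

# Hypothetical status values for the data prep stage of a job run; the exact
# set and names would need to be agreed with the Studio team.
class DataPrepStatus(str, Enum):
    UPLOADING_DATASET = "uploading_dataset"
    VALIDATING_DATA_FORMAT = "validating_data_format"
    DATA_PREP_RUNNING = "data_prep_running"
    CHECKING_HYPERPARAMS = "checking_hyperparams"
    SAVING_DATASET = "saving_dataset"
    RUN_SUCCESSFUL = "run_successful"
    RUN_FAILED = "run_failed"
```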

I'd also like to call out that these requests are based on my PRD. If other dependencies come out of Studio's exploration later, those might also need to be included, so it's best to speak with the Studio team once as well to clarify any dependencies/requirements.

cc @snova-zoltanc @vmly

snova-zoltanc commented 2 weeks ago

@samratdeepprasad @vmly

Sounds good, I will move ahead with implementing the three steps from pt 5 above.

Then I think it will be the Studio team's responsibility to define how you want me to output any logging data and to define the API calls based on this code. It may be good to call a meeting to discuss this design.

snova-zoltanc commented 2 weeks ago

Discussed with @samratdeepprasad @vmly and there are four key PR updates we have to make for this:

Update Input Path To Accept Directory

  1. Update the input_file_path argument to input_path, allowing it to take a directory that has one jsonl under it
     a. In a future update we can accept a directory of split jsonls
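
A minimal sketch of resolving such an input_path; the behavior when the directory holds zero or multiple jsonls is an assumption, not a decision:

```python
from pathlib import Path

def resolve_input_path(input_path: str) -> Path:
    """If input_path is a directory, return the single .jsonl file under it;
    otherwise return the file itself. Error handling here is an assumption."""
    path = Path(input_path)
    if path.is_file():
        return path
    jsonl_files = sorted(path.glob("*.jsonl"))
    if len(jsonl_files) != 1:
        raise ValueError(
            f"Expected exactly one .jsonl file under {path}, found {len(jsonl_files)}."
        )
    return jsonl_files[0]
```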

Argument Parsing

  1. Implement a function that takes in SambaStudio arguments and returns the data preparation arguments.
  2. Update the main data preparation function to accept arguments that are returned from the above function.
  3. Validate the arguments to make sure they work together
     a. tokenizer and checkpoint compatibility
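
A minimal sketch of the mapping function from item 1; every argument name here is an assumption for illustration, not the actual SambaStudio or data prep argument set:

```python
from dataclasses import dataclass

# Hypothetical argument containers; the real SambaStudio and data prep
# argument names would come from the Studio team and this repo's CLI.
@dataclass
class StudioArgs:
    dataset_path: str
    checkpoint_path: str
    max_seq_length: int
    batch_size: int

@dataclass
class DataPrepArgs:
    input_path: str
    tokenizer_path: str
    max_seq_length: int
    packing_config: str = "greedy::drop"  # proposed new default

def studio_to_data_prep_args(studio_args: StudioArgs) -> DataPrepArgs:
    """Map SambaStudio arguments onto data preparation arguments (sketch only)."""
    return DataPrepArgs(
        input_path=studio_args.dataset_path,
        tokenizer_path=studio_args.checkpoint_path,
        max_seq_length=studio_args.max_seq_length,
    )
```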

Dataset Validation

  1. Create a function, independent of the data preparation process, that simply takes a path to an input file, validates whether it is in the correct format, and either returns true or throws an error.
     a. Save the input file size, number of lines, etc., but most metadata will not be available
  2. Also write a function to validate an HDF5 file
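
A minimal sketch of the JSONL validation function from item 1, assuming the prompt-and-completion format shown earlier; the field names and return behavior are assumptions:

```python
import json
from pathlib import Path

def validate_jsonl_dataset(file_path: str) -> bool:
    """Validate that every line is a JSON object with 'prompt' and 'completion'
    fields; return True on success, raise ValueError otherwise (sketch only)."""
    path = Path(file_path)
    if path.suffix != ".jsonl":
        raise ValueError(f"Expected a .jsonl file, got {path.suffix or 'no extension'}.")
    with path.open("r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(f"Line {line_number} is not valid JSON: {err}") from err
            if not isinstance(record, dict) or not {"prompt", "completion"} <= record.keys():
                raise ValueError(
                    f"Line {line_number} must be a JSON object with 'prompt' and 'completion' keys."
                )
    return True
```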

Logging

  1. Add an input flag to indicate whether this data preparation command is from Studio or non-interactive.
     a. If the command is from Studio, simplify logging to only log to the console (using a different logging configuration) and remove the dynamic progress bar.
  2. Reduce progress logging in the log file to every 10 seconds.
  3. Save the dataset metrics into the metadata YAML file so they can be displayed under each dataset's card.
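
A minimal sketch of what that flag-driven logging configuration could look like; the flag name and setup here are assumptions:

```python
import logging
import sys

def configure_logging(from_studio: bool, log_file: str = "data_prep.log") -> None:
    """Console-only logging when invoked from Studio; console plus log file otherwise.
    The 'from_studio' flag name is a placeholder for the proposed input flag."""
    handlers = [logging.StreamHandler(sys.stdout)]
    if not from_studio:
        handlers.append(logging.FileHandler(log_file))
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
        handlers=handlers,
        force=True,  # replace any handlers configured earlier
    )
```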

snova-zoltanc commented 2 weeks ago

I will create a separate issue for each of these specific PRs so we can track them more closely; here are the links:

Directory input support: https://github.com/sambanova/generative_data_prep/issues/114
Argument Parsing: https://github.com/sambanova/generative_data_prep/issues/111
Dataset Validation: https://github.com/sambanova/generative_data_prep/issues/112
Logging: https://github.com/sambanova/generative_data_prep/issues/113

If we agree on these four PRs, then I can close this issue and focus on handling each objective independently. @vmly @samratdeepprasad

snova-zoltanc commented 1 week ago

I am closing this issue because all the tasks have been scoped out. I have added the subtasks under a project:

https://github.com/orgs/sambanova/projects/1/views/1