nazmulkazi / dataset_automated_medical_transcription

Dataset for training a machine learning model to automatically generate psychiatric case notes from doctor-patient conversations.

Great project! #4

Open 1Dbcj opened 1 year ago

1Dbcj commented 1 year ago

this is incredible. What amazing work. Thank you for this effort. Any thoughts on how you would use this to train a local LLM?

nazmulkazi commented 1 year ago

What do you mean? It is pretty straightforward.

1Dbcj commented 1 year ago

I'm sure it is.

Full disclosure: I'm a psychiatric resident with minimal data science/coding knowledge. I've only really learned some Python and am trying to contribute some ideas to the field. I've been teaching myself what I can to implement on clinical projects but am finding myself frequently out of my depth. ChatGPT has been my first tutor, but with its 2021 data cutoff, training models has been an error-filled journey.

I guess I'm hoping for some recommendations if you can spare them. I'd imagine this would be best used with something like BERT or one of the smaller LLMs like Vicuna from Hugging Face, but I'm struggling with the implementation and training using the pickle file.

Did you train a model on Hugging Face with this already?

nazmulkazi commented 11 months ago

First of all, I apologize for the delayed response. Thanks for the full disclosure; it has enabled me to grasp precisely what you are seeking. I recommend taking a look at this page, which offers a helpful beginner's example of how to fine-tune an LLM (I suggest using PyTorch).

FYI, training and fine-tuning are distinct concepts with specific meanings in NLP: training builds a model from scratch, while fine-tuning adapts an already pretrained model to a new task. What you are interested in is fine-tuning.

In my M.Sc. thesis, I mostly experimented with BERT, which exhibited promising performance on this data. However, based on my experience with LLMs, I highly recommend RoBERTa-large over other BERT-based models. GPT-based models should also perform well.
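To make the fine-tuning step concrete, here is a minimal sketch using the Hugging Face Transformers `Trainer` API with `roberta-large`, as suggested above. This is not the script from the thesis; the label names, hyperparameters, and dataset wrapper are illustrative assumptions.

```python
# Sketch: fine-tuning RoBERTa for classification with Hugging Face Transformers.
# Everything below (label set, epochs, output dir) is an assumed placeholder.

def encode_labels(texts, labels, label_names):
    """Map string labels to the integer ids a classification head expects."""
    label_to_id = {name: i for i, name in enumerate(label_names)}
    return [(text, label_to_id[label]) for text, label in zip(texts, labels)]

def fine_tune(train_texts, train_label_ids, num_labels):
    """Heavy step: downloads the pretrained checkpoint on first run."""
    import torch
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        Trainer,
        TrainingArguments,
    )

    tokenizer = AutoTokenizer.from_pretrained("roberta-large")
    model = AutoModelForSequenceClassification.from_pretrained(
        "roberta-large", num_labels=num_labels
    )
    encodings = tokenizer(train_texts, truncation=True, padding=True)

    class NoteDataset(torch.utils.data.Dataset):
        """Wrap tokenized inputs and labels as tensors for the Trainer."""
        def __init__(self, encodings, labels):
            self.encodings, self.labels = encodings, labels
        def __len__(self):
            return len(self.labels)
        def __getitem__(self, idx):
            item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
            item["labels"] = torch.tensor(self.labels[idx])
            return item

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", num_train_epochs=3),
        train_dataset=NoteDataset(encodings, train_label_ids),
    )
    trainer.train()
    return model
```

The same `Trainer` pattern works for GPT-style models by swapping the model class; only the checkpoint name and head change.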

Typically, models or frameworks do not accept pickle files as direct input. A pickle file is simply a serialized Python object, which has the benefit that you don't need to parse or reformat the data after reading it from disk. Nevertheless, before feeding the data to a model, you will still need to select and/or transform it after loading it from the pickle file.
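A hypothetical example of the load-then-transform step described above, using only the standard library. The record fields (`"transcript"`, `"note"`) are assumptions about the schema, not taken from this repository; after loading, inspect the object with `type()` and `.keys()` to see its real structure.

```python
# Sketch: read a pickled dataset back into memory and shape it for a model.
# Field names below are assumed placeholders, not the repo's actual schema.
import pickle

def load_records(path):
    """Deserialize the pickled Python object from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)

def to_model_inputs(records):
    """Select/transform fields into (input_text, target_text) pairs."""
    return [(rec["transcript"], rec["note"]) for rec in records]
```

Once the data is in plain `(input, target)` pairs like this, it can be tokenized and wrapped in whatever dataset class your training framework expects.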

While I cannot share the Python script I used to train the model in 2021, as it has become outdated and no longer works, I thought of providing a reference to my M.Sc. thesis, which might contain valuable information for your project: Kazi, Nazmul Hasan. Automated Clinical Transcription for Behavioral Health Clinicians. M.Sc. thesis, Montana State University, Bozeman, 2021.

Please feel free to reach out to me if you need further help or have any questions. I hope it helps.