pytorch / torchtune

PyTorch native finetuning library
https://pytorch.org/torchtune/main/
BSD 3-Clause "New" or "Revised" License

Custom instruct template for task specific finetuning on Llama 3.1 using torchtune : Module not found error for custom template #1295

Closed muniefht closed 5 days ago

muniefht commented 2 months ago

Hi, I am new to the field and trying to fine-tune a model for the first time. I am working with torchtune on the lora_finetune_single_device recipe. I was able to do the finetuning using the built-in alpaca dataset; now I am trying to fine-tune on a custom dataset. The data is a CSV file containing abusive and non-abusive tweets, and I am trying to fine-tune the model for Urdu-language abuse detection. One column contains "tweets" and the other column contains "target" (0/1). I thought the instruct_dataset() format would be the best fit for such a problem, so I created a custom template with the following code:

```python
from typing import Any, Dict, Mapping, Optional

from torchtune.data import InstructTemplate


class AbusiveLanguageDetectionTemplate(InstructTemplate):
    template = (
        "You are an abusive language detection model for Urdu. Your job is to detect the abusive language in the Urdu sentences. "
        "Output '1' if the sentence is abusive and output '0' if the sentence is non-abusive. No explanation is required.\n\n"
        "### Input:\n{tweet}\n\n### Response:\n{target}\n"
    )

    @classmethod
    def format(
        cls, sample: Mapping[str, Any], column_map: Optional[Dict[str, str]] = None
    ) -> str:
        if column_map:
            input_column = column_map.get("tweet", "tweet")
            response_column = column_map.get("target", "target")
        else:
            input_column = "tweet"
            response_column = "target"

        return cls.template.format(
            tweet=sample[input_column], target=str(sample[response_column])
        )
```
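As a quick sanity check, the formatting logic can be exercised on its own, without installing torchtune. This is a standalone sketch that copies the template string and `format` logic from the class above into a plain function (names here are illustrative, not part of torchtune):

```python
# Standalone sketch of the template logic above, runnable without torchtune.
from typing import Any, Dict, Mapping, Optional

TEMPLATE = (
    "You are an abusive language detection model for Urdu. Your job is to "
    "detect the abusive language in the Urdu sentences. Output '1' if the "
    "sentence is abusive and output '0' if the sentence is non-abusive. "
    "No explanation is required.\n\n"
    "### Input:\n{tweet}\n\n### Response:\n{target}\n"
)


def format_sample(
    sample: Mapping[str, Any], column_map: Optional[Dict[str, str]] = None
) -> str:
    """Fill the prompt template from a dataset row, honoring an optional column_map."""
    column_map = column_map or {}
    input_column = column_map.get("tweet", "tweet")
    response_column = column_map.get("target", "target")
    return TEMPLATE.format(
        tweet=sample[input_column], target=str(sample[response_column])
    )
```

Running `format_sample({"tweet": "...", "target": 1})` shows exactly what string the tokenizer will see, which makes it easy to catch missing columns or formatting mistakes before a training run.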

I saved this code in a file named "abuse_detection.py", which is my custom template. Now I am trying to link this template in my custom_config.yaml file. For the dataset field, I have specified the following:

```yaml
# Dataset and Sampler
dataset:
  _component_: torchtune.datasets.instruct_dataset
  source: abusive_train.csv
  template: abuse_detection.AbusiveLanguageDetectionTemplate
  max_seq_len: 4096
  train_on_input: False
  packed: False
batch_size: 2
seed: null
shuffle: True
```

where "abusive_train.csv" is the file name of my CSV file. My custom_config.yaml, abusive_train.csv, and abuse_detection.py files are all located in the same directory, and I am running the following command:

```
tune run lora_finetune_single_device --config custom_config.yaml
```

but I am getting the following error:

```
ModuleNotFoundError("No module named 'abuse_detection'") Are you sure that module 'abuse_detection' is installed?
```

Can someone point out what I am doing wrong? Where should I place the abuse_detection.py file for it to be picked up by the system? Please help.
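For context on why this error appears: a config-driven library typically turns a dotted string like `abuse_detection.AbusiveLanguageDetectionTemplate` into a Python object by importing the module part and looking up the attribute. Here is a hedged sketch of that pattern using `importlib` (the function name is illustrative, not torchtune's actual internals); the `ModuleNotFoundError` fires when the module is not on `sys.path`:

```python
# Illustrative sketch of dotted-path resolution, as config-driven libraries
# commonly implement it. Not torchtune's actual code.
import importlib


def resolve_dotted_path(path: str):
    """Resolve 'pkg.module.Attr' to the object Attr in pkg.module."""
    module_path, _, attr_name = path.rpartition(".")
    # Raises ModuleNotFoundError if module_path is not importable from sys.path.
    module = importlib.import_module(module_path)
    return getattr(module, attr_name)
```

Since the current working directory is not automatically on `sys.path` when a console script like `tune` runs, a file sitting next to the config is not importable by default, which matches the error above.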

felipemello1 commented 2 months ago

Hey @muniefht, can you try passing the whole path? my.path.to.abuse_detection.AbusiveLanguageDetectionTemplate

Not sure if that will solve it, but I think it's an easy one to try.

muniefht commented 2 months ago

I tried doing that; it did not work. Also, I am using a shared server with different users, and I am not a root user, but that should not be a problem as far as I know. I have found a workaround: I placed my template inside the installed torchtune package (under site-packages/torchtune), wrote the path as torchtune.abuse_detection.AbusiveLanguageDetectionTemplate, and that worked.
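Copying files into site-packages works but modifies a shared installation. A less invasive alternative, for anyone importing the template from their own scripts, is to put the directory containing abuse_detection.py on `sys.path` for the current process. A minimal sketch (the helper name is hypothetical; note this only affects the running Python process, so a training run launched via the `tune` CLI still needs the PYTHONPATH approach described below):

```python
# Sketch: make a directory importable for the current Python process,
# instead of copying modules into site-packages.
import sys
from pathlib import Path


def make_importable(directory: str) -> None:
    """Prepend `directory` to sys.path so modules inside it can be imported."""
    path = str(Path(directory).resolve())
    if path not in sys.path:
        sys.path.insert(0, path)
```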

zjost commented 3 weeks ago

For others with this problem, I found the following workaround.

Let's say you are currently in some directory we'll call cwd, and your file with the custom function my_function is at the path cwd/custom/pyfile.py. Then, in your recipe, put: `_component_: custom.pyfile.my_function`.

And then, when you use tune, prepend the following: `PYTHONPATH="$(pwd):$PYTHONPATH" tune ...`

This tells Python to also look in cwd, which is what `$(pwd)` expands to.
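Before launching a long training run, it can save time to verify the path trick actually took effect. A quick hedged check (substitute your own module name for the final import):

```shell
# Put the current directory on PYTHONPATH and confirm Python now sees it
# on sys.path; replace the assert with e.g. `import custom.pyfile` to test
# your own module.
export PYTHONPATH="$(pwd):$PYTHONPATH"
python -c "import sys, os; assert os.getcwd() in sys.path; print('cwd is importable')"
```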

RdoubleA commented 3 weeks ago

This should've been fixed in #1760 and #1731. When you run the same command without modifying PYTHONPATH, do you still run into issues? @zjost