iLovEing / notebook

MIT License

Hugging Face - transformers #21

Open iLovEing opened 1 year ago

iLovEing commented 1 year ago

Links

Original video course, sample code

iLovEing commented 1 year ago

pipeline

iLovEing commented 1 year ago

tokenizer

iLovEing commented 1 year ago

model

iLovEing commented 1 year ago

dataset

def preprocess_function(example, tokenizer=tokenizer):
    model_inputs = tokenizer(example["content"], max_length=512, truncation=True)
    labels = tokenizer(example["title"], max_length=32, truncation=True)
    # The labels are simply the token IDs of the encoded title
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# batched=True speeds up the mapping; num_proc is the number of worker processes
processed_datasets = datasets.map(preprocess_function, batched=True, num_proc=24)
processed_datasets
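To make the data shapes concrete, here is a minimal plain-Python sketch of what a batched preprocessing function receives and returns. `toy_tokenizer` and its vocabulary are hypothetical stand-ins (whitespace splitting, IDs assigned on first sight), not the Hugging Face tokenizer; they only mimic the `input_ids` key and the `max_length` / `truncation` arguments used above, with smaller limits so the truncation is visible.

```python
# Hypothetical stand-in for a real tokenizer: splits on whitespace and
# assigns integer IDs on first sight. Only meant to illustrate data shapes.
vocab = {}

def toy_tokenizer(texts, max_length, truncation=True):
    batch_ids = []
    for text in texts:
        ids = [vocab.setdefault(tok, len(vocab)) for tok in text.split()]
        if truncation:
            ids = ids[:max_length]  # keep at most max_length tokens
        batch_ids.append(ids)
    return {"input_ids": batch_ids}

def preprocess_function(example, tokenizer=toy_tokenizer):
    # With batched=True, every column arrives as a list of values.
    model_inputs = tokenizer(example["content"], max_length=5, truncation=True)
    labels = tokenizer(example["title"], max_length=2, truncation=True)
    # The labels are the token IDs of the encoded title.
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

batch = {
    "content": ["a b c d e f g", "a b"],
    "title": ["x y z", "x"],
}
out = preprocess_function(batch)
print(out)
```

The first content string is cut to 5 tokens and the first title to 2, and the result holds one list per input row under both `input_ids` and `labels`.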


- **with DataCollator**

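from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

def process_function(examples):
    tokenized_examples = tokenizer(examples["review"], max_length=128, truncation=True)
    tokenized_examples["labels"] = examples["label"]
    return tokenized_examples

# Drop the raw columns so that only the tokenized fields reach the collator
tokenized_dataset = dataset.map(process_function, batched=True, remove_columns=dataset.column_names)
# The collator pads each batch when it is assembled
collator = DataCollatorWithPadding(tokenizer=tokenizer)
dl = DataLoader(tokenized_dataset, batch_size=32, collate_fn=collator, shuffle=True)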
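The tokenizer call above truncates but does not pad, so the padding is left to the collator, which by default pads each batch to its own longest sequence. Below is a rough plain-Python sketch of that idea. `pad_batch` and the `pad_token_id=0` default are hypothetical and this is not the library's implementation: the real collator takes the padding token and padding side from the tokenizer and returns tensors.

```python
def pad_batch(features, pad_token_id=0):
    """Pad every sequence in one batch to the longest sequence in that batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        # Right-pad the token IDs and mark padded positions with 0 in the mask.
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        batch["labels"].append(f["labels"])
    return batch

features = [
    {"input_ids": [101, 7, 8, 102], "labels": 1},
    {"input_ids": [101, 9, 102], "labels": 0},
]
padded = pad_batch(features)
print(padded)
```

Padding per batch rather than to a fixed `max_length` keeps short batches short, which is the usual reason for deferring padding to the collator.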



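- **Saving and loading**
from datasets import Dataset, load_dataset, load_from_disk

processed_datasets.save_to_disk("./processed_data")
processed_datasets = load_from_disk("./processed_data")
# Other ways to load data
# Load directly from files
dataset = load_dataset("csv", data_files="./ChnSentiCorp_htl_all.csv", split="train")
dataset = load_dataset("csv", data_dir="./data/", split="train")
# Load from a pandas DataFrame (for data stored on disk, load_dataset with an explicit format also works)
dataset = Dataset.from_pandas(df)
# Load from a list of dicts (for data stored on disk, load_dataset with an explicit format also works)
data = [{"text": "abc"}, {"text": "def"}]
dataset = Dataset.from_list(data)
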
iLovEing commented 1 year ago

evaluate

The evaluation metrics supported for each task can be found here: open the task's page and look for its Metrics section.

import evaluate

# Accumulate one sample at a time, then compute
accuracy = evaluate.load("accuracy")
for ref, pred in zip([0,1,0,1], [1,0,0,1]):
    accuracy.add(references=ref, predictions=pred)
accuracy.compute()

# Accumulate one batch at a time, then compute
accuracy = evaluate.load("accuracy")
for refs, preds in zip([[0,1],[0,1]], [[1,0],[0,1]]):
    accuracy.add_batch(references=refs, predictions=preds)
accuracy.compute()
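The two loops above differ only in how results are accumulated before `compute()` is called. Here is a small plain-Python sketch of that accumulate-then-compute flow. `ToyAccuracy` is a hypothetical stand-in written for illustration, not the `evaluate` implementation; it reuses the same reference and prediction values.

```python
class ToyAccuracy:
    """Hypothetical stand-in that mimics the add / add_batch / compute flow."""

    def __init__(self):
        self.references = []
        self.predictions = []

    def add(self, references, predictions):
        # One sample at a time.
        self.references.append(references)
        self.predictions.append(predictions)

    def add_batch(self, references, predictions):
        # A whole batch (list) at a time.
        self.references.extend(references)
        self.predictions.extend(predictions)

    def compute(self):
        correct = sum(r == p for r, p in zip(self.references, self.predictions))
        return {"accuracy": correct / len(self.references)}

per_sample = ToyAccuracy()
for ref, pred in zip([0, 1, 0, 1], [1, 0, 0, 1]):
    per_sample.add(references=ref, predictions=pred)

per_batch = ToyAccuracy()
for refs, preds in zip([[0, 1], [0, 1]], [[1, 0], [0, 1]]):
    per_batch.add_batch(references=refs, predictions=preds)

print(per_sample.compute(), per_batch.compute())
```

Both variants see the same four reference/prediction pairs, two of which match, so both give an accuracy of 0.5.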

# Combine several metrics into one object
clf_metrics = evaluate.combine(["accuracy", "f1", "recall", "precision"])
results = clf_metrics.compute(predictions=[0, 1, 0], references=[0, 1, 1])
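As a sanity check on what the four combined metrics mean, the same predictions and references can be worked through by hand. The sketch below uses only plain Python and assumes binary classification with label 1 as the positive class. That is an assumption about the metric defaults, so compare it against the real `results` before relying on it.

```python
predictions = [0, 1, 0]
references = [0, 1, 1]

pairs = list(zip(predictions, references))
tp = sum(p == 1 and r == 1 for p, r in pairs)  # predicted 1, truly 1
fp = sum(p == 1 and r == 0 for p, r in pairs)  # predicted 1, truly 0
fn = sum(p == 0 and r == 1 for p, r in pairs)  # predicted 0, truly 1

accuracy = sum(p == r for p, r in pairs) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

manual = {"accuracy": accuracy, "f1": f1, "recall": recall, "precision": precision}
print(manual)
```

Here one true positive, no false positives, and one false negative give a precision of 1.0, a recall of 0.5, an F1 of about 0.667, and an accuracy of about 0.667.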


- **Visualization**

from evaluate.visualization import radar_plot

data = [
    {"accuracy": 0.99, "precision": 0.8, "f1": 0.95, "latency_in_seconds": 33.6},
    {"accuracy": 0.98, "precision": 0.87, "f1": 0.91, "latency_in_seconds": 11.2},
    {"accuracy": 0.98, "precision": 0.78, "f1": 0.88, "latency_in_seconds": 87.6},
    {"accuracy": 0.88, "precision": 0.78, "f1": 0.81, "latency_in_seconds": 101.6},
]

model_names = ["Model 1", "Model 2", "Model 3", "Model 4"]

plot = radar_plot(data=data, model_names=model_names)

iLovEing commented 1 year ago

trainer

The evaluation metrics supported for each task can be found here: open the task's page and look for its Metrics section.