iLovEing / notebook

MIT License

Hugging Face - transformers #21

Open iLovEing opened 11 months ago

iLovEing commented 11 months ago

Links

Original video course, sample code

iLovEing commented 11 months ago

pipeline

iLovEing commented 11 months ago

tokenizer

iLovEing commented 11 months ago

model

iLovEing commented 11 months ago

dataset

def preprocess_function(example, tokenizer=tokenizer):
    model_inputs = tokenizer(example["content"], max_length=512, truncation=True)
    labels = tokenizer(example["title"], max_length=32, truncation=True)
    # The labels are the token ids of the encoded title
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# batched: process examples in batches (faster); num_proc: number of worker processes
processed_datasets = datasets.map(preprocess_function, batched=True, num_proc=24)
processed_datasets
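With `batched=True`, `map` passes the function a dict whose values are lists (one entry per example) instead of a single example, which is why one tokenizer call can handle many rows at once. The following is a minimal pure-Python sketch of that calling convention, not the `datasets` implementation: `toy_tokenizer`, `toy_preprocess` and `batched_map` are hypothetical stand-ins, whitespace splitting replaces real tokenization, and the length limits are shrunk for readability.

```python
# Hypothetical stand-ins: this mimics the calling convention of
# map(..., batched=True); it is not the library's implementation.

def toy_tokenizer(texts, max_length):
    """Split each text on whitespace and truncate to max_length tokens."""
    return {"input_ids": [text.split()[:max_length] for text in texts]}


def toy_preprocess(batch):
    """Receive a dict of lists (one entry per example), return a dict of lists."""
    model_inputs = toy_tokenizer(batch["content"], max_length=6)
    labels = toy_tokenizer(batch["title"], max_length=3)
    # As in the snippet above, the labels are the encoded titles.
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


def batched_map(rows, fn, batch_size=2):
    """Apply fn to column-wise batches of rows and collect per-example outputs.

    Simplification: only the columns returned by fn are kept.
    """
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of row dicts into a dict of column lists.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        result = fn(batch)
        # Turn the dict of column lists back into row dicts.
        out.extend({key: result[key][i] for key in result} for i in range(len(chunk)))
    return out


rows = [
    {"title": "a b c d", "content": "one two three four five six seven"},
    {"title": "e f", "content": "eight nine"},
    {"title": "g", "content": "ten"},
]
processed = batched_map(rows, toy_preprocess)
```

The last batch holds a single row, which shows that the function has to work for any batch length, not only the configured batch size.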


- **with DataCollator**

from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

def process_function(examples):
    tokenized_examples = tokenizer(examples["review"], max_length=128, truncation=True)
    tokenized_examples["labels"] = examples["label"]
    return tokenized_examples

tokenized_dataset = dataset.map(process_function, batched=True, remove_columns=dataset.column_names)
# Pad each batch dynamically when the DataLoader collates it
collator = DataCollatorWithPadding(tokenizer=tokenizer)
dl = DataLoader(tokenized_dataset, batch_size=32, collate_fn=collator, shuffle=True)
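`DataCollatorWithPadding` pads each batch to the length of its longest sequence at the moment the `DataLoader` assembles it, so `process_function` only needs to truncate. A minimal pure-Python sketch of that idea follows. `pad_batch` and the `pad_token_id=0` default are illustrative assumptions; the real collator also returns tensors and handles additional keys.

```python
# Hypothetical helper illustrating dynamic padding; not the transformers code.

def pad_batch(features, pad_token_id=0):
    """Pad every sequence in the batch to the longest sequence in that batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        batch["labels"].append(f["labels"])
    return batch


features = [
    {"input_ids": [101, 7, 8, 102], "labels": 1},
    {"input_ids": [101, 9, 102], "labels": 0},
]
batch = pad_batch(features)
```

Padding per batch instead of to a fixed `max_length` keeps short batches short, which usually saves computation when sequence lengths vary a lot.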



- **Saving and loading**

from datasets import Dataset, load_dataset, load_from_disk

processed_datasets.save_to_disk("./processed_data")
processed_datasets = load_from_disk("./processed_data")
# Other ways to load data
# Load files directly
dataset = load_dataset("csv", data_files="./ChnSentiCorp_htl_all.csv", split="train")
dataset = load_dataset("csv", data_dir="./data/", split="train")
# Load from a pandas DataFrame (load_dataset with an explicit format also works)
dataset = Dataset.from_pandas(df)
# Load from a list of dicts (load_dataset with an explicit format also works)
data = [{"text": "abc"}, {"text": "def"}]
dataset = Dataset.from_list(data)
iLovEing commented 11 months ago

evaluate

The evaluation metrics supported for each task can be found here: open the task's page and look for its Metrics section.

import evaluate

# Incremental computation, one sample at a time
accuracy = evaluate.load("accuracy")
for ref, pred in zip([0,1,0,1], [1,0,0,1]):
    accuracy.add(references=ref, predictions=pred)
accuracy.compute()

# Incremental computation, one batch at a time
accuracy = evaluate.load("accuracy")
for refs, preds in zip([[0,1],[0,1]], [[1,0],[0,1]]):
    accuracy.add_batch(references=refs, predictions=preds)
accuracy.compute()

# Several metrics at once
clf_metrics = evaluate.combine(["accuracy", "f1", "recall", "precision"])
results = clf_metrics.compute(predictions=[0, 1, 0], references=[0, 1, 1])
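To make the metric definitions concrete, here is a minimal pure-Python sketch that computes the same four metrics by hand on the inputs passed to `evaluate.combine` above. `binary_metrics` is a hypothetical helper, it assumes binary labels with 1 as the positive class, and it is not how `evaluate` implements these metrics.

```python
# Hypothetical helper; assumes binary labels with 1 as the positive class.

def binary_metrics(predictions, references):
    """Compute accuracy, precision, recall and F1 from paired labels."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)  # true positives
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)  # false positives
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)  # false negatives
    correct = sum(1 for p, r in pairs if p == r)

    accuracy = correct / len(pairs)
    # Fall back to 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "f1": f1, "recall": recall, "precision": precision}


# Same inputs as the evaluate.combine example.
results = binary_metrics(predictions=[0, 1, 0], references=[0, 1, 1])

# Same data as the sample-by-sample loop: references [0,1,0,1], predictions [1,0,0,1].
loop_accuracy = binary_metrics([1, 0, 0, 1], [0, 1, 0, 1])["accuracy"]
```

On these inputs two of three predictions are correct, the single positive prediction is right (precision 1.0), and one of the two positive references is found (recall 0.5). Feeding samples one by one or in batches only changes how the pairs are accumulated, not the final value.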


- **Visualization**

from evaluate.visualization import radar_plot

data = [
    {"accuracy": 0.99, "precision": 0.8, "f1": 0.95, "latency_in_seconds": 33.6},
    {"accuracy": 0.98, "precision": 0.87, "f1": 0.91, "latency_in_seconds": 11.2},
    {"accuracy": 0.98, "precision": 0.78, "f1": 0.88, "latency_in_seconds": 87.6},
    {"accuracy": 0.88, "precision": 0.78, "f1": 0.81, "latency_in_seconds": 101.6},
]

model_names = ["Model 1", "Model 2", "Model 3", "Model 4"]

plot = radar_plot(data=data, model_names=model_names)

iLovEing commented 11 months ago

trainer

The evaluation metrics supported for each task can be found here: open the task's page and look for its Metrics section.