panuthept / IRIS

Improving Robustness of LLMs on Input Variations by Mitigating Spurious Intermediate States
Apache License 2.0

[Instruction Augmentation] Add PAIR Jailbreak method #13

Closed panuthept closed 1 week ago

panuthept commented 2 months ago

We can use EasyJailbreak (https://github.com/EasyJailbreak/EasyJailbreak). The interface should look something like this:

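from typing import List

from iris.augmentations.instruction_augmentations import InstructionAugmentation


class PAIRJailbreak(InstructionAugmentation):
    def __init__(self, attack_model, target_model, eval_model):
        # attack_model proposes prompts, target_model is attacked, eval_model judges the result.
        pass

    def augment(self, instruction: str, reference_response: str) -> List[str]:
        # Should return the jailbreak variants generated for `instruction`.
        pass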
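For discussion, here is a rough sketch of how the PAIR loop could sit behind that interface, without depending on EasyJailbreak. Everything in it is an assumption: the models are plain callables rather than IRIS model wrappers, `InstructionAugmentation` is a local stand-in for the real base class, and `max_iterations` and `success_score` are hypothetical parameters that are not part of the proposed signature.

```python
from typing import Callable, List


class InstructionAugmentation:
    """Local stand-in for the IRIS base class (assumed shape, not the real one)."""

    def augment(self, instruction: str, reference_response: str) -> List[str]:
        raise NotImplementedError


class PAIRJailbreak(InstructionAugmentation):
    """Sketch of an iterative-refinement loop: attack, query target, judge, repeat."""

    def __init__(
        self,
        attack_model: Callable[[str], str],      # proposes a candidate prompt
        target_model: Callable[[str], str],      # model under test
        eval_model: Callable[[str, str], int],   # scores (instruction, response)
        max_iterations: int = 5,                 # hypothetical parameter
        success_score: int = 10,                 # hypothetical parameter
    ) -> None:
        self.attack_model = attack_model
        self.target_model = target_model
        self.eval_model = eval_model
        self.max_iterations = max_iterations
        self.success_score = success_score

    def augment(self, instruction: str, reference_response: str) -> List[str]:
        candidates: List[str] = []
        feedback = ""
        for _ in range(self.max_iterations):
            # The attacker sees the goal plus feedback from the previous round.
            attack_prompt = (
                f"Goal: {instruction}\n"
                f"Desired response prefix: {reference_response}\n"
                f"{feedback}"
            )
            candidate = self.attack_model(attack_prompt)
            response = self.target_model(candidate)
            score = self.eval_model(instruction, response)
            candidates.append(candidate)
            if score >= self.success_score:
                break  # the judge considers this candidate successful
            feedback = (
                f"Previous prompt: {candidate}\n"
                f"Target response: {response}\n"
                f"Judge score: {score}"
            )
        return candidates
```

This version returns every candidate tried, in order. Returning only the successful ones would also fit the `List[str]` return type, so that choice is open.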
panuthept commented 2 months ago

Modification scope:

src/iris/augmentations/instruction_augmentations
panuthept commented 1 month ago

You can use GPT-4o as an attack model for the sake of simplicity.

import os

from llama_index.llms.openai import OpenAI
from iris.model_wrappers.generative_models import APIGenerativeLLM

# Read the API key from the environment instead of hard-coding it.
attack_model = APIGenerativeLLM(
    llm=OpenAI(
        model="gpt-4o",
        api_key=os.environ.get("OPENAI_API_KEY"),
    ),
)
popochangli commented 1 month ago

0%| | 0/100 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "d:\Users\mpmac\Documents\GitHub\IRISS\IRIS\src\iris\augmentations\instruction_augmentations\jailbreaks\multilingual_jailbreak.py", line 136, in <module>
    jailbreaked_samples = augmentation.augment_batch(harmful_samples)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\mpmac\intern\iris\src\iris\augmentations\instruction_augmentations\jailbreaks\base.py", line 37, in augment_batch
    samples = super().augment_batch(samples, verbose=verbose)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\mpmac\intern\iris\src\iris\augmentations\instruction_augmentations\base.py", line 25, in augment_batch
    return [self.augment_sample(sample) for sample in tqdm(samples, disable=not verbose)]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\mpmac\intern\iris\src\iris\augmentations\instruction_augmentations\base.py", line 25, in <listcomp>
    return [self.augment_sample(sample) for sample in tqdm(samples, disable=not verbose)]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\mpmac\intern\iris\src\iris\augmentations\instruction_augmentations\base.py", line 20, in augment_sample
    sample.instructions = self.augment(original_instruction, reference_answers)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\mpmac\intern\iris\src\iris\augmentations\instruction_augmentations\jailbreaks\base.py", line 29, in augment
    attack_results = self._attack(instruction=instruction, reference_answers=reference_answers)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\Users\mpmac\Documents\GitHub\IRISS\IRIS\src\iris\augmentations\instruction_augmentations\jailbreaks\multilingual_jailbreak.py", line 94, in _attack
    transformed_dataset = mutation(instance_dataset)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\mpmac\anaconda3\envs\iris\Lib\site-packages\easyjailbreak\mutation\mutation_base.py", line 37, in __call__
    mutated_instance_list = self._get_mutated_instance(instance, *args, **kwargs)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\mpmac\anaconda3\envs\iris\Lib\site-packages\easyjailbreak\mutation\rule\Translate.py", line 38, in _get_mutated_instance
    new_seed = self.translate(seed)
               ^^^^^^^^^^^^^^^^^^^^
  File "d:\Users\mpmac\Documents\GitHub\IRISS\IRIS\src\iris\augmentations\instruction_augmentations\jailbreaks\multilingual_jailbreak.py", line 28, in translate
    self.cache_storage.cache(translation, text)
TypeError: CacheStorage.cache() missing 1 required positional argument: 'temperature'
0%| | 0/100 [00:00<?, ?it/s]

I tried running multilingual again today and ran into this problem.
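The TypeError suggests that `CacheStorage.cache()` now expects a `temperature` argument that the call in `multilingual_jailbreak.py` does not pass. The real `CacheStorage` signature is not shown in this thread, so the class below is a hypothetical stand-in (the parameter names `value` and `key` and the `0.0` temperature are assumptions). It only reproduces the same kind of error and shows the likely shape of the fix:

```python
from typing import Dict, Tuple


class CacheStorage:
    """Hypothetical stand-in; not the actual IRIS CacheStorage."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, float], str] = {}

    def cache(self, value: str, key: str, temperature: float) -> None:
        # Key the entry on both the input text and the sampling temperature.
        self._store[(key, temperature)] = value


storage = CacheStorage()

# Two-argument call, mirroring the traceback: raises TypeError.
try:
    storage.cache("bonjour", "hello")
except TypeError as err:
    error_message = str(err)

# Passing the temperature explicitly satisfies the stand-in signature.
storage.cache("bonjour", "hello", temperature=0.0)
```

If the real method has the same shape, updating the `translate` call at line 28 to pass the temperature used for the translation request should clear the error.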