huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference

Watermarking cannot be detected #2474

Open vorwerkc opened 2 weeks ago

vorwerkc commented 2 weeks ago

System Info

text-generation-inference version: 2.2.0
model: "mistralai/Mixtral-8x7B-Instruct-v0.1"

Reproduction

  1. Start TGI with WATERMARK_GAMMA=0.25 and WATERMARK_DELTA=2
  2. Execute generate or generate_stream with watermark=True (see the client sketch below)
  3. Use the local detection dashboard from https://github.com/jwkirchenbauer/lm-watermarking with the correct tokenizer to detect the watermark
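
For completeness, step 2 can be reproduced with a short Python client call. This is only a sketch using huggingface_hub's InferenceClient (the text_generation client that ships with TGI would work similarly); the endpoint URL, prompt, and generation parameters are placeholders:

```python
# Sketch of step 2: request a watermarked generation from a running TGI instance.
# Assumes the server was started with WATERMARK_GAMMA=0.25 and WATERMARK_DELTA=2
# and is reachable at the (placeholder) URL below.
from huggingface_hub import InferenceClient

client = InferenceClient(model="http://127.0.0.1:8080")

text = client.text_generation(
    "Write a short paragraph about the history of the printing press.",
    max_new_tokens=200,
    watermark=True,  # per-request switch described in the TGI docs
)
print(text)
```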

Expected behavior

From the documentation, watermarking seems to be configured server-side for all models and can then be enabled per API call with watermark=True.

I have not been able to detect the watermark in any output generated by "mistralai/Mixtral-8x7B-Instruct-v0.1".

ErikKaum commented 2 weeks ago

Hi @vorwerkc 👋

Thanks for reporting this. Could you provide a bit more guidance on how you detect the watermarking? I'm not super familiar with the lm-watermarking code. For example, what would be the best way for me to repro this locally? Thanks in advance 🙌

vorwerkc commented 1 week ago

The easiest (though not the most concise) working example is to define the following two classes:

```python
import torch


class WatermarkBase:
    def __init__(
        self,
        vocab: list[int] = None,
        gamma: float = 0.5,
        delta: float = 2.0,
        hash_key: int = 15485863,  # just a large prime number to create an rng seed with sufficient bit width
        select_green_tokens: bool = True,
    ):
        # watermarking parameters
        self.vocab = vocab
        self.vocab_size = len(vocab)
        self.gamma = gamma
        self.delta = delta
        self.rng = None
        self.hash_key = hash_key
        self.select_green_tokens = select_green_tokens

    def _seed_rng(self, input_ids: torch.LongTensor) -> None:
        # seed the generator from the last token of the prefix
        assert input_ids.shape[-1] >= 1, "seeding requires at least a 1 token prefix sequence to seed rng"
        prev_token = input_ids[-1].item()
        self.rng.manual_seed(self.hash_key * prev_token)

    def _get_greenlist_ids(self, input_ids: torch.LongTensor) -> torch.LongTensor:
        # seed the rng using the previous tokens/prefix according to the seeding scheme
        self._seed_rng(input_ids)

        greenlist_size = int(self.vocab_size * self.gamma)
        vocab_permutation = torch.randperm(self.vocab_size, device=input_ids.device, generator=self.rng)
        if self.select_green_tokens:  # select green tokens directly
            greenlist_ids = vocab_permutation[:greenlist_size]
        else:  # select green via red (legacy behavior)
            greenlist_ids = vocab_permutation[(self.vocab_size - greenlist_size):]
        return greenlist_ids
```

```python
from math import sqrt

import scipy.stats
import torch
from torch import Tensor
from transformers import PreTrainedTokenizerBase


class WatermarkDetector(WatermarkBase):
    def __init__(
        self,
        *args,
        device: torch.device = None,
        tokenizer: PreTrainedTokenizerBase = None,
        z_threshold: float = 4.0,
        normalizers: list[str] = ["unicode"],  # or also: ["unicode", "homoglyphs", "truecase"]
        **kwargs,
    ):
        super().__init__(*args, **kwargs)
        # also configure the metrics returned/preprocessing options
        assert device, "Must pass device"
        assert tokenizer, "Need an instance of the generating tokenizer to perform detection"

        self.tokenizer = tokenizer
        self.device = device
        self.z_threshold = z_threshold
        self.rng = torch.Generator(device=self.device)

        self.min_prefix_len = 1

        # text normalization is not applied in this simplified detector
        self.normalizers = []

    def _compute_z_score(self, observed_count, T):
        # observed_count is the number of green tokens, T is the total number of scored tokens
        expected_count = self.gamma
        numer = observed_count - expected_count * T
        denom = sqrt(T * expected_count * (1 - expected_count))
        z = numer / denom
        return z

    def _compute_p_value(self, z):
        # one-sided p-value under the null hypothesis of unwatermarked text
        p_value = scipy.stats.norm.sf(z)
        return p_value

    def score(self, text: str):
        tokenized_text = self.tokenizer(text, return_tensors="pt", add_special_tokens=False)["input_ids"][0].to(self.device)
        return self._score_sequence(tokenized_text)

    def _score_sequence(
        self,
        input_ids: Tensor,
    ):
        num_tokens_scored = len(input_ids) - self.min_prefix_len
        if num_tokens_scored < 1:
            raise ValueError(
                (
                    "Must have at least 1 token to score after "
                    f"the first min_prefix_len={self.min_prefix_len} tokens required by the seeding scheme."
                )
            )
        # Standard method.
        # Since we generally need at least 1 token (for the simplest scheme)
        # we start the iteration over the token sequence with a minimum
        # num tokens as the first prefix for the seeding scheme,
        # and at each step, compute the greenlist induced by the
        # current prefix and check if the current token falls in the greenlist.
        green_token_count, green_token_mask = 0, []
        for idx in range(self.min_prefix_len, len(input_ids)):
            curr_token = input_ids[idx]
            greenlist_ids = self._get_greenlist_ids(input_ids[:idx])
            if curr_token in greenlist_ids:
                green_token_count += 1
                green_token_mask.append(True)
            else:
                green_token_mask.append(False)

        z_score = self._compute_z_score(green_token_count, num_tokens_scored)
        p_value = self._compute_p_value(z_score)
        return z_score, p_value
```

Once you have access to the tokenizer of the model served by TGI, you can define a WatermarkDetector as follows:

```python
# Assumes `tokenizer` and `device` are already set up, e.g.:
# tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
# device = torch.device("cuda")
detector = WatermarkDetector(
    vocab=list(tokenizer.get_vocab().values()),
    gamma=0.25,
    device=device,
    tokenizer=tokenizer,
    z_threshold=4.0,
    normalizers=[],
    select_green_tokens=True,
)
```

The gamma and delta parameters have to be identical to the ones defined in the TGI instance (WATERMARK_GAMMA and WATERMARK_DELTA). You can then detect any LLM-generated string with

```python
z_score, p_value = detector.score(string)
```

For watermarked text, the z-score should be large (ideally above the z_threshold of 4) and the p-value should be vanishingly small.
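
To give a sense of the magnitudes involved, here is a small sanity check of the z-score formula above with made-up counts (gamma=0.25, 200 scored tokens); the numbers are illustrative only:

```python
from math import erf, sqrt

# Illustrative numbers only (not measured output): gamma = 0.25, T = 200 scored tokens.
gamma, T = 0.25, 200

def z_and_p(observed_green: int) -> tuple[float, float]:
    z = (observed_green - gamma * T) / sqrt(T * gamma * (1 - gamma))
    p = 0.5 * (1 - erf(z / sqrt(2)))  # one-sided p-value, same as scipy.stats.norm.sf(z)
    return z, p

print(z_and_p(50))  # ~unwatermarked text: z ≈ 0.0, p ≈ 0.5
print(z_and_p(90))  # watermarked text:    z ≈ 6.5, p ≈ 3e-11
```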

I looked into the original code (https://github.com/jwkirchenbauer/lm-watermarking) and found that the detection code has to run on the same device as the watermarking code, typically CUDA. Otherwise the random-number generators will be inconsistent. Even when running on CUDA, the detection seems to be inconsistent.
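
The device dependence can be reproduced in isolation: with the same seed, torch.randperm draws different permutations from a CPU generator and a CUDA generator, so the reconstructed green lists will not match across devices. A minimal sketch, assuming a CUDA device is available:

```python
import torch

# Same seed, different devices: the permutations (and hence the green lists) differ.
seed = 15485863 * 42  # hash_key * previous token id, as in _seed_rng above

cpu_rng = torch.Generator(device="cpu").manual_seed(seed)
cuda_rng = torch.Generator(device="cuda").manual_seed(seed)

perm_cpu = torch.randperm(32000, generator=cpu_rng)
perm_cuda = torch.randperm(32000, device="cuda", generator=cuda_rng)

# The leading "green" token ids typically do not agree across devices.
print(perm_cpu[:10])
print(perm_cuda[:10].cpu())
```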