piisa / muliwai

experimental PII framework
Apache License 2.0

Invalid indentation in `ner_manager.py` #2

Closed: yenson-lau closed this issue 2 years ago

yenson-lau commented 2 years ago

The code at https://github.com/piisa/muliwai/blob/main/ner_manager.py#L364 has invalid indentation:

def spacy_ner(...):
      # ...
      if  nlp is None:
      #init spacy pipeline      # <--- this indent (possibly below) needs to be fixed!
      if src_lang == 'en':
        nlp = self.en_spacy_nlp
      # ...
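
An indentation problem like the one above can be caught before import by compiling the source text. The following is a minimal sketch using only the built-in `compile()`; the helper name `find_syntax_error` and the two sample snippets are hypothetical stand-ins, not code taken from `ner_manager.py`.

```python
def find_syntax_error(source: str, filename: str = "<snippet>"):
    """Return (line_number, message) for the first syntax error, or None if the source compiles."""
    try:
        compile(source, filename, "exec")
    except SyntaxError as err:  # IndentationError is a subclass of SyntaxError
        return err.lineno, err.msg
    return None


# Hypothetical reduction of the reported problem: the body of the
# first `if` is not indented, so Python rejects the file.
broken = (
    "def spacy_ner(nlp, src_lang):\n"
    "    if nlp is None:\n"
    "    if src_lang == 'en':\n"
    "        nlp = 'en pipeline'\n"
)

# The same snippet with the nested block indented one level deeper.
fixed = (
    "def spacy_ner(nlp, src_lang):\n"
    "    if nlp is None:\n"
    "        if src_lang == 'en':\n"
    "            nlp = 'en pipeline'\n"
    "    return nlp\n"
)

print(find_syntax_error(broken))
print(find_syntax_error(fixed))
```

To check a real file, read it and pass its contents to the helper, for example `find_syntax_error(open("ner_manager.py").read(), "ner_manager.py")`.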

huu4ontocord commented 2 years ago

Fixed. Lmk if it works for you. Also feel free to PR anything.

yenson-lau commented 2 years ago

Thanks. Unfortunately I couldn't get detect_ner_with_hf_model() to work with GPU. It's definitely not anything on my end. Can you check what's going on?

I'm calling

apply_anonymization("Bob and Amy are eating apples in Jack's home.", device="cuda")["text"]

>> RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)

The apply_anonymization function is pretty much the same as the one in your demo code:

def apply_anonymization(
    sentence: str,
    lang_id: str = "en",
    context_window: int = 20,
    anonymize_condition = True,
    tag_type = {'IP_ADDRESS', 'KEY', 'ID', 'PHONE', 'USER', 'EMAIL', 'LICENSE_PLATE', 'PERSON'},
    device: str = "cpu",
) -> dict:
    """
    Params:
    ==================
    sentence: str, the sentence to be anonymized
    lang_id: str, the language id of the sentence
    context_window: int, the context window size
    anonymize_condition: bool, whether to anonymize the detected entities (defaults to True)
    tag_type: iterable, the tag types of the anonymization. By default: {'IP_ADDRESS', 'KEY', 'ID', 'PHONE', 'USER', 'EMAIL', 'LICENSE_PLATE', 'PERSON'}
    device: cpu or cuda:{device_id}

    """

    if tag_type is None:
        tag_type = regex_rulebase.keys()

    lang_id = lang_id.split("_")[0]

    ner_ids = detect_ner_with_regex_and_context(
        sentence=sentence,
        src_lang=lang_id,
        context_window=context_window,
        tag_type=tag_type,
    )

    ner_persons = detect_ner_with_hf_model(
        sentence=sentence,
        src_lang=lang_id,
        device=device,
    )

    ner = list(set(ner_ids + ner_persons))
    ner.sort(key=lambda a: a[1])

    if anonymize_condition:
        new_sentence, new_ner, _ = augment_anonymize(sentence, lang_id, ner, )
        doc = {'text': new_sentence, 'ner': new_ner, 'orig_text': sentence, 'orig_ner': ner}
    else:
        new_sentence = sentence
        doc = {'text': new_sentence, 'ner': ner}

    return doc
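
The merge step in the function above (`list(set(ner_ids + ner_persons))` followed by a sort on index 1) can be tried in isolation. This sketch assumes each detection is a tuple of `(text, start, end, label)`; that shape is an assumption made for illustration and may not match what muliwai's detectors return. The helper name `merge_spans` is also hypothetical.

```python
def merge_spans(*span_lists):
    """Combine span lists, drop exact duplicates, and sort by start offset."""
    merged = set()
    for spans in span_lists:
        merged.update(spans)
    # Sort by start offset (index 1, as in the function above); the full
    # tuple is used as a tie-breaker so the order is deterministic.
    return sorted(merged, key=lambda span: (span[1], span))


# Hypothetical detections for the text "Bob at bob@example.com".
regex_hits = [("bob@example.com", 7, 22, "EMAIL")]
model_hits = [("Bob", 0, 3, "PERSON"), ("bob@example.com", 7, 22, "EMAIL")]

for span in merge_spans(regex_hits, model_hits):
    print(span)
```

Using a set only removes detections that are identical in every field; two overlapping spans with different labels would both be kept.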

huu4ontocord commented 2 years ago

Try it now. It was expecting "cuda:0" or some other device number. I fixed it so it can accept just "cuda". The fix is in this notebook: https://colab.research.google.com/drive/1olv6IMEP5SkwJb8CFyR2aZV19_9XdlZp#scrollTo=bsf83d-WFoNX
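
For readers who hit the same error elsewhere, one way to accept both "cuda" and "cuda:0" is to normalize the device string before it is used. This is only a sketch of the idea, not the change actually made in muliwai; the helper name `normalize_device` and the choice to default a bare "cuda" to index 0 are assumptions.

```python
def normalize_device(device: str) -> str:
    """Map 'cpu', 'cuda', or 'cuda:N' to a canonical device string."""
    device = device.strip().lower()
    if device == "cpu":
        return "cpu"
    if device == "cuda":
        # Assumption: a bare "cuda" means the first GPU.
        return "cuda:0"
    prefix, sep, index = device.partition(":")
    if prefix == "cuda" and sep and index.isascii() and index.isdigit():
        return f"cuda:{int(index)}"
    raise ValueError(f"Unsupported device string: {device!r}")


print(normalize_device("cuda"))
print(normalize_device("cuda:1"))
print(normalize_device("cpu"))
```

With a single canonical string, the model and its input tensors can be moved to the same device, which is what the "Expected all tensors to be on the same device" error is complaining about.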

yenson-lau commented 2 years ago

Yep, this works. Thanks for making the update!