poteminr / instruct-ner

Instruct LLMs for flat and nested NER. Fine-tuning Llama and Mistral models for instruction named entity recognition. (Instruction NER)
Apache License 2.0

unable to get the input format #1

Closed. pavanbaswani closed this issue 1 year ago

pavanbaswani commented 1 year ago

Could you please describe the input format required to train the models?

poteminr commented 1 year ago

Hi! Of course. Today I'll provide examples for a custom NER dataset.

poteminr commented 1 year ago

@pavanbaswani You should prepare dicts like this one:

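{
 # instruction, in English: "You are solving an NER task. Extract from the text the words that belong to each of the following entities: Drugname, Drugclass, DI, ADR, Finding."
 'instruction': 'Ты решаешь задачу NER. Извлеки из текста слова, относящиеся к каждой из следующих сущностей: Drugname, Drugclass, DI, ADR, Finding.',
 # input, in English: "This is good old Rimantadine, only in syrup form."
 'input': 'Это старый-добрый Римантадин, только в сиропе.\n',
 'output': 'Drugname: Римантадин\nDrugclass: \nDrugform: сиропе\nDI: \nADR: \nFinding: \n',
 # source wraps instruction and input in the headers "### Задание:" (Task), "### Вход:" (Input), "### Ответ:" (Answer)
 'source': '### Задание: Ты решаешь задачу NER. Извлеки из текста слова, относящиеся к каждой из следующих сущностей: Drugname, Drugclass, DI, ADR, Finding.\n### Вход: Это старый-добрый Римантадин, только в сиропе.\n### Ответ: ',
 'raw_entities': {'Drugname': ['Римантадин'],
  'Drugclass': [],
  'Drugform': ['сиропе'],
  'DI': [],
  'ADR': [],
  'Finding': []},
 'id': '1_2555494.tsv'}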

Where:

  1. instruction - the general task prompt, identical for every sample. You can find it in instruction_ner/flat_utils/instruct_utils.py
  2. input - the text to run NER on.
  3. output - the target string, generated automatically from raw_entities.
  4. source - the model prompt, built by applying MODEL_INPUT_TEMPLATE from instruction_ner/flat_utils/instruct_utils.py
  5. raw_entities - a dict mapping each entity type to the list of entities parsed from the text.
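
To make the relationship between the fields concrete, here is a minimal sketch of how output and source could be derived from raw_entities, instruction, and input. It is not the repository's code: `ASSUMED_TEMPLATE`, `entities_to_output`, and `make_sample` are hypothetical names, the template is reverse-engineered from the sample dict, and the real MODEL_INPUT_TEMPLATE in instruction_ner/flat_utils/instruct_utils.py may differ.

```python
# Assumed template, reverse-engineered from the 'source' field of the sample
# dict. The real MODEL_INPUT_TEMPLATE is defined in
# instruction_ner/flat_utils/instruct_utils.py and may differ.
ASSUMED_TEMPLATE = "### Задание: {instruction}\n### Вход: {inp}\n### Ответ: "


def entities_to_output(raw_entities):
    """Render one 'Label: values' line per entity type, in dict order."""
    # Joining several values with ', ' is an assumption; the sample only
    # shows entity types with zero or one value.
    return "".join(
        f"{label}: {', '.join(values)}\n" for label, values in raw_entities.items()
    )


def make_sample(instruction, text, raw_entities, sample_id):
    """Assemble one training dict with the six keys shown in the sample."""
    return {
        "instruction": instruction,
        "input": text,
        "output": entities_to_output(raw_entities),
        # The sample's 'source' embeds the input without its trailing newline.
        "source": ASSUMED_TEMPLATE.format(instruction=instruction, inp=text.strip()),
        "raw_entities": raw_entities,
        "id": sample_id,
    }


sample = make_sample(
    instruction="Ты решаешь задачу NER. Извлеки из текста слова, относящиеся к "
    "каждой из следующих сущностей: Drugname, Drugclass, DI, ADR, Finding.",
    text="Это старый-добрый Римантадин, только в сиропе.\n",
    raw_entities={"Drugname": ["Римантадин"], "Drugclass": [], "Drugform": ["сиропе"],
                  "DI": [], "ADR": [], "Finding": []},
    sample_id="1_2555494.tsv",
)
print(sample["output"])
```

With the values from the sample dict, this sketch reproduces the output and source strings shown there, which suggests the assumed template is at least close to the real one.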

Thus, you should:

  1. prepare input
  2. change MODEL_INPUT_TEMPLATE
  3. prepare raw_entities
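
Once the three steps are done, a quick sanity check on each prepared dict can catch formatting mistakes before training. The sketch below is a hypothetical helper, not part of the repository; `REQUIRED_KEYS` and `check_sample` are illustrative names, and the rule that every entity must appear verbatim in input is an assumption about extractive NER data.

```python
REQUIRED_KEYS = {"instruction", "input", "output", "source", "raw_entities", "id"}


def check_sample(sample):
    """Return a list of human-readable problems; an empty list means OK."""
    missing = REQUIRED_KEYS - set(sample)
    if missing:
        return [f"missing keys: {sorted(missing)}"]

    problems = []
    output_lines = sample["output"].splitlines()
    for label, values in sample["raw_entities"].items():
        # Assumption: entities are copied verbatim from the input text.
        for value in values:
            if value not in sample["input"]:
                problems.append(f"{label}: {value!r} not found in input")
        # Every entity type should have its own line in the target string.
        if not any(line.startswith(f"{label}:") for line in output_lines):
            problems.append(f"{label}: no line in output")
    # The prompt should embed the text that the model is asked to tag.
    if sample["input"].strip() not in sample["source"]:
        problems.append("source does not contain the input text")
    return problems
```

Calling `check_sample` on every dict and printing any non-empty result, together with the sample's id, is usually enough to locate malformed records.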

Feel free to ask any questions.