svoi-fr / mirai

Refugee assistant bot
https://docs.danswer.dev/
MIT License

NED extraction from the text (filter) #42

Open ptitzlabs opened 1 month ago

ptitzlabs commented 1 month ago

Enhancement of Document Analysis for Named Entity Data (NED) Extraction

Overview

The current document analysis system needs to be upgraded so that it extracts Named Entity Data (NED), such as language, dates, and locations, with greater precision and in a more organized output structure.

Deliverables

A Python module should be developed that:

Detailed Description

The focus of the NED extraction from documents should be on the following areas:

Acceptance Criteria

The following tasks should be completed to resolve this issue:

Resources and Tools

Special Cases and Exceptions

Expected Output Format

The module should output a Python object formatted as follows:

{
  "language": {"code": "<language code>", "print": "<print version in English>"},
  "date": "<python datetime of the latest detected date, or the current date if none is detected>",
  "locations": [
    {"country": "<country>", "address": "<print address>", "lat": "<latitude>", "lon": "<longitude>", "x": "<km from Munich>", "y": "<km from Munich>"}
  ]
}
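
The issue does not define the `x` and `y` fields beyond "km from Munich". One possible reading, sketched below, treats them as east-west and north-south offsets in kilometres from central Munich, computed with an equirectangular approximation. The reference coordinates, the function names `km_from_munich` and `make_location`, the rounding, and the sign convention (east and north positive) are all assumptions for illustration, not part of the spec.

```python
import math
from typing import Tuple

# Assumed reference point: central Munich (approximate coordinates).
MUNICH_LAT = 48.1374
MUNICH_LON = 11.5755
EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def km_from_munich(lat: float, lon: float) -> Tuple[float, float]:
    """Return (x, y) offsets in km from Munich; east and north are positive.

    Uses an equirectangular approximation, which is adequate for regional
    distances but loses accuracy far from the reference point.
    """
    mean_lat = math.radians((lat + MUNICH_LAT) / 2)
    x = math.radians(lon - MUNICH_LON) * math.cos(mean_lat) * EARTH_RADIUS_KM
    y = math.radians(lat - MUNICH_LAT) * EARTH_RADIUS_KM
    return x, y


def make_location(country: str, address: str, lat: float, lon: float) -> dict:
    """Build one entry of the "locations" list in the expected output."""
    x, y = km_from_munich(lat, lon)
    return {
        "country": country,
        "address": address,
        "lat": lat,
        "lon": lon,
        "x": round(x, 1),  # km east of Munich (assumed convention)
        "y": round(y, 1),  # km north of Munich (assumed convention)
    }
```

If the intended meaning of `x` and `y` turns out to be different (for example, projected map coordinates), only `km_from_munich` would need to change.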

The same filter will be applied to RAG documents and user chat flows.
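
Since the filter is meant to serve both RAG documents and chat messages, one way to keep it reusable is a single function that takes plain text. The sketch below illustrates only the date rule from the expected output (latest detected date, otherwise the current date). It detects ISO-style `YYYY-MM-DD` dates with a regular expression as a stand-in for whatever extraction method the module ends up using. The names `latest_date` and `extract_ned`, the regex, and the empty language and location values are hypothetical.

```python
import re
from datetime import datetime
from typing import Optional

# Stand-in date detector: matches ISO-style dates such as 2024-03-15.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def latest_date(text: str, now: Optional[datetime] = None) -> datetime:
    """Return the latest date found in text, or the current date if none."""
    found = []
    for year, month, day in ISO_DATE.findall(text):
        try:
            found.append(datetime(int(year), int(month), int(day)))
        except ValueError:
            continue  # skip impossible dates such as 2024-13-40
    if found:
        return max(found)
    return now if now is not None else datetime.now()


def extract_ned(text: str) -> dict:
    """Hypothetical shared entry point for RAG documents and chat messages."""
    return {
        # Language and location detection are out of scope for this sketch.
        "language": {"code": None, "print": None},
        "date": latest_date(text),
        "locations": [],
    }
```

Under this design, a RAG ingestion step and a chat handler would both call `extract_ned` on their text, so the two flows cannot drift apart in how they interpret dates.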