NED extraction from the text (filter)

Enhancement of Document Analysis for Named Entity Data (NED) Extraction

Overview

The current document analysis system needs to be upgraded to extract Named Entity Data (NED), such as language, dates, and locations, with greater precision and a more organized output structure.

Deliverables

A Python module should be developed that:

Takes a text string as an input.
Analyzes the text to extract the language, dates, and locations.
Returns the extracted data in a structured Python object.

Detailed Description

The focus of the NED extraction from documents should be on the following areas:

Language Detection: The langdetect library should be used for detection. The output should consist of the language code and its English name. This helps prioritize articles written in the language the user speaks.
Date Extraction: Dates should be extracted from various document formats. Initially, spaCy can be used for extraction, and other tools can be evaluated for better accuracy. If no date is present in the document, the retrieval date should be used, so that every document carries a date for judging how current and relevant it is.
Location Identification: Locations should be extracted and converted into both printable addresses and Cartesian coordinates relative to Munich for proximity calculations. Each detected location should be cross-referenced with the language of the document and the domain of the source to improve the accuracy and relevance of the geographic data. This helps detect structures and programs that match the user's location.
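As a rough illustration of the language detection item, the sketch below wraps langdetect's `detect` function and maps the returned code to an English name. The `LANGUAGE_NAMES` table, the `detect_language` function, and the injectable `detector` argument are assumptions made for this example, not part of the existing code base. Note that langdetect raises an exception when the text has no usable features (for example, an empty string); error handling is left out here.

```python
from typing import Callable, Optional

# Illustrative subset only (assumption); a real module needs a complete table.
LANGUAGE_NAMES = {
    "de": "German",
    "en": "English",
    "es": "Spanish",
    "fr": "French",
    "it": "Italian",
}


def _langdetect_detect(text: str) -> str:
    # Imported lazily so the rest of the module loads without langdetect installed.
    from langdetect import DetectorFactory, detect

    DetectorFactory.seed = 0  # make langdetect's results reproducible
    return detect(text)


def detect_language(text: str, detector: Optional[Callable[[str], str]] = None) -> dict:
    """Return {"code": ..., "print": ...} for the dominant language of `text`.

    `detector` is a hypothetical hook that lets tests swap out langdetect.
    """
    detector = detector or _langdetect_detect
    code = detector(text)
    # Fall back to the raw code when the name is missing from the table.
    return {"code": code, "print": LANGUAGE_NAMES.get(code, code)}
```

Passing the detector as an argument keeps the function testable without the third-party dependency; in production the default langdetect path would be used.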

Acceptance Criteria

The following tasks should be completed to resolve this issue:

[ ] Language:
- Detect the language of the document.
- Output the language code and its English name in a Python object.
[ ] Dates:
- Extract the most recent date from the document, or fall back to the retrieval (current) date if none is found.
- Output the date in a Python object.
[ ] Locations:
- Extract and verify all locations mentioned in the document.
- Cross-reference the detected locations with the document’s language and source domain to filter out irrelevant locations.
- Output these locations in a Python object as an array of geotagged objects that include Cartesian coordinates.
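The Cartesian coordinates in the location criteria could be computed along the following lines. This is a minimal sketch that assumes an equirectangular approximation, a reference point near central Munich, and x pointing east and y pointing north; the issue does not specify any of these, and `to_cartesian_km` and `geotag` are hypothetical names.

```python
import math
from typing import Tuple

# Assumed reference point: approximate coordinates of central Munich.
MUNICH_LAT = 48.137
MUNICH_LON = 11.575
EARTH_RADIUS_KM = 6371.0


def to_cartesian_km(lat: float, lon: float) -> Tuple[float, float]:
    """Approximate (x, y) offset from Munich in km; x is east, y is north."""
    # Scale longitude differences by the cosine of the mean latitude.
    mean_lat = math.radians((lat + MUNICH_LAT) / 2)
    x = math.radians(lon - MUNICH_LON) * math.cos(mean_lat) * EARTH_RADIUS_KM
    y = math.radians(lat - MUNICH_LAT) * EARTH_RADIUS_KM
    return x, y


def geotag(country: str, address: str, lat: float, lon: float) -> dict:
    """Build one geotagged location object in the shape the issue describes."""
    x, y = to_cartesian_km(lat, lon)
    return {"country": country, "address": address,
            "lat": lat, "lon": lon, "x": x, "y": y}
```

The approximation is adequate for proximity ranking over a few hundred kilometres; a proper map projection would be needed if higher accuracy matters.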

Resources and Tools

LangDetect Python Library: LangDetect Official Documentation
spaCy Entity Recognizer: spaCy NER Documentation
Nominatim Geolocation API: Nominatim API Documentation

Special Cases and Exceptions

Relative dates in documents, such as "updated X months ago", should be ignored.
When a document contains multiple geographical references, the most specific locations should be stored. For broad references, such as mentions of multiple cities or regions within a single country, all relevant geographic details should be stored.
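The date rule above could be sketched as follows: candidate date strings (for example, DATE entities returned by an entity recognizer) are filtered for relative expressions, parsed, and the latest one is kept, with the retrieval date as the fallback. The regular expression, the list of formats, and the function names are assumptions for illustration only.

```python
import re
from datetime import datetime
from typing import Iterable, Optional

# Assumed heuristic for relative expressions such as "updated 3 months ago".
RELATIVE_DATE = re.compile(r"\b(ago|yesterday|today|tomorrow)\b", re.IGNORECASE)

# Assumed set of absolute formats; month names are parsed in the active locale.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%d %B %Y", "%B %d, %Y")


def parse_absolute_date(candidate: str) -> Optional[datetime]:
    """Parse an absolute date string; return None for relative or unknown forms."""
    if RELATIVE_DATE.search(candidate):
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(candidate.strip(), fmt)
        except ValueError:
            continue
    return None


def latest_date(candidates: Iterable[str], retrieved_at: datetime) -> datetime:
    """Return the most recent parsed date, or the retrieval date if none parse."""
    parsed = [d for d in map(parse_absolute_date, candidates) if d is not None]
    return max(parsed) if parsed else retrieved_at
```

A dedicated date-parsing library would likely cover more formats; the fixed format list here only keeps the example dependency-free.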

Expected Output Format

The module should output a Python object formatted as follows:

{
  "language": {"code": "<language code>", "print": "<print version in English>"},
  "date": "<python datetime of the latest detected date, or current date>",
  "locations": [
    {"country": "<country>", "address": "<print address>", "lat": "<latitude>", "lon": "<longitude>", "x": "<km from Munich>", "y": "<km from Munich>"}
  ]
}
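One possible way to give this structure explicit types is with dataclasses, converted to a plain dictionary on output. The class names and the choice of `float` for the coordinate fields are assumptions; the issue only fixes the key names.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime
from typing import List


@dataclass
class Language:
    code: str   # language code, e.g. "de"
    print: str  # English name, e.g. "German" (key name taken from the issue)


@dataclass
class Location:
    country: str
    address: str
    lat: float
    lon: float
    x: float  # km from Munich
    y: float  # km from Munich


@dataclass
class NedResult:
    language: Language
    date: datetime
    locations: List[Location] = field(default_factory=list)

    def to_dict(self) -> dict:
        """Return the nested dictionary form shown in the expected output."""
        return asdict(self)
```

For example, `NedResult(Language("de", "German"), datetime(2024, 3, 12)).to_dict()` yields a dictionary with the `language`, `date`, and `locations` keys, the last being an empty list.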

The same filter will be applied to RAG documents and user chat flows.

svoi-fr / mirai