This project was developed as part of the "KI-Projekt" course during the summer term of 2024 at OTH Regensburg by Ben, Felix, Lars and Simon. It is designed to be used for scientific research only. The goal of the project is to scrape and match german news articles from sources that provide content in both easy (in german: "leichte" or "einfache Sprache") and standard language. For simplicity, we refer to the articles as easy or hard. Currently supported sources for the articles are:
The code is built modularly. Main modules are:
Modules may be used individually as needed.
The project uses a custom data structure consisting of folders and files (txt, json, csv, html, mp3) to store the scraped data. The The data is stored in the git root directory like:
data/
├── <source>/ (dlf or mdr)
│ ├── matches_<source>.csv
│ ├── <language niveau>/ (easy or hard)
│ │ ├── lookup_<source>_<niveau>.csv
│ │ ├── 2023-06-01-Sample_Article/
│ │ │ ├── Metadata.json
│ │ │ ├── Content.txt
│ │ │ ├── Raw.html
│ │ │ └── Audio.mp3 (if available)
On runtime the data can be read into Pandas DataFrames with the DataHandler read capability.
git clone https://github.com/larsaars/KIP_EinfachErklaert.git
cd KIP_EinfachErklaert
pip install -r requirements.txt
The scrapers are designed to be executed on a regular basis (e.g., by weekly cron jobs on a server). The following table shows the most important scrapers with a short explanation and frequency:
File | Functionality |
---|---|
scrapers/dlf/scrape_Deutschlandfunk.py |
Scrapes todays articles from Deutschlandfunk (hard) |
scrapers/dlf/scrape_Nachrichtenleicht.py |
Scrapes articles of several weeks (depending on how many articles specified in the api) from Nachrichtenleicht (easy) |
scrapers/dlf/scrape_InstaCaptions.py |
Scrapes captions of all posts on the "nachrichtenleicht" Instagram profile and analyzes images for titles |
scrapers/mdr/current_news_scraper.py |
Scrapes five days of articles easy and hard articles from MDR |
scrapers/mdr/historic_news_scraper.py |
Scrapes old easy and hard articles from MDR |
The DataHandler is not an executable but a module to use when further developing scrapers or matcher and dealing with data storage (read, write search). Examples how to use the DataHandler can be found here.