wikipiifed - Automated dataset creation and Federated learning
This repo covers the automated creation of a dataset from Wikipedia biography pages and the use of that dataset for federated learning of a BERT-based named entity recognizer.
Running the scraper and creating the dataset
wikipii_dataset.ipynb is a walk-through of the dataset creation.
The dataset is created in the following steps:
- gathering the links to all living-people biography pages from Wikipedia
- scoring the pages based on the presence of named entities and filtering them accordingly (see the sketch after this list)
- starting parallel workers that scrape the pages and write CSV and text files containing the infobox data and scraped text
- splitting the dataset into train/test/validation sets based on the entity-presence score
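A minimal sketch of the entity-presence scoring step, assuming spaCy is used as the NER backend and the score is the fraction of tokens covered by entity spans; the repo's actual scorer, model, and filtering threshold may differ:

```python
# Hypothetical scorer for the filtering step; the actual NER backend,
# scoring formula, and cutoff used in the repo may differ.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumption: small English spaCy model

def entity_presence_score(text: str) -> float:
    """Fraction of tokens that fall inside a named-entity span."""
    doc = nlp(text)
    if len(doc) == 0:
        return 0.0
    entity_tokens = sum(ent.end - ent.start for ent in doc.ents)
    return entity_tokens / len(doc)

sample = "Marie Curie was born in Warsaw and studied physics in Paris."
print(entity_presence_score(sample))  # pages scoring below a cutoff would be dropped
```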
Running federated training
remote_bert.ipynb walks through the training of the BERT-base model on the dataset with two remote workers.
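The final model pairs a Hugging Face BERT-base encoder with a per-token classification head. A hedged sketch of that shape is below; the notebook's actual class, label set, and tokenizer handling may differ:

```python
# Hedged sketch of a BERT-based token classifier; not the repo's exact class.
import torch
from transformers import BertModel, BertTokenizerFast

class BertNER(torch.nn.Module):
    def __init__(self, num_labels: int):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = torch.nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        # First output of the encoder is the per-token hidden states.
        sequence_output = self.bert(input_ids=input_ids, attention_mask=attention_mask)[0]
        return self.classifier(sequence_output)  # per-token logits

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
batch = tokenizer(["Marie Curie was born in Warsaw."], return_tensors="pt")
logits = BertNER(num_labels=5)(batch["input_ids"], batch["attention_mask"])  # 5 labels is illustrative
```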
The notebook can be run in Colab.
After running the Installing PySyft code block, the Colab runtime has to be restarted; the remaining cells can then be executed.
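The install cell looks roughly like the following (the notebook may pin specific versions); the restart is needed so Colab picks up the freshly installed packages:

```python
# Colab cell (assumption: exact package pins are in the notebook)
!pip install syft

# After this cell finishes, restart the runtime (Runtime -> Restart runtime),
# then execute the remaining cells.
```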
The notebook has the following sections:
- Dataset Class - class for loading the dataset from the text files
- BERT Model - modified version of the Hugging Face BERT-base model and the class for the final model
- Data Iterators - loading the training/test/evaluation sets from files into dataset objects
- Training - loading the model, distributing the dataset to the remote workers, and running federated training (sketched below)
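A condensed sketch of the distribute-and-train step, assuming the PySyft 0.2-style API (TorchHook, VirtualWorker, FederatedDataLoader); the toy tensors and linear model stand in for the tokenized NER dataset and the BERT-based model used in the notebook:

```python
# Illustrative federated training loop in the PySyft 0.2 style; the notebook's
# actual worker setup, model, and hyperparameters may differ.
import torch
import torch.nn.functional as F
from torch import nn, optim
import syft as sy

hook = sy.TorchHook(torch)                    # patch torch with PySyft tensors
alice = sy.VirtualWorker(hook, id="alice")    # assumption: two simulated remote workers
bob = sy.VirtualWorker(hook, id="bob")

# Toy tensors stand in for the tokenized features/labels read from the text files.
features = torch.randn(32, 16)
labels = torch.randint(0, 2, (32,))
federated_loader = sy.FederatedDataLoader(
    sy.BaseDataset(features, labels).federate((alice, bob)), batch_size=8
)

model = nn.Linear(16, 2)                      # placeholder for the BERT-based model
optimizer = optim.SGD(model.parameters(), lr=0.01)

for data, target in federated_loader:
    model.send(data.location)                 # ship the model to whichever worker holds the batch
    optimizer.zero_grad()
    output = model(data)
    loss = F.cross_entropy(output, target)
    loss.backward()
    optimizer.step()
    model.get()                               # pull the updated weights back
```

The loop follows the common PySyft pattern of sending the model to the worker that holds each batch, training there, and retrieving the updated weights after every step.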