This project is an end-to-end solution for generating knowledge graphs from unstructured data. Named Entity Recognition (NER) can be run on the input with NLTK, spaCy, or the Stanford APIs. Optionally, coreference resolution can be performed through a Python wrapper around Stanford's CoreNLP API. Relation extraction is then done with Stanford's Open IE. Finally, post-processing produces a CSV file that can be uploaded to Graph Commons to visualize the knowledge graph.
More details can be found in the Approach folder.
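As a rough illustration of the post-processing step, the sketch below turns extracted relation triples and recognised entity types into CSV text. The function name `triples_to_csv`, the column names, and the `UNKNOWN` fallback are assumptions made for this example; they are not taken from create_structured_csv.py, and the columns Graph Commons actually expects should be checked against its import documentation.

```python
import csv
import io


def triples_to_csv(triples, entity_types):
    """Render (subject, relation, object) triples as CSV text.

    triples: iterable of (subject, relation, object) strings.
    entity_types: dict mapping an entity string to its NER label.
    Column names are hypothetical and chosen only for illustration.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["subject_type", "subject", "relation", "object_type", "object"])
    for subject, relation, obj in triples:
        writer.writerow([
            entity_types.get(subject, "UNKNOWN"),  # fallback label is an assumption
            subject,
            relation,
            entity_types.get(obj, "UNKNOWN"),
            obj,
        ])
    return buffer.getvalue()
```

Calling `triples_to_csv([("Alice", "founded", "Acme")], {"Alice": "PERSON", "Acme": "ORG"})` yields a header row followed by one row per triple.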
python3 knowledge_graph.py spacy
You can pass several arguments to knowledge_graph.py. For a more detailed list, refer to the section on running knowledge_graph.py below.
python3 relation_extractor.py
python3 create_structured_csv.py
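The commands in this section run the pipeline stages one after another. A minimal sketch of chaining them from Python is shown below. The helper names `build_pipeline_commands` and `run_pipeline` are hypothetical and not part of the project; the sketch assumes it is run from the directory that contains the three scripts.

```python
import subprocess


def build_pipeline_commands(ner_options=("spacy",)):
    """Return the three stage commands in the order they should run."""
    return [
        ["python3", "knowledge_graph.py", *ner_options],
        ["python3", "relation_extractor.py"],
        ["python3", "create_structured_csv.py"],
    ]


def run_pipeline(ner_options=("spacy",)):
    """Run each stage in turn and stop at the first failing stage."""
    for command in build_pipeline_commands(ner_options):
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(command, check=True)
```

Stopping at the first failure is a deliberate choice here, since each stage appears to depend on the output of the previous one.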
The following installation steps assume a Linux operating system and Python 3.
python3 -m venv <path_to_env/env_name>
source <path_to_env/env_name>/bin/activate
pip3 install spacy
python3 -m spacy download en_core_web_sm
pip3 install nltk
python3 -m nltk.downloader all
pip3 install stanfordcorenlp
sudo apt-get install python3-tk
pip3 install pandas
Performs Named Entity Recognition (NER) on the input data using NLTK, spaCy, or Stanford (or all of them). It also performs coreference resolution. The resolved coreferences are used by relation_extractor.py, and the recognised named entities are used by create_structured_csv.py.
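To make the idea of coreference resolution concrete, here is a small pure-Python sketch that replaces later mentions of an entity with its first mention. The input format (tokenised sentences plus clusters of 0-based, end-exclusive token spans) and the function name `resolve_coreferences` are assumptions for illustration only; the project obtains its clusters from Stanford CoreNLP, whose output format may differ.

```python
def resolve_coreferences(sentences, clusters):
    """Replace each later mention in a cluster with the cluster's first mention.

    sentences: list of token lists, one per sentence.
    clusters: list of clusters; each cluster is a list of
              (sentence_index, start_token, end_token) spans (assumed format).
    Returns one resolved string per sentence.
    """
    # Copy the tokens so the caller's lists are left untouched.
    resolved = [list(tokens) for tokens in sentences]
    replacements = []
    for cluster in clusters:
        if len(cluster) < 2:
            continue  # a single mention has nothing to resolve
        head_sent, head_start, head_end = cluster[0]
        head_tokens = sentences[head_sent][head_start:head_end]
        for sent, start, end in cluster[1:]:
            replacements.append((sent, start, end, head_tokens))
    # Apply right to left so earlier token indices stay valid.
    replacements.sort(key=lambda r: (r[0], r[1]), reverse=True)
    for sent, start, end, head_tokens in replacements:
        resolved[sent][start:end] = head_tokens
    return [" ".join(tokens) for tokens in resolved]
```

Resolving pronouns this way before relation extraction lets a triple refer to the named entity rather than to "she" or "it".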
Runs only on Linux-like operating systems, with paths of the form abc/def/file.txt.
Please note that the coreference resolution server requires around 4GB of free system RAM. If this is not available, the Stanford server may stop with an error, or thrashing may cause the program to run very slowly.
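One way to guard against the memory problem is to check available RAM before starting the server. The sketch below parses text in the format of Linux's /proc/meminfo; the helper names and the default threshold of 4 (interpreted here as GiB) are assumptions for this example and are not part of the project.

```python
def available_ram_gib(meminfo_text):
    """Return MemAvailable in GiB from /proc/meminfo-style text, or None."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # The value is reported in kB (treated as KiB by the kernel).
            kib = int(line.split()[1])
            return kib / (1024 ** 2)
    return None


def enough_ram_for_coref(meminfo_text, required_gib=4.0):
    """True if the reported available memory meets the assumed threshold."""
    available = available_ram_gib(meminfo_text)
    return available is not None and available >= required_gib
```

On a Linux machine the text can be obtained by reading the file /proc/meminfo, and coreference resolution can be skipped when the check fails.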
python3 knowledge_graph.py <options>
options:
e.g.:
python3 knowledge_graph.py optimized verbose nltk spacy
This outputs NER results from NLTK and spaCy, and performs coreference resolution.
The unstructured input files must be in the ./data/input folder, i.e. the data folder must be in the same directory as knowledge_graph.py.
data/output/ner --- contains recognised named entities
data/output/caches --- intended to contain pickled coreference results obtained from Stanford's CoreNLP
data/output/kg --- contains input files with coreferences resolved
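The folder layout described above can be prepared and inspected with a short helper. This is a sketch only: the function name `ensure_data_layout` is hypothetical, and whether the project's scripts create the output folders themselves is not stated here, so creating them up front is an assumption.

```python
from pathlib import Path

# Output subfolders named in this README.
OUTPUT_SUBDIRS = ("ner", "caches", "kg")


def ensure_data_layout(project_dir="."):
    """Create data/input and data/output/* if missing; return the input files."""
    data_dir = Path(project_dir) / "data"
    input_dir = data_dir / "input"
    input_dir.mkdir(parents=True, exist_ok=True)
    for name in OUTPUT_SUBDIRS:
        (data_dir / "output" / name).mkdir(parents=True, exist_ok=True)
    # Only regular files are treated as input documents.
    return sorted(p for p in input_dir.iterdir() if p.is_file())
```

Running it from the directory that holds knowledge_graph.py and finding an empty list returned is a quick sign that no input files have been placed in ./data/input yet.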