icebreaker-science / network

Exploring the network of concepts and methods in chemistry using text mining techniques

Prep network data for visualization #5

Open vordemann opened 3 years ago

vordemann commented 3 years ago

The goal of this issue is to produce a first clustered dataset for visualizing the networks. To achieve that, the following steps are needed.

  1. This prep step is optional: it reduces the amount of data but is not necessary. It depends on the rollback functionality in https://github.com/icebreaker-science/network/issues/2. Use it to remove obviously useless categories and then roll the changes back into our wikidata. After that, update the mapping in keyword_wikidata with https://github.com/icebreaker-science/network/blob/master/postgresql/scripts/wikidata.sql
  2. The starting point is the table keyword_wikidata, which maps the network's keywords to the wikidata together with their parents over 10 iterations. This is a 1:n mapping, since a keyword may appear multiple times in the wikidata.
  3. Write a script that takes predefined categories as input and iterates over all the mappings in keyword_wikidata. For each keyword, it checks whether any of the category names appears in the keyword's parents list. On a match, it adds the keyword-category relation to a new table called keyword_categories. A keyword in the network can belong to several categories. If a keyword in the network does not match any of the provided categories, add it to the artificial category "Others". This is a temporary fix and will be addressed in a later issue.

The category list may change in the future; for now it is: [biomolecule | chemical substance | metal | process | analytical method | biochemical relation | property]
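
The matching logic of step 3 could look roughly like the sketch below. It is only an illustration under a few assumptions: the rows of keyword_wikidata are taken to be already loaded as (keyword, parents) pairs, where parents is a list of label strings; matching is an exact, case-insensitive comparison of category names against those labels; and the function name `assign_categories` is made up for this example. Reading from and writing to PostgreSQL is left out.

```python
from collections import defaultdict

# Category list from this issue; it may change later.
CATEGORIES = [
    "biomolecule", "chemical substance", "metal", "process",
    "analytical method", "biochemical relation", "property",
]
FALLBACK = "Others"  # artificial category for keywords without any match


def assign_categories(mappings, categories=CATEGORIES):
    """Return (keyword, category) rows for a keyword_categories table.

    `mappings` is an iterable of (keyword, parents) pairs. The layout is an
    assumption: one pair per keyword_wikidata row, parents as label strings.
    """
    wanted = {c.lower(): c for c in categories}
    found = defaultdict(set)
    for keyword, parents in mappings:
        found[keyword]  # register the keyword even if nothing matches
        for parent in parents:
            category = wanted.get(parent.strip().lower())
            if category is not None:
                found[keyword].add(category)

    rows = []
    for keyword, cats in found.items():
        # 1:n mapping: a keyword gets "Others" only if none of its rows matched.
        for category in sorted(cats) or [FALLBACK]:
            rows.append((keyword, category))
    return rows


# Made-up sample rows, for illustration only.
sample = [
    ("gold", ["noble metal", "metal", "chemical substance"]),
    ("gold", ["color"]),
    ("titration", ["Analytical Method"]),
    ("foo", ["bar"]),
]
result = assign_categories(sample)
```

Because the mapping is 1:n, the sketch collects matches per keyword across all of its rows before deciding on "Others", so a keyword with one matching and one non-matching wikidata entry is not placed in the fallback category.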