icebreaker-science / network

Exploring the network of concepts and methods in chemistry using text mining techniques

Merge different spellings and abbreviations #1

Closed chaoran-chen closed 3 years ago

chaoran-chen commented 3 years ago

We have many keywords that we would like to merge. For example:

  1. Singular vs. plural: microplastic, microplastics
  2. One word vs. two words: micro plastics, microplastics
  3. Abbreviations: nuclear magnetic resonance, nmr
  4. Very similar words: phylogeny, phylogenetics

I think that we should at least be able to detect cases (1) and (2) automatically. With the help of the alias information from Wikidata, it should also be possible to resolve (3) to a certain degree. (4) is probably difficult?
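
A minimal sketch of how cases (1) and (2) could be detected, assuming the keywords are available as plain strings. The functions `merge_key` and `merge_candidates` are hypothetical names, and the plural handling is a crude heuristic rather than a real stemmer:

```python
from collections import defaultdict


def merge_key(keyword: str) -> str:
    """Return a crude normalisation key for a keyword.

    Assumption: lower-casing, dropping spaces and hyphens, and stripping a
    trailing plural "s" is enough to catch cases (1) and (2). Irregular
    plurals are not handled.
    """
    key = keyword.lower().replace("-", "").replace(" ", "")
    if key.endswith("ies") and len(key) > 4:
        key = key[:-3] + "y"
    elif key.endswith("s") and not key.endswith("ss") and len(key) > 3:
        key = key[:-1]
    return key


def merge_candidates(keywords):
    """Group keywords that share a key; only groups of two or more are returned."""
    groups = defaultdict(list)
    for kw in keywords:
        groups[merge_key(kw)].append(kw)
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

With the examples from the list above, "microplastic", "microplastics" and "micro plastics" end up in one group, while "phylogeny" and "phylogenetics" stay separate, which matches the expectation that (4) needs a different approach.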

The sub-tasks are:

michael-kamel commented 3 years ago

I think we can use NLP libraries to do this

michael-kamel commented 3 years ago

I think we should take a step back and assess why we need to merge (possibly) similar nodes. There are two main possible reasons:

  1. Reducing the network size.
  2. Convenience for the user and preventing redundancy.

For 1, 2 and 3 the problem seems trivial, but for 4 we might end up with invalid merges. Speaking from a CS perspective, some similar words are used interchangeably outside a specific context but take on two different meanings once they are put into a specific context. In other words, similarity is not always binary.

I can start with the first subtask of identifying the similar keywords, maybe using Wikidata alongside stemming. If reducing the network size is the only thing we need to do, then maybe we do not need to go too far with this.
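
For the Wikidata part, a rough sketch could look like the following. It assumes the labels and aliases have already been exported into `(label, aliases)` pairs; that input shape and both function names are assumptions for illustration, not the project's actual data format:

```python
from collections import defaultdict


def build_alias_index(entries):
    """Map every label and alias (lower-cased) to its lower-cased label.

    `entries` is assumed to be an iterable of (label, aliases) pairs.
    If two entries share an alias, the first one seen wins.
    """
    index = {}
    for label, aliases in entries:
        canonical = label.lower()
        for name in (label, *aliases):
            index.setdefault(name.lower(), canonical)
    return index


def abbreviation_candidates(keywords, alias_index):
    """Group keywords that resolve to the same label via the alias index."""
    by_canonical = defaultdict(set)
    for kw in keywords:
        canonical = alias_index.get(kw.lower())
        if canonical is not None:
            by_canonical[canonical].add(kw)
    return {c: sorted(kws) for c, kws in by_canonical.items() if len(kws) > 1}
```

Keywords that do not appear in the index are simply left alone, so this only ever proposes merges that the alias data supports.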

chaoran-chen commented 3 years ago

Yes, I agree with you. I think that we should do 1, 2 and 3 first. If possible, we could create suggestions for mergers corresponding to (4) and let our expert team decide.

chaoran-chen commented 3 years ago

I think that it would in general be a good idea to design the merging script in a way that a human can look into the identified potential mergers, reject suggestions and manually select nodes to be merged.
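
One possible shape for this, sketched under the assumption that a CSV file is an acceptable review format: the script writes one row per keyword with an empty `decision` column, a human fills in `accept` where appropriate, and only accepted rows are read back. The column names and function names are hypothetical:

```python
import csv


def suggestion_rows(groups):
    """Build CSV rows (header included) with an empty decision column."""
    rows = [["group_id", "keyword", "decision"]]
    for group_id, group in enumerate(groups):
        for keyword in group:
            rows.append([group_id, keyword, ""])
    return rows


def write_suggestions(groups, path):
    """Write the suggested mergers to a CSV file for manual review."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(suggestion_rows(groups))


def read_accepted(lines):
    """Return the groups whose rows a reviewer marked with 'accept'.

    `lines` can be an open file or any iterable of CSV lines. A group is
    kept only if at least two of its keywords were accepted.
    """
    accepted = {}
    for row in csv.DictReader(lines):
        if row["decision"].strip().lower() == "accept":
            accepted.setdefault(int(row["group_id"]), []).append(row["keyword"])
    return [keywords for keywords in accepted.values() if len(keywords) > 1]
```

A reviewer can also add rows by hand with a new `group_id` to select nodes that the script did not suggest.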

michael-kamel commented 3 years ago

Here are some useful findings

chaoran-chen commented 3 years ago

Endoscopic ultrasound vs. ultrasound is a nice example. Maybe we could also use these prefix-structures to find a partial hierarchy. How many keywords are there that are a prefix/suffix of another keyword?

michael-kamel commented 3 years ago

> Endoscopic ultrasound vs. ultrasound is a nice example. Maybe we could also use these prefix-structures to find a partial hierarchy. How many keywords are there that are a prefix/suffix of another keyword?

If I didn't mess up the query, the exact number would be 7731. Edit: just for clarification, this is the number of keywords that are contained in another keyword, whether as a prefix, a suffix or both.
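
For reference, a count like this could be reproduced outside the database roughly as follows. This is only a sketch that assumes word-level matching (so "ultrasound" counts as a suffix of "endoscopic ultrasound"); it is not the query that produced the number above, and `contained_keywords` is a hypothetical name:

```python
def contained_keywords(keywords):
    """Return the keywords that are a word-level prefix or suffix of another keyword."""
    token_lists = {kw: tuple(kw.lower().split()) for kw in set(keywords)}

    # Collect every proper prefix and suffix (in whole words) of every keyword.
    affixes = set()
    for tokens in token_lists.values():
        for i in range(1, len(tokens)):
            affixes.add(tokens[:i])
            affixes.add(tokens[i:])

    return {kw for kw, tokens in token_lists.items() if tokens in affixes}
```

Mapping each contained keyword back to the longer keywords it appears in would give the partial hierarchy mentioned above.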

michael-kamel commented 3 years ago

We need to discuss how we are going to merge the data once the merge candidates have been determined.

@chaoran-chen I need your opinion on this.

chaoran-chen commented 3 years ago

No, I don't think that we have to merge any Wikidata data. The Wikidata data is only there to help us. We use it as long as we can, and where it doesn't contribute to our goal, we just ignore it.

Given two keywords to merge, we have to merge (1) their rows in the keyword table in PostgreSQL and (2) the corresponding nodes in Neo4j. Additionally, we should keep the different spellings: we could just add another column to the keyword table.

vordemann commented 3 years ago

This issue has nothing to do with Wikidata, and we do not plan to merge Wikidata into the keywords for our network. See it more as additional information to improve our network structure.

Issues https://github.com/icebreaker-science/network/issues/2 and https://github.com/icebreaker-science/network/issues/5 explain how we want to use Wikidata for categorizing our network keywords. There is no direct dependency between the categorization and the merging, except that we have to rerun the categorization method after we have merged keywords and changed their names.

If there are different names and abbreviations for the same node that are worth keeping, proceed as follows:

  1. Decide on the keyword that should represent the remaining node.
  2. Add all other names to a list, which you attach to the remaining node.
  3. Merge all other nodes into the remaining node.
  4. Rerun the categorization.
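
Steps 1 to 3 can be sketched on an in-memory graph as below. This is an illustration only: it assumes undirected edges stored as sorted name pairs with a numeric weight that is summed on merge, which may not match the actual Neo4j model, and `merge_nodes` is a hypothetical helper:

```python
def merge_nodes(edges, aliases, keep, merge):
    """Merge the nodes in `merge` into `keep`.

    edges:   dict mapping a sorted (a, b) name pair to a weight (assumed model)
    aliases: dict mapping a node name to its list of alternative names
    Returns new (edges, aliases); the inputs are left unchanged.
    """
    rename = {name: keep for name in merge}

    # Step 3: redirect edges to the remaining node and sum duplicate weights.
    merged_edges = {}
    for (a, b), weight in edges.items():
        a, b = rename.get(a, a), rename.get(b, b)
        if a == b:
            continue  # an edge between two merged nodes would become a self-loop
        key = tuple(sorted((a, b)))
        merged_edges[key] = merged_edges.get(key, 0) + weight

    # Step 2: keep the old names (and their own aliases) on the remaining node.
    merged_aliases = {k: list(v) for k, v in aliases.items() if k not in rename}
    names = merged_aliases.setdefault(keep, [])
    for name in merge:
        for alias in [name, *aliases.get(name, [])]:
            if alias != keep and alias not in names:
                names.append(alias)

    # Step 4 (rerunning the categorization) is outside the scope of this sketch.
    return merged_edges, merged_aliases
```

Returning new dictionaries instead of mutating the inputs makes it easy to compare the graph before and after a merge while reviewing.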

If a person now searches for a specific keyword in the network graph but uses an abbreviation, the lookup function should also check the list of abbreviations and return the node if there is a match. If this turns out to be too expensive for database requests, one could instead create a new abbreviation table that stores this n:1 mapping, and the lookup function would then check this table.
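
A sketch of the separate-table variant, using an in-memory SQLite database as a stand-in for PostgreSQL so that it runs without a server. The table and column names (`keyword`, `keyword_alias`) are assumptions and not the project's actual schema:

```python
import sqlite3


def create_schema(conn):
    """Create a keyword table and an n:1 alias table (hypothetical schema)."""
    conn.executescript(
        """
        CREATE TABLE keyword (
            id   INTEGER PRIMARY KEY,
            name TEXT UNIQUE NOT NULL
        );
        CREATE TABLE keyword_alias (
            alias      TEXT PRIMARY KEY,
            keyword_id INTEGER NOT NULL REFERENCES keyword(id)
        );
        """
    )


def lookup(conn, term):
    """Return the main keyword name for a search term, or None if unknown.

    The term is matched against the keyword name first and the alias
    table second; names and aliases are assumed to be stored lower-cased.
    """
    row = conn.execute(
        """
        SELECT k.name FROM keyword AS k WHERE k.name = :term
        UNION
        SELECT k.name FROM keyword AS k
        JOIN keyword_alias AS a ON a.keyword_id = k.id
        WHERE a.alias = :term
        """,
        {"term": term.lower()},
    ).fetchone()
    return row[0] if row else None
```

Making `alias` the primary key enforces the n:1 direction: each alias points to exactly one keyword, while a keyword can have many aliases.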

This is done under the assumption that all the different names sit at the same level of the hierarchy within the Wikidata datasets, so that we can base the categorization on the main keyword name only.