Merge different spellings and abbreviations

chaoran-chen commented 4 years ago

We have many keywords that we would like to merge. For example:

Singular vs. plural: microplastic, microplastics
One word vs. two words: micro plastics, microplastics
Abbreviations: nuclear magnetic resonance, nmr
Very similar words: phylogeny, phylogenetics

I think that we should at least be able to automatically detect cases of (1) and (2). With the help of the alias information of Wikidata, it should further also be possible to resolve (3) up to a certain degree. (4) is probably difficult?

The sub-tasks are:

[ ] Identify mergeable keywords
[ ] Merge the data in the keyword table in PostgreSQL
[ ] Merge the nodes in the Neo4j database

michael-kamel commented 4 years ago

I think we can use NLP libraries to do this

michael-kamel commented 4 years ago

I think we should take a step back and assess why we need to merge (possibly) similar nodes, there are 2 possible main reasons

Reducing the network size.
Convenience for the user and preventing redundancy.

For 1,2,3 the problem seems trivial, but for 4, We might possibly end up with invalid merges. Additionally, I can speak from a CS perspective here where some similar words are used interchangeably when not used in a specific context, however, once put in specific contexts, they have 2 different meanings. In other words, similarity is not always binary.

I can start with the first subtask of identifying the similar keywords, maybe using Wikidata alongside stemming. If reducing the network size is the only thing we need to do, then maybe we do not need to go to far with this.

chaoran-chen commented 4 years ago

Yes, I agree with you. I think that we should do 1, 2 and 3 first. If possible, we could create suggestions for mergers corresponding to (4) and let our expert team decide.

chaoran-chen commented 4 years ago

I think that it would in general be a good idea to design the merging script in a way that a human can look into the identified potential mergers, reject suggestions and manually select nodes to be merged.

michael-kamel commented 4 years ago

Here are some useful findings

Some similar keywords have different aliases(alias info is not guaranteed to be reliable I guess)
ultrasound vs endoscopic ultrasound (additional context)
*matrix metalloproteinases' vs 'matrix metalloproteinase inhibitors' alkali metal alloys vs alkali metal ultrasound vs ultrasound imaging metal catalysis vs metal catalyst(thing vs process?) 13c nmr vs 13c nmr spectroscopy vs 17o nmr vs 2d nmr vs mas nmr vs pfg nmr etc... (Requires expert opinion)
Out of context stuff. If we merge the keywords' wiki data we might end up with completely out of context stuff. Maybe we should drop wikidata entries that have no publications?
I can't speak for chemistry here but some keywords might be similar in some contexts and different for others. For example, in cs people sometimes use words like concurrent and parallel interchangeably if the context is about hardware design
Some very specific keywords materials chemistry2506 metals and alloys (yes it is one keyword) or matrix assisted laser desorption ionization time of flight mass spectrometry longest keyword we have The problem here would be how to split effectively. On extreme side would be not to split which would do nothing essentially but ensure that we don't have misleading data, the other extreme would be to split all keywords by spaces and word separators in general, but we would end up with a ton of keywords that are useless on their own like. We can still use NLP libraries to remove useless words like 'of' 'and' and so on, but I don't think we can do this and yield acceptable results.
The following the is distribution of keywords we have based on how many words they have separated by space
- 10: 1
- 9: 0
- 8: 1
- 7: 1
- 6: 8
- 5: 67
- 4: 469
- 3: 3223
- 2: 19100
- 1: 13517
It's tempting to try and split the words to have a much smaller network at the end but some contexts are essential. maybe endoscopic ultrasound is a thing and ultrasound on its own is definitely a thing.

chaoran-chen commented 4 years ago

Endoscopic ultrasound vs. ultrasound is a nice example. Maybe we could also use these prefix-structures to find a partial hierarchy. How many keywords are there that are a prefix/suffix of another keyword?

michael-kamel commented 4 years ago

Endoscopic ultrasound vs. ultrasound is a nice example. Maybe we could also use these prefix-structures to find a partial hierarchy. How many keywords are there that are a prefix/suffix of another keyword?

If I didn't mess up the query, the exact number would be 7731 Edit: just for clarification, this is the number of keywords that are a contained in another keyword whether they are a prefix, suffix or both.

michael-kamel commented 4 years ago

We need to discuss how are we gonna merge the data once merge candidates have been determined.

Do we merge wikidata? and if so how should we do that?
Which data pieces should we keep/discard?
Naturally, for some of the merged keywords, it should still be possible to look them up in search.

@chaoran-chen I need your opinion on this.

chaoran-chen commented 4 years ago

No, I don't think that we have to merge any wikidata. The wikidata are only there to help us. We use it as long as we can and if they don't contribute to our goal, just ignore them.

Giving two keywords to merge, we have to merge them (1) in the keyword table in PostgreSQL and (2) the corresponding nodes in Neo4j. Additionally, we should keep the different spellings: we could just add another column to the keyword table.

vordemann commented 4 years ago

This issue has nothing to do with the wikidata and we do not plan to merge wikidata into the keywords for our network. See it more as additional information to improve our network structure.

Issue https://github.com/icebreaker-science/network/issues/2 and https://github.com/icebreaker-science/network/issues/5 explain how we want to use the wikidata for categorizing our network keywords. There is no direct dependency between the categorization and the merging, just that we have to rerun the categorization method after we merged and changed names of. keywords.

If there are different names and abbreviations for the same node worth keeping proceed the following way.

Decide on the keyword that should resemble the remaining node.
Add all other names to a list which you add to the remaining node
Merge all other nodes with the remaining node.
Rerun categorization

If a person now searches for a specific keyword in the network graph but uses an abbreviation, the lookup function should also look into the list of abbreviations and return a node if it is a match. If this implementation however is too demanding for database request, one could also create a new abbreviation table stores this n:1 mapping and the lookup function must then check this table instead.

This is done under the assumption that all the different names are sitting in the same hierarchy within the wikidata datasets and that we therefore only base categorization on the main keyword name.

icebreaker-science / network

Merge different spellings and abbreviations #1