Experiment to find record similarity based on text analysis

pvgenuchten commented 1 month ago

Fetch harmonized records from api or database
use an algorithm which detects similarity between 2 records based on title, abstract, keywords
consider translated content and make sure records are not describing same source (share ID)

Can be used to identify similar records (‘more like this’) or records describing the same source

DoD:

[x] extract 100 items from API
[x] convert json to pandas dataframe
[x] select title, abstract, keywords and make 1 'embedding' field
[x] convert 'embedding' field to vector using transformer (BERT, LLM, ... on tensoflow hub, huggingface, ...)
[x] compare embeddings of items 1-1 using e.g. cosinus similarity and determine similarity metric
[x] order dataframe with item combinations based on highest similarity metric
[ ] determine thresholds of similarity metric for 'duplicate' and 'similarity' (probably not possible with small sample set of 100 items)

BerkvensNick commented 1 month ago

I have used a jupyter notebook (will add the code to the repository) to

download 50 records from the database api,,
select title, abstract, keywords and make 1 'embedding' field
convert 'embedding' field to vector using a BERT transformer model from huggingface
compare embeddings of items 1-1 using e.g. cosinus similarity and determine similarity metric
order the dataframe with item combinations based on highest similarity metric

the results are in the associated excell file

duplicate_datasheet_soilwise.xls

BerkvensNick commented 1 month ago

@pvgenuchten I did a first experiment (results and description of methodology in previous comment), but was not able to set up the 'pagination' to extract more than 50 records, what is the parameter I have to use for the api to get to the next page? And is it correct that the limit is per 50 records?

pvgenuchten commented 1 month ago

The parameters to manage pagination are /items?offset=5&limit=5

BerkvensNick commented 1 month ago

Ok, thanks @pvgenuchten !! I adjusted the code using these parameters and was able to extract 500 records, I have generated the mutual similarity values between all 500 records, the excell is too large to upload (540MB and 249500 rows), I have uploaded a subsample of the top 5000 rows. Have also uploaded modified notebook to repository.

The highest similarity= 0.9999997, this is for 2 records (68749995-c4bf-4f80-94e5-43c2291c99be and 5187f8c5-38ef-4b07-bc26-a5e257a8ef59) with different id's without title and description and with the same keywords; guess this is a data quality issue from the source.

when going through the results the top similarity metrics look good, e.g.

similarity = 0.99996 for combination 49bebaf8-bae4-4748-8e5c-ce80c0406953 and 56fcf114-1c1e-46ac-b21a-b43ff7441335 with titles "SUSALPS temperature and volumetric soil water content Graswang Subplot 2 in Fendt extensiv" and "SUSALPS temperature and volumetric soil water content Graswang Subplot 1 in Fendt extensiv" (description and keywords the same),

or similarity= 0.99991 for combination 56fcf114-1c1e-46ac-b21a-b43ff7441335 and 07388e86-f38b-469a-9910-6e24af66bbf5 with titles "SUSALPS temperature and volumetric soil water content Graswang Subplot 1 in Fendt extensiv" and "SUSALPS temperature and volumetric soil water content Graswang Subplot 1 in Fendt intensiv" and descriptions "..... This dataset contains daily average soil temperature and volumetric soil water content in 5 and 15 cm depth. Treatment: Graswang Subplot 1 in Fendt extensiv"" and " .... This dataset contains daily average soil temperature and volumetric soil water content in 5 and 15 cm depth. Treatment: Graswang Subplot 1 in Fendt intensiv Device: Decagon 5TM Timescale: Daily average Depths: 5 and 15 cm" (keywords the same).

However when going to the lower rows, the similarity seems quite high in my opinion, this could be due to the fact that many (scientific) 'niche' words are being used which the current model (distilbert) doesn't contain, other transformers or LLM's might perform better in this respect

Maybe would be good to have @wbcbugfree or @Max-at-Vlaanderen also have a look at the code/approach for input/modifications/suggetions

I guess next steps could be:

[ ] experiment with the code on test set containing actual duplicates? (can we harvest duplicates or do we manipulate a test set to have duplicates?)
[ ] have someone with more 'business' knowledge go through the results to evaluate

other ideas/requirements?

top1000_duplicate_datasheet_soilwise_500_records.xls

Max-at-Vlaanderen commented 1 month ago

Hi, I took a quick look at the results. Looks like indeed what we can expect from an embedding model. I do fear that we won't be able to use this approach is to use it directly for duplication detection. The essence of these models is to generalise to a "topicvector" about what the text is about. In my view, duplicate detection goes a bit further. -> is an approach of compartmentalised exact matches not better in this? (author with author, title with title, description with description,...) and then come up with our own score to be able to conclude whether this is a duplica, a duplica with gaps, or a similar study.

Other than that, I think this approach will be ideal for search queries and recomendation systems.

BerkvensNick commented 1 month ago

thanks @Max-at-Vlaanderen , I think your suggestion for duplicate detection is quite interesting!

I also looked at Euclidean distance (couldn't find that algorithm you suggested @Max-at-Vlaanderen), I see some differences in the order of the similar-pairs, but not that big in my opinion when looking at the most similar. @robknapen do you have any knowledge on which similarity algorithm is most suitable to identify the similarity between records based on the title - description - keywords? duplicate_datasheet_soilwise_euclidean.xls duplicate_datasheet_soilwise_cosinesim.xls

wbcbugfree commented 1 month ago

Hi Nick, I think the demo is great. But I have the same suggestion as Max, i.e. we should embed authors, titles, and descriptions of metadata records and match them separately. This way only three items with high similarity (above a certain threshold) can be judged as duplicates, the others should be judged as similar. Besides, for the last entry in the DoD list, I personally don't think it's a good idea to set a one-size-fits-all threshold to distinguish between similarities and duplicates. Embedding text and calculating similarity can be used to locate potentially similar and duplicate metadata records. After we have successfully located it, we may use some more precise ways to determine if it is a duplicate or similar. Some traditional pattern-based algorithms (e.g. n-gram similarity) and even simple regular expressions have the opportunity to accomplish this second step of judgment. Thanks for mentioning me, I hope this helps.

BerkvensNick commented 1 month ago

@pvgenuchten and @roblokers , I have worked further on the suggestion by Max and Beichem on calculating the similarity metrics for 'title', 'description' and keywords' each separately to be able to better evaluate duplicates. The notebook is added to the repository. I analyzed 500 records and only kept pairs where at least one of the 3 fields had similarity 1 (to reduce size of the file), the results are in the attached file

duplicate_datasheet_soilwise_cosinesim.xls

robknapen commented 2 weeks ago

Hi, I would also say that the different fields need to be processed and compared based on the characteristics/type of the data. if there is sufficient text in a field an embedding might help to figure out semantic similarity (dot product is also an often used algorithm, but similar to cosine). Then to find actual duplicates probably other rules would apply than in case you want to find similar records from a collection. There is knowledge (literature :-) ) on duplicate record removal from databases, and how this can be done. But no perfect solution as far as I know, working with text will always be tricky due to its nature.

soilwise-he / similarity-finder

Experiment to find record similarity based on text analysis #8