Investigation into Semantic Similarity for Claim Detection

joecummings commented 2 years ago

Q: How close in vector space are all the COVID-19 names?

A:
COVID-19: 1.0000
Coronavirus: 0.7594
SARS-CoV-2: 0.7089
coronavirus: 0.7594
corona virus: 0.7195
covid-19: 1.0000
covid 19: 0.9592

Most likely, this isn't close enough, so all COVID names should be replaced with a single canonical name.
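
A minimal sketch of what that replacement could look like, assuming a simple regex pass is enough. The pattern only covers the surface forms compared above, and the function name `normalize_covid_names` is hypothetical, not something from the repo.

```python
import re

# Assumed list of surface forms: only the variants compared in this thread.
# Matching is case-insensitive, so "Coronavirus" and "coronavirus" both hit.
_COVID_PATTERN = re.compile(
    r"\b(?:sars-cov-2|covid[- ]?19|corona ?virus)\b",
    flags=re.IGNORECASE,
)


def normalize_covid_names(text: str, canonical: str = "COVID-19") -> str:
    """Replace every known COVID-19 surface form with one canonical name."""
    return _COVID_PATTERN.sub(canonical, text)


if __name__ == "__main__":
    print(normalize_covid_names("The corona virus (SARS-CoV-2) causes covid 19."))
```

The same function would have to be applied to both the templates and the sentences so that the two sides share one name.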

joecummings commented 2 years ago

High-level thoughts: with all COVID-19 variations normalized, semantic similarity is fairly decent at finding sentences that contain claims related to the templates used; however, extra parsing is needed in order to find the claim within the sentence.

joecummings commented 2 years ago

Possible Idea: Still use Claimbuster to extract the initial claims, then use semantic similarity (SS) to match the topics to the templates.

joecummings commented 2 years ago

PhraseBERT might actually work better than raw SentenceBERT b/c it is less dependent on lexical similarity as a metric. (need to look into this more.)

joecummings commented 2 years ago

PhraseBERT appears to run much slower than standard SentenceBERT. Is this because it chunks the sentence into phrases during decoding?

joecummings commented 2 years ago

Need to replace the X placeholder in the templates with something neutral.

joecummings commented 2 years ago

Running SS over chunks from 10 documents:

Process:

Replace all variations of the covid name with COVID-19 in both templates and sentences.
Replace all X in the templates with someone or something.
Chunk each sentence into groups of four tokens, padding the last chunk with [PAD] if it has fewer than four.
Encode with SentenceBERT and calculate the cosine similarity between each template and each chunk.
If the similarity is above 0.67 (chosen empirically), accept it as a valid claim match.
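
The placeholder replacement and the thresholding steps above could be sketched roughly as follows. This is an illustration under assumptions, not the code that produced the results: `fill_template`, `match_chunks`, and the `encode` callable are hypothetical names, and the default filler "something" is one reading of "someone or something". The encoder is passed in as a function so the sketch runs without any model installed.

```python
import math
import re
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def fill_template(template: str, filler: str = "something") -> str:
    """Replace the standalone placeholder X with a neutral filler (assumed)."""
    return re.sub(r"\bX\b", filler, template)


def match_chunks(
    templates: List[str],
    chunks: List[str],
    encode: Callable[[str], Vector],
    threshold: float = 0.67,  # the empirically chosen cutoff from this thread
) -> List[Tuple[str, str, float]]:
    """Return (template, chunk, score) for every pair scoring above threshold."""
    template_vecs = [encode(t) for t in templates]
    chunk_vecs = [encode(c) for c in chunks]
    matches = []
    for template, t_vec in zip(templates, template_vecs):
        for chunk, c_vec in zip(chunks, chunk_vecs):
            score = cosine_similarity(t_vec, c_vec)
            if score > threshold:
                matches.append((template, chunk, score))
    return matches
```

In practice `encode` would presumably wrap a SentenceBERT model, for example the `encode` method of a `SentenceTransformer` instance from the sentence-transformers library; which model was used here is not stated in this thread.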

Ex: "The video even goes as far as accusing France of creating the virus and releasing it in Wuhan" --> "The video even goes" "as far as accusing" "France of creating the" "virus and releasing it" "in Wuhan [PAD] [PAD]"
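
A small sketch of the chunking step that reproduces the example above, assuming tokens are split on whitespace (the tokenizer actually used is not stated here). The function name `chunk_tokens` is hypothetical.

```python
from typing import List


def chunk_tokens(sentence: str, size: int = 4, pad: str = "[PAD]") -> List[str]:
    """Split a sentence into chunks of `size` whitespace tokens.

    The last chunk is filled with `pad` tokens when the sentence length
    is not a multiple of `size`.
    """
    tokens = sentence.split()
    remainder = len(tokens) % size
    if remainder:
        tokens = tokens + [pad] * (size - remainder)
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]


if __name__ == "__main__":
    sentence = (
        "The video even goes as far as accusing France of creating "
        "the virus and releasing it in Wuhan"
    )
    for chunk in chunk_tokens(sentence):
        print(chunk)
```

Fixed-size chunks cut across phrase boundaries (for example "France of creating the"), which may partly explain why many matches hinge on a single shared term.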

Result:

158 "claims" from 10 documents
A large share appear to match only because they mention COVID-19
Most commonly matched category: "SARS-CoV-2 is X"

ss_4_chunks.csv

isi-vista / cdse-covid

Investigation into Semantic Similarity for Claim Detection #97