isi-vista / cdse-covid

Claim detection & semantic extraction (Covid-19 domain)

Investigation into Semantic Similarity for Claim Detection #97

Open joecummings opened 2 years ago

joecummings commented 2 years ago

Q: How close in vector space are all the COVID-19 names?

A: (name: similarity score, in the order listed)

COVID-19: 1.0000
Coronavirus: 0.7594
SARS-CoV-2: 0.7089
coronavirus: 0.7594
corona virus: 0.7195
covid-19: 1.0000
covid 19: 0.9592

Most likely, these scores aren't close enough, so all COVID names should be replaced with a single canonical reference.
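A minimal sketch of that normalization step, assuming a regex pass over the raw text before embedding. The pattern, the choice of "COVID-19" as the canonical form, and the function name `normalize_covid_names` are illustrative assumptions, not the project's actual code; the pattern only covers the seven variants listed above.

```python
import re

# Assumed canonical form; any single consistent reference would do.
CANONICAL = "COVID-19"

# Covers the variants from the comparison above:
# COVID-19 / covid-19 / covid 19, SARS-CoV-2, Coronavirus / corona virus.
_COVID_PATTERN = re.compile(
    r"\b(?:covid[\s-]?19|sars[\s-]?cov[\s-]?2|corona\s?virus)\b",
    re.IGNORECASE,
)


def normalize_covid_names(text: str) -> str:
    """Replace every known COVID name variant with the canonical form."""
    return _COVID_PATTERN.sub(CANONICAL, text)
```

Running this before computing embeddings would mean every variant maps to the same surface form, so lexical differences between the names no longer affect the similarity score.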

joecummings commented 2 years ago

High-level thoughts: with all COVID-19 variations normalized, semantic similarity is fairly decent at finding sentences that contain claims related to the templates used; however, extra parsing is needed in order to find the claim within the sentence.

joecummings commented 2 years ago

Possible idea: still use ClaimBuster to extract the initial claims, then use semantic similarity (SS) to match the topics to the templates.
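A rough sketch of the matching half of that idea, assuming the claims have already been extracted upstream. Everything here is hypothetical: `toy_embed` is a bag-of-words stand-in for a real sentence encoder (it is purely lexical, so it only illustrates the control flow), the function names are invented, and the template strings in the usage note are made up for illustration.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: lowercase token counts (NOT a real sentence encoder)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[token] for token, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_template(claim, templates, embed=toy_embed):
    """Return the (template, score) pair most similar to an extracted claim."""
    claim_vec = embed(claim)
    scored = [(t, cosine(claim_vec, embed(t))) for t in templates]
    return max(scored, key=lambda pair: pair[1])
```

For example, with the hypothetical templates `["COVID-19 originated in X", "Wearing masks prevents X"]`, calling `best_template("Wearing masks prevents infection", templates)` picks the second template. Swapping `embed` for a real encoder would be the obvious next step.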

joecummings commented 2 years ago

PhraseBERT might actually work better than raw SentenceBERT b/c it is less dependent on lexical similarity as a metric. (need to look into this more.)

joecummings commented 2 years ago

PhraseBERT appears to run much slower than standard SentenceBERT. Is this because it chunks the sentence into phrases during decoding?

joecummings commented 2 years ago

Need to replace X with something neutral.

joecummings commented 2 years ago

Running SS over chunks from 10 documents:

Process:

Ex: "The video even goes as far as accusing France of creating the virus and releasing it in Wuhan" -->

"The video even goes"
"as far as accusing"
"France of creating the"
"virus and releasing it"
"in Wuhan [PAD] [PAD]"
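The chunking in the example can be sketched as below, assuming whitespace tokenization, a fixed chunk size of 4, and a literal "[PAD]" filler for the last chunk. The function name `chunk_tokens` and its parameters are assumptions made for illustration, not the code that was actually run.

```python
def chunk_tokens(sentence, size=4, pad="[PAD]"):
    """Split a sentence into fixed-size word chunks, padding the final one."""
    tokens = sentence.split()
    chunks = []
    for start in range(0, len(tokens), size):
        chunk = tokens[start:start + size]
        # Pad the last chunk so every chunk has exactly `size` tokens.
        chunk += [pad] * (size - len(chunk))
        chunks.append(" ".join(chunk))
    return chunks
```

Applied to the example sentence, this yields five chunks, the last being "in Wuhan [PAD] [PAD]".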

Result:

ss_4_chunks.csv