Closed jaklinger closed 4 years ago
nih_vectors breaks because querying for AbstractVector.application_id in nih_vectors.py:41 returns `[]`, which can't be unpacked by `done_ids, =`
Ah, thanks for catching this. "Fixed" it with an `except ValueError` for this case.
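The fix can be sketched like this (hypothetical names; the real query lives in nih_vectors.py:41):

```python
def get_done_ids(rows):
    """Unpack (application_id,) tuples from a query result,
    tolerating an empty result set.

    Sketch only: `rows` stands in for the result of querying
    AbstractVector.application_id.
    """
    try:
        # zip(*[]) yields nothing, so this raises ValueError on []
        done_ids, = zip(*rows)
    except ValueError:
        done_ids = ()
    return set(done_ids)
```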
`faiss-cpu -c pytorch`
Thanks, updated the reqs
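For reference, the CPU build of FAISS is distributed on the pytorch conda channel, so the install (assuming a conda environment) looks like:

```shell
# Install the CPU build of FAISS from the pytorch channel
conda install faiss-cpu -c pytorch
```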
@bishax Yes, sorry, I was about to commit those changes but got caught up in another PR...
Just pushed the changes (you can see all changes from the last couple of days here)
Refers to #326
Generate doc vectors using the `Text2VecTask` task, and then run a FAISS indexer to generate a link table of exact- and near-duplicate abstracts and PHR fields. Tasks:
This replaces the previous method of ingesting the data into one index on ES, then running doc similarity before filling a second index on ES. That process was quite laborious and also wasted an index and disk space. A bonus of this method is that we get the doc vectors.
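The deduplication step is roughly equivalent to this brute-force pure-Python sketch (the real pipeline uses FAISS over the `Text2VecTask` vectors at scale; all names here are illustrative):

```python
from itertools import combinations
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def link_table(vectors, threshold=0.5):
    """Return (id_a, id_b, score) rows for pairs at or above the threshold.

    FAISS does the equivalent at scale with an inner-product index over
    L2-normalised vectors; this O(n^2) version just shows the logic.
    """
    return [
        (a, b, cosine(vectors[a], vectors[b]))
        for a, b in combinations(sorted(vectors), 2)
        if cosine(vectors[a], vectors[b]) >= threshold
    ]
```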
These haven't been chained together yet (this will happen in the final PR in this series), but the commands to run the two pipelines are:
and
Using the "similarity score" (defined in detail here), the numbers of duplicates (without double-counting) in the NiH PHR field are:
Some examples (note that I've done zero cherry-picking here):
Score == 0.8 (near duplicates)
Score == 0.7 (probably duplicates)
Score == 0.6 (contextually very similar)
Score == 0.5 (contextually fairly similar)
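The bands above can be mapped to labels with a simple bucketing helper (the function name and exact labels are illustrative, following the thresholds listed):

```python
def similarity_label(score):
    """Map a similarity score to the duplicate bands described above."""
    bands = [
        (0.8, "near duplicates"),
        (0.7, "probably duplicates"),
        (0.6, "contextually very similar"),
        (0.5, "contextually fairly similar"),
    ]
    for threshold, label in bands:
        if score >= threshold:
            return label
    return "not similar"
```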