Closed jaklinger closed 4 years ago
nih_vectors breaks because querying for AbstractVector.application_id in nih_vectors.py:41 returns `[]`, which can't be unpacked by `done_ids, =`
Ah, thanks for catching this. "Fixed" it with an `except ValueError` for this case.
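The fix can be sketched like this (hypothetical names; the real query lives in nih_vectors.py:41):

```python
def get_done_ids(rows):
    """Unpack (application_id,) tuples from a query result,
    tolerating an empty result set.

    Sketch only: `rows` stands in for the result of querying
    AbstractVector.application_id.
    """
    try:
        # zip(*[]) yields nothing, so this raises ValueError on []
        done_ids, = zip(*rows)
    except ValueError:
        done_ids = ()
    return set(done_ids)
```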
`faiss-cpu -c pytorch`
Thanks, updated the reqs
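For reference, the CPU build of FAISS is distributed on the pytorch conda channel, so the install (assuming a conda environment) looks like:

```shell
# Install the CPU build of FAISS from the pytorch channel
conda install faiss-cpu -c pytorch
```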
@bishax Yes, sorry, I was about to commit those changes but got caught up in another PR...
Just pushed the changes (you can see all changes from the last couple of days here)
Refers to #326
Generate doc vectors using the `Text2VecTask` task, and then run a FAISS indexer to generate a link table of exact- and near-duplicate abstracts and PHR fields. Tasks:
This replaces the previous method of ingesting the data into one index on ES, then running doc similarity before filling a second index on ES. That process was quite laborious and also wasted an index and disk space. A bonus of this method is that we get the doc vectors.
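The deduplication step is roughly equivalent to this brute-force pure-Python sketch (the real pipeline uses FAISS over the `Text2VecTask` vectors at scale; all names here are illustrative):

```python
from itertools import combinations
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def link_table(vectors, threshold=0.5):
    """Return (id_a, id_b, score) rows for pairs at or above the threshold.

    FAISS does the equivalent at scale with an inner-product index over
    L2-normalised vectors; this O(n^2) version just shows the logic.
    """
    return [
        (a, b, cosine(vectors[a], vectors[b]))
        for a, b in combinations(sorted(vectors), 2)
        if cosine(vectors[a], vectors[b]) >= threshold
    ]
```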
These haven't been chained together yet (this will happen in the final PR in this series), but the commands to run the two pipelines are:
and
Using the "similarity score" (defined in detail here), the numbers of duplicates (without double-counting) in the NiH PHR field are:
Some examples (note that I've done zero cherry-picking here):
Score == 0.8 (near duplicates)
Score == 0.7 (probably duplicates)
Score == 0.6 (contextually very similar)
Score == 0.5 (contextually fairly similar)
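The bands above can be mapped to labels with a simple bucketing helper (the function name and exact labels are illustrative, following the thresholds listed):

```python
def similarity_label(score):
    """Map a similarity score to the duplicate bands described above."""
    bands = [
        (0.8, "near duplicates"),
        (0.7, "probably duplicates"),
        (0.6, "contextually very similar"),
        (0.5, "contextually fairly similar"),
    ]
    for threshold, label in bands:
        if score >= threshold:
            return label
    return "not similar"
```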