ohcnetwork / life

Verified Crowd-Sourced Emergency Services Directory
https://life.coronasafe.network/

De-dupe entries #95

Closed · avi-jain closed this issue 3 years ago

avi-jain commented 3 years ago

> build something that can identify duplicates, delete it from frontend and give it to us, we verify it and then put it back to the system.

Originally posted by @chicksim in https://github.com/coronasafe/life/issues/88#issuecomment-826361227

Let me know if this is what you want: a Python script that takes in a JSON file, finds the duplicates, deletes them, and writes out two files, a deduped JSON (used by the frontend) and the duplicate entries. If the duplicate entries look correct, you delete them from the backend (Airtable). Note: the cost of a false positive is higher than that of a false negative (you don't want to remove an entry that might not be a duplicate), so the fuzzy string matching should, imo, be lenient.
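For the fuzzy variant, here is a minimal sketch using only the standard library's difflib; the helper name and the 0.92 threshold are my assumptions and would need tuning against real entries:

```python
from difflib import SequenceMatcher

def is_probable_duplicate(name_a, name_b, threshold=0.92):
    """Conservative fuzzy equality for entry names.

    The threshold is an assumption, kept deliberately high: a false
    positive deletes a real listing, which costs more than leaving an
    occasional duplicate in place.
    """
    a, b = name_a.strip().lower(), name_b.strip().lower()
    if a == b:
        return True
    return SequenceMatcher(None, a, b).ratio() >= threshold

# e.g. is_probable_duplicate("City Hospital", "City  Hospital ") -> True
```

Keeping the threshold high is the lenient behaviour described above: only near-identical names get flagged, so a human reviewer sees few false positives.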

avi-jain commented 3 years ago
```python
from itertools import chain
import json
import os

DATA_KEY = "data"
ID_KEY = "id"
DATA_DIR = "data"


def read_json_data(filename, dedup_field="name", keep="createdTime"):
    """
    @param filename     name of the JSON file inside DATA_DIR
    @param dedup_field  field whose value identifies duplicate entries
    @param keep         column used to decide which entry survives
                        (the most recent timestamp is kept)
    @returns writes two files: the deduped JSON and the duplicate entries
    """
    with open(os.path.join("..", DATA_DIR, filename), "r") as f:
        data = json.load(f)

    # Group (timestamp, id) pairs by the dedup field.
    reverse_dict = {}
    for item in data[DATA_KEY]:
        reverse_dict.setdefault(item[dedup_field], []).append((item[keep], item[ID_KEY]))

    # Any key with more than one entry is a duplicate group.
    duplicates = {key: values for key, values in reverse_dict.items() if len(values) > 1}
    name_to_recent_map = {}
    name_to_duplicate_map = {}

    for name, values in duplicates.items():
        # Sort newest first (ISO timestamps sort lexicographically), so
        # index 0 is the entry we keep and the rest are duplicates.
        sorted_by_keep_col = sorted(values, key=lambda x: x[0], reverse=True)
        name_to_recent_map[name] = sorted_by_keep_col[0][1]  # id that survives
        name_to_duplicate_map[name] = [pair[1] for pair in sorted_by_keep_col[1:]]

    # Build the id set once instead of re-scanning a list for every entry.
    duplicate_ids = set(chain.from_iterable(name_to_duplicate_map.values()))
    new_data = [d for d in data[DATA_KEY] if d.get(ID_KEY) not in duplicate_ids]
    duplicate_data = [d for d in data[DATA_KEY] if d.get(ID_KEY) in duplicate_ids]

    # Dump the full object both times so the output keeps the input's shape.
    data[DATA_KEY] = new_data
    with open("deduped_" + filename, "w") as f:
        json.dump(data, f)
    data[DATA_KEY] = duplicate_data
    with open("duplicates_" + filename, "w") as f:
        json.dump(data, f)


if __name__ == "__main__":
    read_json_data("hospital_clinic_centre.json")
```
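For context, the script assumes an Airtable-export shape like the one below; the field values here are made up for illustration:

```python
# Assumed input shape (values invented for illustration):
sample = {
    "data": [
        {"id": "rec123", "name": "City Hospital", "createdTime": "2021-04-25T10:00:00.000Z"},
        {"id": "rec456", "name": "City Hospital", "createdTime": "2021-04-26T09:30:00.000Z"},
    ]
}
# With the script above, rec456 (the newer entry) survives in
# deduped_*.json and rec123 goes to duplicates_*.json for verification.
```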

@chicksim I hastily wrote this in case you want to use it. It does exact matching and can be extended to fuzzy matching. Create a scripts folder, dump this into a Python file, and run it with different file names (see the sketch below).
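As a small usage sketch, assuming read_json_data from the script above is in scope, the hard-coded file name could be replaced with command-line arguments so the same script runs over each data file:

```python
# Hypothetical entry point, e.g. saved as scripts/dedupe.py and run as:
#   python scripts/dedupe.py hospital_clinic_centre.json
import sys

if __name__ == "__main__":
    for fname in sys.argv[1:]:
        read_json_data(fname)
```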