```python
from itertools import chain
import json
import os

DATA_KEY = "data"
ID_KEY = "id"
DATA_DIR = "data"


def read_json_data(filename, dedup_field="name", keep="createdTime"):
    """
    @param filename     JSON file inside ../<DATA_DIR>
    @param dedup_field  field matched exactly to detect duplicates
    @param keep         timestamp column; the most recent entry is kept
    @returns            writes the deduped JSON and the duplicates to two files
    """
    with open(os.path.join("..", DATA_DIR, filename), "r") as f:
        data = json.load(f)

    # Group (timestamp, id) pairs by the dedup field.
    reverse_dict = {}
    for item in data[DATA_KEY]:
        reverse_dict.setdefault(item[dedup_field], []).append((item[keep], item[ID_KEY]))

    # Any key that maps to more than one record has duplicates.
    duplicates = {key: values for key, values in reverse_dict.items() if len(values) > 1}

    name_to_recent_map = {}     # id of the entry kept, per name (for reference)
    name_to_duplicate_map = {}  # ids of the entries to drop, per name
    for name, values in duplicates.items():
        # Sort newest first so index 0 is the entry to keep.
        sorted_by_keep_col = sorted(values, key=lambda x: x[0], reverse=True)
        name_to_recent_map[name] = sorted_by_keep_col[0][1]
        name_to_duplicate_map[name] = [pair[1] for pair in sorted_by_keep_col[1:]]

    duplicate_ids = set(chain.from_iterable(name_to_duplicate_map.values()))
    new_data = [d for d in data[DATA_KEY] if d.get(ID_KEY) not in duplicate_ids]
    duplicate_data = [d for d in data[DATA_KEY] if d.get(ID_KEY) in duplicate_ids]

    # Write the full object back so both outputs keep the original shape.
    data[DATA_KEY] = new_data
    with open("deduped_" + filename, "w") as f:
        json.dump(data, f)

    data[DATA_KEY] = duplicate_data
    with open("duplicates_" + filename, "w") as f:
        json.dump(data, f)


if __name__ == "__main__":
    read_json_data("hospital_clinic_centre.json")
```
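For context, the script assumes the input JSON has an Airtable-style shape like the sketch below; the concrete record values are made up for illustration:

```python
# Hypothetical example of the input shape the script expects: a top-level
# "data" list of records, each with an "id", a "name" to dedup on, and a
# "createdTime" used to decide which duplicate to keep.
example = {
    "data": [
        {"id": "rec1", "name": "City Hospital", "createdTime": "2021-04-20T10:00:00.000Z"},
        {"id": "rec2", "name": "City Hospital", "createdTime": "2021-04-22T09:30:00.000Z"},
    ]
}
```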
@chicksim hastily wrote this in case you want to use it. It does exact matching, but can be extended to fuzzy matching. Create a `scripts` folder, dump this in a Python file, and run it with different file names, as shown below.
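For example (any file name other than the one in `__main__` is a placeholder):

```python
# Run the dedup once per dataset file; "some_other_dataset.json" is hypothetical.
for fname in ["hospital_clinic_centre.json", "some_other_dataset.json"]:
    read_json_data(fname)
```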
> build something that can identify duplicates, delete it from frontend and give it to us, we verify it and then put it back to the system.

*Originally posted by @chicksim in https://github.com/coronasafe/life/issues/88#issuecomment-826361227*
Let me know if this is what you want: a Python script that takes in a JSON file, finds the duplicates, deletes them, and spits out two files, the deduped JSON (used by the frontend) and the duplicate entries. If the duplicate entries look correct, you delete them from the backend (Airtable). Note: the cost of a false positive is high (you don't want to remove an entry that might not actually be a duplicate), so the fuzzy string matching implementation should, imo, be lenient and err on the side of not flagging.
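As a possible extension (not part of the script above), the exact match on the dedup field could be swapped for fuzzy matching using the standard library's `difflib`. This is a minimal sketch; the 0.92 threshold is an assumption, chosen high to keep false positives rare:

```python
from difflib import SequenceMatcher


def is_probable_duplicate(a, b, threshold=0.92):
    """Return True if two names are similar enough to treat as duplicates.

    The threshold is illustrative: keep it high so similar-but-distinct
    names are NOT flagged, since a false positive (removing a real entry)
    is the costlier mistake here.
    """
    a, b = a.strip().lower(), b.strip().lower()
    return a == b or SequenceMatcher(None, a, b).ratio() >= threshold


# e.g. is_probable_duplicate("City Hospital", "City  Hospital ") -> True
```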