ltscomputingllc / faersdbstats

Generate safety signal statistics from FDA FAERS data
Apache License 2.0
45 stars 29 forks source link

Removal of duplicate cases - Same ISR but different caseid #8

Open elpidakon opened 5 years ago

elpidakon commented 5 years ago

In LAERS, a number of reports is duplicated, having the same ISR number but different CASE ID numbers (i.e. two different CASE ID numbers have been assigned to the same report). This issue is not addressed using the script derive_unique_all_case.sql . This can be checked by running the following query:

SELECT * 
FROM unique_all_casedemo
WHERE isr IN (SELECT isr
               FROM unique_all_casedemo
               GROUP BY isr HAVING COUNT(*) > 1)
ORDER BY isr;

In the final table unique_all_case, currently there exist 943 rows that have an ISR number that also appears in another row of the table (with a different CASE ID).

A suggested modification to the existing code is the following (after the step that creates the table unique_all_casedemo):

-- remove duplicates with same isr and different CASE ID
drop table if exists unique_all_casedemo_2;
create table unique_all_casedemo_2 as
select database, caseid, isr, caseversion, i_f_code, event_dt, age, sex, reporter_country, primaryid, drugname_list, reac_pt_list, fda_dt
from (
select *, 
row_number() over(partition by isr order by filename desc, caseid desc) as row_num 
from unique_all_casedemo
where isr is not null
) a where a.row_num = 1
union
select database, caseid, isr, caseversion, i_f_code, event_dt, age, sex, reporter_country, primaryid, drugname_list, reac_pt_list, fda_dt
from unique_all_casedemo where isr is null;

-- remove any duplicates based on fully populated matching demographic key fields and exact match on list of drugs and list of outcomes (FAERS reactions)
-- NOTE. when using this table for subsequent joins in the ETL process, join to FAERS data using primaryid and join to LAERS data using isr
drop table if exists unique_all_case;   
create table unique_all_case as
select caseid, case when isr is not null then null else primaryid end as primaryid, isr 
from (
    select caseid, primaryid,isr, 
    row_number() over(partition by event_dt, age, sex, reporter_country, drugname_list, reac_pt_list order by primaryid desc, database desc, fda_dt desc, i_f_code, isr desc) as row_num 
    from unique_all_casedemo_2 
    where caseid is not null and event_dt is not null and age is not null and sex is not null and reporter_country is not null and drugname_list is not null and reac_pt_list is not null
) a where a.row_num = 1
union 
select caseid, case when isr is not null then null else primaryid end as primaryid, isr 
from unique_all_casedemo_2 
where caseid is null or event_dt is null or age is null or sex is null or reporter_country is null or drugname_list is null or reac_pt_list is null;