sagitechls / SSN_SACE_2017_Jan

Merge files #10

Open Aravind-Parthiban opened 7 years ago

Aravind-Parthiban commented 7 years ago

When I merge all the files together, each record in the aggregated files (i.e. prescriber.summary, PUF.summary) is replicated as many times as its doc_id appears in the detailed files (prescriber.detailed, PUF.detailed).

The issues are:

  1. When all files are merged together, the replicated values get aggregated during EDA or any calculation, which gives wrong results.
  2. If I avoid the replication by merging only the aggregated files (prescriber.summary, PUF.summary), I lose the granularity of the drug details and a few other variables.
  3. Alternatively, is it advisable to proceed with only the PUF.pres_ consolidated. file (with its existing variables), which already exists?

Which approach is appropriate?

@Rajhan How do I go about it?

Rajhan commented 7 years ago

Each record in the summary files is unique. Records in the detailed files are not, so do drug-level and procedure-level aggregation on them. The consolidated table is derived from the other four files; use it as an example to build a dataset for each problem statement.
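A rough pandas sketch of that idea, in case it helps. The two small frames are made-up stand-ins for prescriber.summary and prescriber.detailed, and every column name except doc_id (`total_claims`, `drug_name`, `claim_count`) is an assumption, not the real schema. It shows the replication from a direct merge and how aggregating the detailed file to one row per doc_id first avoids it.

```python
import pandas as pd

# Hypothetical stand-ins for prescriber.summary and prescriber.detailed.
# Column names other than doc_id are assumptions for illustration.
summary = pd.DataFrame({"doc_id": [1, 2], "total_claims": [30, 50]})
detailed = pd.DataFrame({
    "doc_id": [1, 1, 2, 2, 2],
    "drug_name": ["AVONEX", "COPAXONE", "COPAXONE", "NAMENDA", "GILENYA"],
    "claim_count": [10, 20, 5, 25, 20],
})

# Direct merge: each summary row repeats once per matching detailed row,
# so summing a summary column double counts.
naive = summary.merge(detailed, on="doc_id", how="left")
naive_total = naive["total_claims"].sum()

# Drug-level aggregation: one row per doc_id, one column per drug.
drug_wide = detailed.pivot_table(
    index="doc_id", columns="drug_name", values="claim_count",
    aggfunc="sum", fill_value=0,
).add_prefix("claims_").reset_index()

# validate="one_to_one" raises if doc_id is duplicated on either side.
merged = summary.merge(drug_wide, on="doc_id", how="left", validate="one_to_one")
merged_total = merged["total_claims"].sum()

print(naive_total, merged_total)
```

The same pattern should apply to the procedure-level file: aggregate to one row per doc_id, then merge, so the summary values stay unreplicated while the drug-level detail is kept as extra columns.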

Problem statement: Utilization & Prescriber Data

Classification

  1. Prescribers of Avonex vs Copaxone
  2. Prescribers of Gilenya vs Copaxone
  3. Prescribers of Branded vs Generics Prediction
  4. Prescribers of Namenda
  5. Prescribers of Generics

For example, for Prescribers of Avonex vs Copaxone, take only the doctors who have prescribed Avonex and Copaxone, and consolidate all their other information. Reminder: when applying any classification algorithm, do not include doc_id as a feature.
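A minimal sketch of building such a dataset with pandas, under several assumptions: the frames and all column names except doc_id are hypothetical, "Avonex and Copaxone" is read as doctors who prescribed at least one of the two, and a doctor is labelled 1 when their Avonex claims exceed their Copaxone claims (otherwise 0). The labelling rule in particular is only one possible choice.

```python
import pandas as pd

# Hypothetical stand-ins for the summary and detailed files.
summary = pd.DataFrame({
    "doc_id": [1, 2, 3, 4],
    "specialty": ["Neurology", "Neurology", "Family Practice", "Neurology"],
    "total_claims": [120, 80, 200, 60],
})
detailed = pd.DataFrame({
    "doc_id": [1, 1, 2, 3, 4, 4],
    "drug_name": ["AVONEX", "NAMENDA", "COPAXONE", "NAMENDA", "AVONEX", "COPAXONE"],
    "claim_count": [15, 40, 22, 90, 5, 30],
})

targets = ["AVONEX", "COPAXONE"]

# Keep only rows for the two drugs, then one row per doc_id.
pair = detailed[detailed["drug_name"].isin(targets)]
per_doc = pair.pivot_table(
    index="doc_id", columns="drug_name", values="claim_count",
    aggfunc="sum", fill_value=0,
).reindex(columns=targets, fill_value=0)

# Assumed labelling rule: 1 if Avonex claims exceed Copaxone claims, else 0.
per_doc["label"] = (per_doc["AVONEX"] > per_doc["COPAXONE"]).astype(int)

# Inner join keeps only doctors who prescribed at least one target drug.
dataset = summary.merge(per_doc[["label"]].reset_index(), on="doc_id", how="inner")

# doc_id is an identifier, not a feature, so drop it before modelling.
X = dataset.drop(columns=["doc_id", "label"])
y = dataset["label"]

print(dataset)
```

The per-drug claim counts for the two target drugs are left out of `X` on purpose, since the label is derived from them.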