Variable Importance Plot from C5.0 algorithm (Avonex vs Copaxone)

The cross-validated model has an accuracy of ~95%!
Attributes from the Prescriber files show higher usage, and so appear more effective, than attributes from the PUF files.

Custom Features:

12 out of 15 features (highlighted in red) seem to contribute effectively towards building the model.
Usage is at 100% for (mean_cost_per_day_per_claim, generic_usage_score_top_2), while for (generic_usage_score_top_3) it is ~85%.
- mean_cost_per_day_per_claim is derived from the Prescriber_Detailed file
- The generic_usage_score features are also derived from the Prescriber_Detailed file
- Each one is binary and is based on the 'top_drug_n' attributes.
Usage is at ~25% for Switch_Likelihood.
- It is derived from whether a doc_id prescribes different drug_names that fall under the same generic_name, and from the proportion of all drugs prescribed that meet this criterion.
Usage is between 10% and 15% for (top_hcpcs_code_1, top_hcpcs_code_1_received_per_submitted_charge)
- They are both derived from the PUF_Detailed file, in a fashion similar to top_drug_n & top_drug_n_cost
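
To make the Switch_Likelihood description above more concrete, here is a rough sketch in Python. It is not the code from the repository, and the exact definition is an assumption on my part: for each doc_id, I count the distinct drug_names whose generic_name also covers at least one other drug_name prescribed by the same doc_id, and divide by the number of distinct drug_names that doc_id prescribes. The function name, the tuple layout and the sample rows are all made up for illustration.

```python
from collections import defaultdict

def switch_likelihood(rows):
    """Hypothetical helper: rows are (doc_id, drug_name, generic_name) tuples."""
    # doc_id -> generic_name -> set of distinct drug_names prescribed
    by_doc = defaultdict(lambda: defaultdict(set))
    for doc_id, drug_name, generic_name in rows:
        by_doc[doc_id][generic_name].add(drug_name)

    scores = {}
    for doc_id, generics in by_doc.items():
        # All distinct drugs this prescriber writes
        total = sum(len(names) for names in generics.values())
        # Drugs that share a generic_name with another drug from the same prescriber
        switched = sum(len(names) for names in generics.values() if len(names) > 1)
        scores[doc_id] = switched / total
    return scores

# Illustrative rows only, not taken from the Prescriber files
rows = [
    ("doc_1", "AVONEX", "INTERFERON BETA-1A"),
    ("doc_1", "REBIF", "INTERFERON BETA-1A"),
    ("doc_1", "COPAXONE", "GLATIRAMER ACETATE"),
    ("doc_2", "COPAXONE", "GLATIRAMER ACETATE"),
]
scores = switch_likelihood(rows)
```

Under this reading, a doc_id that prescribes only one brand per generic gets a score of 0, and the score rises as more of the prescribed drugs have a same-generic alternative that the doc_id also prescribes.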

@Rajhan, I have shared all the relevant code and RData files in my GitHub repository. Please check it out and share your feedback.

P.S.:

I have invited you as a collaborator on my GitHub repository.
Did you derive top_drug_n and its associated cost from the Prescriber_Detailed file?

sagitechls / SSN_SACE_2017_Jan