jvmncs / default-risk

Data exploration: bureau.csv #4

Open jvmncs opened 6 years ago

jvmncs commented 6 years ago

This issue is for exploring the bureau table. It's sequential data, so most of the exploration will involve analysing the time series of points associated with each applicant.

At a minimum, we'll want summary statistics about the time-series nature of this table. Two statistics come to mind: (1) the number/percentage of applicants with previous credits in bureau.csv, and (2) the average number of credits in the table per applicant with at least one credit. In particular, (2) will inform what kind of module we use to model the table (it's currently an LSTM, but that could change depending on these results).

Both can be computed with a few simple pandas functions, e.g. something roughly like

import pandas as pd
bureau = pd.read_csv('bureau.csv')
total_applicants = ... # get this number from the application_train.csv table
# (1) fraction of applicants with at least one previous credit in bureau.csv
print(bureau['applicant_id'].nunique() / total_applicants)
# (2) mean number of credits per applicant with at least one credit
counts = bureau.groupby('applicant_id').size()
print(counts.mean())

except with the proper column names 🙂
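
For anyone picking this up, here is a slightly fuller sketch of the same two statistics on made-up toy data. The column names `applicant_id` and `days_credit` are placeholders rather than the real bureau.csv columns, and the filtering step assumes bureau.csv may contain applicants that are not in application_train.csv, which would inflate (1) if left in.

```python
import pandas as pd

# Toy stand-ins for application_train.csv and bureau.csv (values are made up;
# `applicant_id` and `days_credit` are placeholder column names).
application_train = pd.DataFrame({"applicant_id": [1, 2, 3, 4, 5]})
bureau = pd.DataFrame({
    "applicant_id": [1, 1, 1, 2, 4, 4, 9],  # applicant 9 is not in the training table
    "days_credit": [-900, -400, -30, -120, -700, -60, -15],
})

# Keep only bureau rows whose applicant appears in the training table.
in_train = bureau[bureau["applicant_id"].isin(application_train["applicant_id"])]

# (1) fraction of training applicants with at least one previous credit
n_applicants = application_train["applicant_id"].nunique()
frac_with_credit = in_train["applicant_id"].nunique() / n_applicants

# (2) number of credits per applicant, among those with at least one
credits_per_applicant = in_train.groupby("applicant_id").size()
mean_credits = credits_per_applicant.mean()

print(frac_with_credit)
print(mean_credits)
# The full distribution (min/median/max) is probably more useful than the mean
# alone when deciding how to pad or truncate sequences for a sequence model.
print(credits_per_applicant.describe())
```

Looking at the whole distribution of sequence lengths, not just the mean, should make it easier to judge whether a recurrent module is worth it or whether simple per-applicant aggregates would do.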

We'll also need a good understanding of each feature. In particular, any systematic missingness should be made clear by this task. Hopefully, by the end of it we'll know how we want to represent each feature in the time series, so that we can preprocess accordingly.
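
One possible starting point for the missingness check, again on made-up data with placeholder column names (`applicant_id`, `amt_overdue`, `credit_type` are hypothetical): compute the fraction of missing values per column, then per applicant, to see whether gaps are spread evenly or concentrated in particular applicants.

```python
import numpy as np
import pandas as pd

# Toy stand-in for bureau.csv; column names and values are hypothetical.
bureau = pd.DataFrame({
    "applicant_id": [1, 1, 2, 3, 3, 3],
    "amt_overdue": [0.0, np.nan, np.nan, 10.0, np.nan, 0.0],
    "credit_type": ["card", "loan", None, "card", "card", "loan"],
})

# Fraction of missing values in each column, most-missing first.
missing_by_column = bureau.isna().mean().sort_values(ascending=False)

# Fraction of missing values per applicant for each feature. Applicants whose
# entire history is missing for a feature show up with a value of 1.0.
missing_by_applicant = (
    bureau.drop(columns="applicant_id")
    .isna()
    .groupby(bureau["applicant_id"])
    .mean()
)

print(missing_by_column)
print(missing_by_applicant)
```

If a feature is missing for whole applicants rather than for scattered rows, that probably calls for a different representation (e.g. an explicit indicator) than simple imputation, but that is a judgment call to make once the real table has been inspected.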

There are some kernels available exploring this table, although there will be fewer than are available for application_train.csv.