Closed sonali-sr closed 1 year ago
I've dropped the debar_is_repeated column and it shows like that in the dataframe as well, but the shape is still being given as 97x6. Is it the drop function that's not working as I intend it to?
Code:
debar_clean.drop(columns = "debar_is_repeated")
debar_clean.shape
Hi @sanhatahir - I think you just have to save the result, since drop returns a new dataframe rather than changing debar_clean in place. I would recommend trying: debar_clean = debar_clean.drop(columns = "debar_is_repeated")
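A minimal sketch of why the reassignment matters, using a made-up three-row stand-in for debar_clean (the values below are assumptions for illustration, not the course data):

```python
import pandas as pd

# Toy stand-in for debar_clean; names and values are invented for illustration.
debar_clean = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "debar_is_repeated": [True, False, False],
})

# drop() returns a new dataframe; without reassignment the original is unchanged.
debar_clean.drop(columns="debar_is_repeated")
shape_without_assignment = debar_clean.shape

# Saving the result back to the same name makes the change stick.
debar_clean = debar_clean.drop(columns="debar_is_repeated")
shape_with_assignment = debar_clean.shape

print(shape_without_assignment, shape_with_assignment)
```

The first shape still counts the dropped column, while the second does not.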
Oh, you're completely right, I should get some sleep. Thanks so much!
I'm stuck on this question (part B). I don't really understand how/why to use pd.concat in this case, so I've been trying to use a for loop but keep getting an error. Any advice?
For question 1.4, why would the result be 94 instead of 97? I just deleted all the viol2 rows.
The pd.concat hint corresponds to an approach where we:
- go back to the original debar data
- use row filtering to separate that data into two dataframes: (1) debar_keepallviols and (2) debar_keepviol1. The first is defined by having a name not in the list of duplicated names; the second by having a name in the list of duplicated names and viol_num == "viol1". So you can do something like the following (you can get dup_names just from dup.Name and don't necessarily need the list and values commands in your code):
  debar_keepviol1 = debar[(debar.Name.isin(dup_names)) & (debar.viol_num == "viol1")].copy()
  debar_keepallviols = debar[~debar.Name.isin(dup_names)].copy()
- use pd.concat, which stacks two dataframes together, to put them back together: https://pandas.pydata.org/docs/reference/api/pandas.concat.html
The for loop is the correct intuition, but I'm not sure which part of the loop is creating the error (the if part or the drop part). The row filtering above does what the for loop is aiming at more succinctly, so I'd try the row filtering approach and post follow-up questions as needed.
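The filter-then-stack steps above can be sketched end to end on toy data. The five rows and the single duplicated name below are assumptions made up for illustration; only the column names Name and viol_num and the two filtering lines come from the thread.

```python
import pandas as pd

# Toy long-format data: employers A and B have two violations, C has one.
debar = pd.DataFrame({
    "Name": ["A", "A", "B", "B", "C"],
    "viol_num": ["viol1", "viol2", "viol1", "viol2", "viol1"],
})

# Stand-in for dup.Name: assume only employer A was flagged as a duplicate.
dup_names = pd.Series(["A"])

# Flagged employers: keep only the first violation.
debar_keepviol1 = debar[(debar.Name.isin(dup_names)) & (debar.viol_num == "viol1")].copy()

# Everyone else: keep all rows.
debar_keepallviols = debar[~debar.Name.isin(dup_names)].copy()

# Stack the two pieces back together row-wise.
debar_clean = pd.concat([debar_keepallviols, debar_keepviol1], axis=0)

print(debar_clean.shape)
print(debar_clean.Name.nunique())
```

Because only the flagged employer loses its viol2 row, the row count stays above the number of unique names, which is the same pattern as 97 rows for 94 employers.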
Ah this makes sense and worked - thanks so much! When I tried concat initially, I was doing it as the first step and couldn't figure it out from there. Doing the subsetting of the dfs first makes much more sense.
Question from student:
For question 1.3, is there a reason why I got 94 instead of 97?
response:
There are 94 unique employers but 97 rows. For employers whose rows are not duplicated (so employers who do NOT have the same start and end date for the violation), you should retain both viol1 and viol2. Your code screenshot seems to remove viol2 for all employers, which is not correct.
1.4 Filter out duplicates from original debar data (6 points)
A. Using mult_debar_wide, add a column is_dup that takes value of True for cases where start_date_viol1 == start_date_viol2, marking the row as a duplicate
B. Going back to the original long-format data you loaded at the beginning (debar):
- For employers where is_dup == True as indicated by your wide-format dataframe, only keep violnum == viol1
- For all other employers (so is_dup == False and ones we didn't need to check duplicates for), keep all violnum
- Remove the is_repeated column from the debar data
Hint: you can complete part B without a for loop; pd.concat with axis = 0 (row binding) is one way
Call the resulting dataframe debar_clean and print the shape and # of unique employer names
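For part A, one possible sketch is a vectorized comparison of the two start-date columns. The two-row wide dataframe here is invented for illustration, and the assumption that mult_debar_wide carries a Name column is mine rather than something stated in the question text.

```python
import pandas as pd

# Toy wide-format data: one row per employer with both violation start dates.
# The Name column and the date values are assumptions for illustration.
mult_debar_wide = pd.DataFrame({
    "Name": ["A", "B"],
    "start_date_viol1": ["2014-01-01", "2014-01-01"],
    "start_date_viol2": ["2014-01-01", "2015-06-01"],
})

# True where both violations share a start date, i.e. the row is a duplicate.
mult_debar_wide["is_dup"] = (
    mult_debar_wide["start_date_viol1"] == mult_debar_wide["start_date_viol2"]
)

# Names of flagged employers, usable with isin() when filtering the long data.
dup_names = mult_debar_wide.loc[mult_debar_wide["is_dup"], "Name"]

print(mult_debar_wide)
print(dup_names.tolist())
```

The resulting dup_names can then feed the row filtering in part B.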