How to account for overlap? For example, CDA is different from C-CDA, but C-CDA counts probably include all instances of CDA.
Tasks
[x] 1. C-CDA should not include CDA: The results in df_raw (zulip_raw_results.csv) for CDA are too high, I think; they likely contain all instances of C-CDA. There are two approaches: (1) We could leave df_raw as it is and only update the counts. It would be simple enough to recalculate the count for CDA by subtracting the C-CDA count. (2) We could text-mine the messages in df_raw['content']. For all the C-CDA rows, we'd look at the text in the content field; if it contains the plain word "CDA" but no instance of "C-CDA", we do not want that row in the dataframe, so we should remove all such rows from df_raw. If we do that, the report1_counts that are generated will be correct.
[ ] 2. Check/fix for other similar cases?: @rohaher I don't think it's critical that you do this now, but if you have time and want to, you can start on it. After you've finished with (1), you can write more generalized logic. The idea is to find every case in which the text of one keyword is fully contained within another keyword, the way CDA is fully contained within C-CDA. You would iterate over all of the category_keywords, and for each of these (a) keywords, iterate over the entire list again as (b) keywords and check whether (b) is contained within (a). Here's an article showing several methods on how to do that. I haven't thought much about what comes after that, but you should be able to repurpose your logic from (1).
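A minimal sketch of approach (2) from task 1, using only the standard library so the matching rule is easy to test in isolation. The helper name `contains_standalone`, the `keyword` field, the `LONGER_KEYWORDS` mapping, and the sample messages are all hypothetical; only the `content` field comes from the description above. The rule here is an assumption: a row is kept only if its content mentions the row's keyword outside of any longer keyword that contains it.

```python
import re


def contains_standalone(content, keyword, longer_keywords=()):
    """Return True if `content` mentions `keyword` outside any longer keyword."""
    # Blank out longer keywords first, so "C-CDA" cannot count as "CDA".
    for longer in sorted(longer_keywords, key=len, reverse=True):
        content = re.sub(re.escape(longer), " ", content, flags=re.IGNORECASE)
    # Require non-word characters (or string edges) around the keyword.
    pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
    return re.search(pattern, content, flags=re.IGNORECASE) is not None


# Hypothetical rows standing in for df_raw; the "keyword" field is an assumption.
rows = [
    {"keyword": "C-CDA", "content": "Validating a C-CDA document"},
    {"keyword": "C-CDA", "content": "Plain CDA R2 question"},
    {"keyword": "CDA", "content": "Is C-CDA based on this?"},
    {"keyword": "CDA", "content": "CDA vs C-CDA templates"},
]

# Hypothetical mapping: for each keyword, the longer keywords that contain it.
LONGER_KEYWORDS = {"CDA": ["C-CDA"]}

kept = [
    row
    for row in rows
    if contains_standalone(
        row["content"], row["keyword"], LONGER_KEYWORDS.get(row["keyword"], ())
    )
]
```

With pandas, the same check could be applied row by row (for example via `df_raw.apply(..., axis=1)`) to build a boolean mask, assuming df_raw has a column that records which keyword produced each row.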
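The nested loop described in task 2 could look roughly like the sketch below. It assumes category_keywords can be flattened into a plain list of strings (its real structure is not shown here), and the function name `find_contained_keywords` and the sample keywords are made up for illustration. The comparison is case-insensitive, which is also an assumption.

```python
def find_contained_keywords(keywords):
    """Map each keyword to the longer keywords that contain it as a substring."""
    overlaps = {}
    for outer in keywords:      # (a) the keyword that may contain another
        for inner in keywords:  # (b) the keyword that may be contained
            if inner != outer and inner.lower() in outer.lower():
                overlaps.setdefault(inner, []).append(outer)
    return overlaps


# Hypothetical sample; the real list would come from category_keywords.
sample_keywords = ["CDA", "C-CDA", "FHIR", "SMART on FHIR", "HL7"]

overlaps = find_contained_keywords(sample_keywords)
for inner, outers in overlaps.items():
    print(f"{inner!r} is contained in: {outers}")
```

The resulting mapping lists, for each short keyword, the longer keywords whose matches would need to be excluded, which is the input the logic from (1) needs. The plain `in` test is only one of several possible containment checks; it is used here because it is the simplest.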