process_data: add comments, refactor code, and add tests

I set out to add code comments that make the data preprocessing script easier to understand, e.g., which data are removed and which are kept for the scoring algorithms. I ended up refactoring the code and adding tests for the process_data.py script as well. Below I start with result validation, then explain the changes I made, and finally compare the logs.

Result validation

The preprocess_data function returns three dataframes: notes, ratings, and noteStatusHistory. These dataframes are the input to the scoring algorithms. I validated my results by writing the intermediate dataframes to disk and diffing them against the output of the latest release version (c7db275). The intermediate dataframes are exactly the same.
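For anyone who wants to repeat this kind of check, here is a minimal sketch of a byte-level comparison between two sets of dumped files. It is not the exact procedure I ran (I used a plain diff); the function name `dumps_match` and the idea of one dump directory per version are assumptions made for illustration.

```python
import filecmp
from pathlib import Path


def dumps_match(baseline_dir, candidate_dir, names):
    """Map each file name to True if both copies are byte-identical.

    Hypothetical helper: assumes each version wrote its intermediate
    dataframes to its own directory under the same file names.
    """
    results = {}
    for name in names:
        baseline = Path(baseline_dir) / name
        candidate = Path(candidate_dir) / name
        # shallow=False compares file contents, not just os.stat() metadata.
        results[name] = filecmp.cmp(baseline, candidate, shallow=False)
    return results
```

A call such as `dumps_match("out_release", "out_pr", ["notes.tsv", "ratings.tsv", "noteStatusHistory.tsv"])` (directory and file names are placeholders) should return True for every entry when the refactor leaves the output unchanged.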

Add code comments and logs to explain what notes/ratings are kept or removed

I think the current code comments and logs can be improved to help users understand the data preprocessing step. As I understand it, preprocessing does the following:

1. Notes that have no ratings are removed.
2. Ratings with no helpfulness label (i.e., helpfulKey=0 and notHelpfulKey=0) are removed.
3. In the _filter_misleading_notes function, the logic does the following:
   - For deleted notes (c.classificationKey is NaN):
     - Keep ratings of notes that appear in noteStatusHistory (previously scored)
     - Remove ratings of notes that do not appear in noteStatusHistory
   - For still available notes (c.classificationKey is either MISINFORMED_OR_POTENTIALLY_MISLEADING or NOT_MISLEADING):
     - Keep ratings of notes saying the associated tweet is misleading
     - For those saying the associated tweet is not misleading:
       - Keep ratings after the new UI launch time, c.notMisleadingUILaunchTime
       - Remove ratings before the new UI launch time, c.notMisleadingUILaunchTime
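The nested rule in _filter_misleading_notes can be restated as a single per-rating decision. The sketch below is only an illustration of that rule, not the code in process_data.py: the function `keep_rating`, its parameters, and the placeholder launch time are hypothetical. Which timestamp is compared and how the exact boundary is handled are also assumptions.

```python
from typing import Optional

# Hypothetical stand-ins for values defined in the project's constants module.
MISLEADING = "MISINFORMED_OR_POTENTIALLY_MISLEADING"
NOT_MISLEADING = "NOT_MISLEADING"
UI_LAUNCH_MILLIS = 1_000  # placeholder, NOT the real c.notMisleadingUILaunchTime


def keep_rating(
    classification: Optional[str],
    in_status_history: bool,
    timestamp_millis: int,
) -> bool:
    """Decide whether a rating is kept, given facts about its note.

    classification is None for a deleted note (NaN in the dataframe).
    """
    if classification is None:
        # Deleted note: keep only if it was previously scored.
        return in_status_history
    if classification == MISLEADING:
        # Note says the tweet is misleading: always keep.
        return True
    if classification == NOT_MISLEADING:
        # Note says the tweet is not misleading: keep only after the UI launch.
        # Strict ">" is an assumption about the boundary case.
        return timestamp_millis > UI_LAUNCH_MILLIS
    raise ValueError(f"unexpected classification: {classification!r}")


# One example per leaf of the decision tree above.
examples = [
    (None, True, 500),              # deleted, in noteStatusHistory
    (None, False, 500),             # deleted, not in noteStatusHistory
    (MISLEADING, False, 500),       # available, misleading
    (NOT_MISLEADING, True, 2_000),  # available, not misleading, after launch
    (NOT_MISLEADING, True, 500),    # available, not misleading, before launch
]
decisions = [keep_rating(*row) for row in examples]
```

Writing the rule as one predicate makes it clear that every rating falls into exactly one of the five leaves, which is why the counts in the log can be checked against each other.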

I added logs for each of the three steps, showing how the row counts change (see the new log output below). I also added test cases for step 3, because the numbers should add up. The tests are here.
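To make "the numbers should add up" concrete, the sketch below re-checks the figures from the new log output further down. It is a standalone illustration, not the tests added in this PR; the variable names and the helper `add` are made up here, and every number is copied from that log.

```python
# (ratings, notes) pairs copied from the new log output.
start = (32896425, 559289)
deleted = (1700602, 43130)
deleted_kept = (1649230, 40531)
deleted_removed = (51372, 2599)
available = (31195823, 516159)
misleading_kept = (24108295, 383748)
not_misleading = (7087528, 132411)
not_misleading_kept = (7019733, 129156)
not_misleading_removed = (67795, 3255)
final = (32777258, 553435)


def add(*pairs):
    """Element-wise sum of (ratings, notes) pairs."""
    return tuple(sum(column) for column in zip(*pairs))


checks = {
    # Every rating is on either a deleted or a still available note.
    "start": add(deleted, available) == start,
    # Each branch splits into its kept and removed parts.
    "deleted": add(deleted_kept, deleted_removed) == deleted,
    "available": add(misleading_kept, not_misleading) == available,
    "not_misleading": add(not_misleading_kept, not_misleading_removed) == not_misleading,
    # The three "Keep" lines together give the final counts.
    "final": add(deleted_kept, misleading_kept, not_misleading_kept) == final,
}
assert all(checks.values()), checks
```

The same kind of identity holds for the earlier steps: 565675 - 559305 = 6370 notes without ratings, and 32897041 - 32896425 = 616 ratings without helpfulness labels.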

Log comparison

The log comparison below is based on the data released on 2024-02-07. The new log shows the row counts of the dataframes read from the provided tsv files, followed by a detailed history of how the row counts change through Steps 1-3 above.

Previous log output from the latest release version (c7db275)

Timestamp of latest rating in data:  2024-02-07 16:02:45.464000
Timestamp of latest note in data:  2024-02-06 23:59:50.903000
total notes added to noteStatusHistory: 135
Preprocess Data: Filter misleading notes, starting with 32896425 ratings on 559289 notes
  Keeping 24108295 ratings on 383748 misleading notes
  Keeping 1649230 ratings on 40531 deleted notes that were previously scored (in note status history)
  Removing 67795 ratings on 3255 older notes that aren't deleted, but are not-misleading.
  Removing 51372 ratings on 2599 notes that were deleted and not in note status history (e.g. old).
Num Ratings: 32777258, Num Unique Notes Rated: 553435, Num Unique Raters: 417291

New log output from my PR

Timestamp of latest rating in data:  2024-02-07 16:02:45.464000
Timestamp of latest note in data:  2024-02-06 23:59:50.903000
Original row numbers from provided tsv files
   notes: 565675
   ratings: 32897041
   noteStatusHistory: 629615

After removing duplicates, there are 565675 notes and 32897041 ratings from 559305 notes
  Thus, 6370 notes have no ratings yet, removed...
After populating helpfulNumKey, there are 32896425 ratings from 559289 notes
  Thus, 616 ratings have no helpfulness labels (i.e., helpfulKey=0 and notHelpfulKey=0), removed...
total notes added to noteStatusHistory: 135
Finished filtering misleading notes
Preprocess Data: Filter misleading notes, starting with 32896425 ratings on 559289 notes
For 1700602 ratings on 43130 deleted notes
  Keep 1649230 ratings on 40531 deleted notes that are in noteStatusHistory (e.g., previously scored)
  Remove 51372 ratings on 2599 deleted notes that are not in noteStatusHistory (e.g., old)
For 31195823 ratings on 516159 still available notes
  Keep 24108295 ratings on 383748 available notes saying the associated tweet is misleading
  For 7087528 ratings on 132411 available notes saying the associated tweet is not misleading
    Keep 7019733 ratings on 129156 available and not misleading notes, and after the new UI launch time
    Remove 67795 ratings on 3255 available and not misleading notes, and before the new UI launch time
After data preprocess, Num Ratings: 32777258, Num Unique Notes Rated: 553435, Num Unique Raters: 417291
