twitter / communitynotes

Documentation and source code powering Twitter's Community Notes
https://twitter.github.io/communitynotes
Apache License 2.0
1.48k stars 213 forks source link

Division by zero error #179

Closed christosporios closed 11 months ago

christosporios commented 11 months ago

I've unfortunately lost the full output logs, but this line divided by zero and caused the script to fail:

https://github.com/twitter/communitynotes/blame/d3f0ea19b55fac35a0bac0bda7c33d73425fed5e/sourcecode/scoring/note_status_history.py#L181C43-L181C54

Specifically, len(oldCrhNotes) was zero. I'll try to find time to debug this later, but opening this issue in case this is obvious to someone with more context (@bradmiller ?). Noting that this with the most recent (as of today) data files and python 3.7.9.

jbaxter commented 11 months ago

We are not seeing this issue in production so I am wondering what data files you had downloaded and what command you used to invoke the run

christosporios commented 11 months ago

I ran main.py with the right dependencies and python version, as specified in the README. The data files are below, and they come from https://twitter.com/i/communitynotes/download-data.

noteStatusHistory-00000.tsv  ratings-00000.tsv  ratings-00002.tsv  userEnrollment-00000.tsv
notes-00000.tsv              ratings-00001.tsv  ratings-00003.tsv
christosporios commented 11 months ago

Should I be joining ratings-0000{0-3}.tsv together before running this?

avalanchesiqi commented 11 months ago

I also saw the same issue in my reproduction. I have tried only one rating file and multiple rating files, both yielded the same exception. In my case, I commented out the assert statement and then everything worked. But that may be my hack.

haelyons commented 11 months ago

Also saw this issue when running a build with the files provided at on the Community Notes Twitter with the following traceback:

Traceback (most recent call last):
  File "main.py", line 17, in <module>
    main()
  File "/home/hal/communitynotes/sourcecode/scoring/runner.py", line 105, in main
    dataLoader=dataLoader if args.parallel == True else None,
  File "/home/hal/communitynotes/sourcecode/scoring/run_scoring.py", line 702, in run_scoring
    noteStatusHistory, scoredNotes
  File "/home/hal/communitynotes/sourcecode/scoring/note_status_history.py", line 234, in update_note_status_history
    _check_flips(mergedStatuses)
  File "/home/hal/communitynotes/sourcecode/scoring/note_status_history.py", line 181, in _check_flips
    (len(newCrhNotes - oldCrhNotes) / len(oldCrhNotes)) < maxCrhChurn
ZeroDivisionError: division by zero
avalanchesiqi commented 11 months ago

I had some success in investigating the code related to this issue.

Quick workaround to bypass this issue

Comment out these two lines. It won't perform the check and no errors will be thrown.

More investigations

The real question here is whether these assertion tests work as they are designed. I will do a bit line by line analysis

note_status_history.py

233  if len(mergedStatuses) > c.minNumNotesForProdData:
234    _check_flips(mergedStatuses)

mergedStatuses is a dataframe that combines oldNoteStatusHistory and newNoteStatusHistory. If you use the published Community Notes data before Nov 25 2023, you will have a dataframe with 401490 entries. This is way over the preset threshold minNumNotesForProdData = 200. So the _check_flips(mergedStatuses) line will be executed.

The problem is in those lines. And I suspect they are not working as expected, or at least there is a difference between the production environment and local setup which causes the exception.

170  # Prune to unlocked notes.
171  mergedStatuses = mergedStatuses[mergedStatuses[c.timestampMillisOfStatusLockKey].isna()]
172  # Identify new and old CRH notes.
173  oldCrhNotes = frozenset(
174    mergedStatuses[mergedStatuses[c.currentLabelKey] == c.currentlyRatedHelpful][c.noteIdKey]
175  )
176  newCrhNotes = frozenset(
177    mergedStatuses[mergedStatuses[c.finalRatingStatusKey] == c.currentlyRatedHelpful][c.noteIdKey]
178  )

Line 171 will filter out mergedStatuses that has timestampMillisOfStatusLock equals to na. This is to find out new notes and can be found in the training log. I have 8 new notes that will be updated to the noteStatusHistory file.

total notes added to noteStatusHistory: 8

Line 174 is the line causing the error. It selects currentStatus="CURRENTLY_RATED_HELPFUL" notes in these 8 new notes. By definition, it will be zero because these notes are new and do not have a currentStatus yet. They are all na.

My local log below

>>> mergedStatuses[mergedStatuses['timestampMillisOfStatusLock'].isna()][["currentStatus", "finalRatingStatus"]]
       currentStatus   finalRatingStatus
401481           NaN  NEEDS_MORE_RATINGS
401482           NaN  NEEDS_MORE_RATINGS
401483           NaN  NEEDS_MORE_RATINGS
401484           NaN  NEEDS_MORE_RATINGS
401485           NaN  NEEDS_MORE_RATINGS
401486           NaN  NEEDS_MORE_RATINGS
401487           NaN  NEEDS_MORE_RATINGS
401488           NaN  NEEDS_MORE_RATINGS

In fact both oldCrhNotes and newCrhNotes will be empty.

@jbaxter thoughts?

jbaxter commented 11 months ago

Thank you all for tracking this down! The root cause was quite subtle: it looks like the script we use to process our internal TSV files for public release has replacing NaNs with -1 instead of empty string as they were in the internal code. I've fixed that export script so the code should run properly if you re-download the data files (or commenting out this assert works too).

jbaxter commented 11 months ago

Hmm, seems like we released another version of the file with -1s. Should be fixed upon the next release (in a few hours)