Closed christosporios closed 11 months ago
We are not seeing this issue in production so I am wondering what data files you had downloaded and what command you used to invoke the run
I ran main.py
with the right dependencies and python version, as specified in the README. The data files are below, and they come from https://twitter.com/i/communitynotes/download-data.
noteStatusHistory-00000.tsv ratings-00000.tsv ratings-00002.tsv userEnrollment-00000.tsv
notes-00000.tsv ratings-00001.tsv ratings-00003.tsv
Should I be joining ratings-0000{0-3}.tsv
together before running this?
I also saw the same issue in my reproduction. I have tried only one rating file and multiple rating files, both yielded the same exception. In my case, I commented out the assert
statement and then everything worked. But that may be my hack.
Also saw this issue when running a build with the files provided at on the Community Notes Twitter with the following traceback:
Traceback (most recent call last):
File "main.py", line 17, in <module>
main()
File "/home/hal/communitynotes/sourcecode/scoring/runner.py", line 105, in main
dataLoader=dataLoader if args.parallel == True else None,
File "/home/hal/communitynotes/sourcecode/scoring/run_scoring.py", line 702, in run_scoring
noteStatusHistory, scoredNotes
File "/home/hal/communitynotes/sourcecode/scoring/note_status_history.py", line 234, in update_note_status_history
_check_flips(mergedStatuses)
File "/home/hal/communitynotes/sourcecode/scoring/note_status_history.py", line 181, in _check_flips
(len(newCrhNotes - oldCrhNotes) / len(oldCrhNotes)) < maxCrhChurn
ZeroDivisionError: division by zero
I had some success in investigating the code related to this issue.
Comment out these two lines. It won't perform the check and no errors will be thrown.
The real question here is whether these assertion tests work as they are designed. I will do a bit line by line analysis
233 if len(mergedStatuses) > c.minNumNotesForProdData:
234 _check_flips(mergedStatuses)
mergedStatuses
is a dataframe that combines oldNoteStatusHistory
and newNoteStatusHistory
. If you use the published Community Notes data before Nov 25 2023, you will have a dataframe with 401490 entries. This is way over the preset threshold minNumNotesForProdData = 200
. So the _check_flips(mergedStatuses)
line will be executed.
The problem is in those lines. And I suspect they are not working as expected, or at least there is a difference between the production environment and local setup which causes the exception.
170 # Prune to unlocked notes.
171 mergedStatuses = mergedStatuses[mergedStatuses[c.timestampMillisOfStatusLockKey].isna()]
172 # Identify new and old CRH notes.
173 oldCrhNotes = frozenset(
174 mergedStatuses[mergedStatuses[c.currentLabelKey] == c.currentlyRatedHelpful][c.noteIdKey]
175 )
176 newCrhNotes = frozenset(
177 mergedStatuses[mergedStatuses[c.finalRatingStatusKey] == c.currentlyRatedHelpful][c.noteIdKey]
178 )
Line 171 will filter out mergedStatuses
that has timestampMillisOfStatusLock
equals to na
. This is to find out new notes and can be found in the training log. I have 8 new notes that will be updated to the noteStatusHistory
file.
total notes added to noteStatusHistory: 8
Line 174 is the line causing the error. It selects currentStatus="CURRENTLY_RATED_HELPFUL"
notes in these 8 new notes. By definition, it will be zero because these notes are new and do not have a currentStatus
yet. They are all na
.
My local log below
>>> mergedStatuses[mergedStatuses['timestampMillisOfStatusLock'].isna()][["currentStatus", "finalRatingStatus"]]
currentStatus finalRatingStatus
401481 NaN NEEDS_MORE_RATINGS
401482 NaN NEEDS_MORE_RATINGS
401483 NaN NEEDS_MORE_RATINGS
401484 NaN NEEDS_MORE_RATINGS
401485 NaN NEEDS_MORE_RATINGS
401486 NaN NEEDS_MORE_RATINGS
401487 NaN NEEDS_MORE_RATINGS
401488 NaN NEEDS_MORE_RATINGS
In fact both oldCrhNotes
and newCrhNotes
will be empty.
@jbaxter thoughts?
Thank you all for tracking this down! The root cause was quite subtle: it looks like the script we use to process our internal TSV files for public release has replacing NaNs with -1 instead of empty string as they were in the internal code. I've fixed that export script so the code should run properly if you re-download the data files (or commenting out this assert works too).
Hmm, seems like we released another version of the file with -1s. Should be fixed upon the next release (in a few hours)
I've unfortunately lost the full output logs, but this line divided by zero and caused the script to fail:
https://github.com/twitter/communitynotes/blame/d3f0ea19b55fac35a0bac0bda7c33d73425fed5e/sourcecode/scoring/note_status_history.py#L181C43-L181C54
Specifically,
len(oldCrhNotes)
was zero. I'll try to find time to debug this later, but opening this issue in case this is obvious to someone with more context (@bradmiller ?). Noting that this with the most recent (as of today) data files and python 3.7.9.