mediacloud / web-tools

The shared repository for Media Cloud web apps (Explorer, Source Manager, Topic Mapper)
https://tools.mediacloud.org
Apache License 2.0

include `date_is_reliable` in topic CSV download? #1647

Closed rahulbot closed 2 years ago

rahulbot commented 5 years ago

Aashka is wondering if we can help give better clues for when dates can be trusted and when not. For instance, in her UN SDGs topic she saw a spike on a date, but it turned out to be just many stories incorrectly dated.

We use the `date_is_reliable` column to show story dates in the web interface in italics with a "?" after them. Should we include this Boolean variable in the download CSV? @hroberts how is this attribute filled in? What does it mean?
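For discussion, here is a minimal sketch of what including the flag in the download might look like. It is not the actual web-tools export code: the function name `stories_to_csv` and every column other than `date_is_reliable` are assumptions made for illustration.

```python
import csv
import io

# Hypothetical column list; only `date_is_reliable` is named in this thread.
COLUMNS = ["stories_id", "title", "publish_date", "date_is_reliable"]


def stories_to_csv(stories):
    """Serialize story dicts to CSV text, including the reliability flag."""
    buffer = io.StringIO()
    # extrasaction="ignore" drops any keys that are not in COLUMNS.
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, extrasaction="ignore")
    writer.writeheader()
    for story in stories:
        row = dict(story)
        # Write the Boolean as 1/0 so it is easy to filter in a spreadsheet.
        # Assumption: a missing flag is treated as "not reliable".
        row["date_is_reliable"] = 1 if story.get("date_is_reliable") else 0
        writer.writerow(row)
    return buffer.getvalue()
```

With a column like this, someone looking at a spike could filter the CSV to rows where `date_is_reliable` is 0 and see how much of the spike depends on guessed dates.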

hroberts commented 5 years ago

we add that flag unless we got the date from an rss feed (or some similar structured syndication format) or from one of a small number of date guessing methods that we consider to be very reliable (such as a date stub in the url). for all other methods combined, our date guessing is around 80% accurate to within a day. the best defense against this is just for Aashka to have a sense that if she sees a big weird date spike, she should consider date guessing as the cause.
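as a rough sketch of that rule (not the real backend code), reading it as "a date counts as reliable only when it came from a structured feed or from one of the trusted guessing methods". the method labels and the function name below are made up for illustration; the thread only gives rss feeds and url date stubs as examples.

```python
# Hypothetical method labels, invented for this sketch.
STRUCTURED_FEED_METHODS = {"rss_feed"}
TRUSTED_GUESS_METHODS = {"url_date_stub"}


def is_date_reliable(guess_method):
    """Return True if the date source is one we would trust.

    Assumption: every other guessing method is treated as unreliable,
    which is what triggers the italic "?" display in the web interface.
    """
    return (
        guess_method in STRUCTURED_FEED_METHODS
        or guess_method in TRUSTED_GUESS_METHODS
    )
```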

in theory, we could add some sort of monitor to look for date spikes within topics (they all come from topics because we only use date guessing for spidered stories). but I'm not sure what we would do once we found a spike other than warn the user. we're already doing the best we can at the date guessing, so there's nothing more that we can do without manual intervention.
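a minimal sketch of what such a monitor could look like, assuming each story is a dict with a `publish_date` string and a `date_is_reliable` flag. the function name and both thresholds are arbitrary assumptions, not anything that exists in the codebase; all it does is produce a list of days to warn about.

```python
from collections import Counter
from statistics import median


def suspicious_date_spikes(stories, spike_factor=5.0, min_unreliable_share=0.5):
    """Return days that look like spikes driven by guessed dates.

    A day is flagged when its story count exceeds `spike_factor` times the
    median daily count AND at least `min_unreliable_share` of that day's
    stories have an unreliable date. Both thresholds are illustrative.
    """
    per_day = Counter(s["publish_date"] for s in stories)
    unreliable_per_day = Counter(
        s["publish_date"] for s in stories if not s.get("date_is_reliable")
    )
    if not per_day:
        return []
    typical = median(per_day.values())
    flagged = []
    for day, count in sorted(per_day.items()):
        unreliable_share = unreliable_per_day[day] / count
        if count > spike_factor * typical and unreliable_share >= min_unreliable_share:
            flagged.append(day)
    return flagged
```

this only supports the "warn the user" outcome; it does nothing to correct the dates themselves.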

-hal

rahulbot commented 5 years ago

Thanks, that's helpful. Will circle back with research folks here.