gwu-libraries / sfm-ui

Social Feed Manager user interface application.
http://gwu-libraries.github.io/sfm-ui
MIT License

Implement support for Twitter 2 Stream API #1104

Closed: dolsysmith closed 1 year ago

dolsysmith commented 2 years ago

Features

To Do

dolsysmith commented 2 years ago

Per Sam Hames (one of the Twarc developers), we should invoke ensure_flattened in the export workflow, not in the harvest workflow. Sam's comment:

"I would hope that you're only using that for analysis purposes though, for data storage of hope you're preserving the original format. An early version of twarc2 had an option to flatten output to a stream of tweet objects, but we removed it because its hard to get right, and means that downstream tools don't have a consistent format to work with."

dolsysmith commented 1 year ago

@adhithyakiran started the sfm_filter_stream branch on sfm-twitter-harvester to address this ticket.

Working:

dolsysmith commented 1 year ago

To implement

To test

dolsysmith commented 1 year ago

Update (2/2/2023):

dolsysmith commented 1 year ago

Update (2/23/2023):

Streaming harvester & exporter are working, though further testing is needed.

There are a couple of issues that I don't think we can resolve without more significant changes to the architecture and data model:

  1. Currently, users can set limits for the "search" and "academic search" harvests, which cap the number of Tweets retrieved per harvest. I tried to implement the same functionality for the streaming harvests. It is possible to make the harvester shut off when the limit is reached, but there is currently no mechanism to signal to the UI that this has occurred. From the user's perspective, the harvester would appear to be still running even after shutdown. That's obviously not desirable, so for the moment, I am not implementing any limits for the streaming harvests. (One approach for a future release might be to modify the message that the streaming harvester sends back to the UI so that it includes an indicator the harvester can set upon shutting down.)
  2. As implemented, each seed created by the user is treated by SFM as a "streaming rule" and set accordingly. Thus, a single streaming harvest can apply multiple rules. However, treating these rules as "seeds" is problematic, since a seed, per SFM's design, is intended to represent a single query/result set, whereas multiple streaming rules can be present in a given result set. (This discrepancy affects functionality such as the ability to select a subset of seeds for export. Such functionality doesn't work when seeds are, in fact, streaming rules, since the exporter as written has no way of separating out individual streaming rules from a set of WARCs. Therefore, I've disabled this functionality for streaming exports.)
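
For reference, the approach floated in item 1 (an indicator in the harvester's status message) could look roughly like the sketch below. This is only an illustration under stated assumptions: `HarvestStatus`, its field names (including `stopped_at_limit`), and `stream_with_limit` are hypothetical and do not reflect SFM's actual message format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class HarvestStatus:
    """Hypothetical status message sent from the harvester back to the UI."""
    harvest_id: str
    status: str = "running"
    tweets_harvested: int = 0
    # Hypothetical indicator: set by the harvester when it stops itself.
    stopped_at_limit: bool = False


def stream_with_limit(tweets, limit, status):
    """Yield tweets until `limit` is reached, then record the shutdown."""
    if limit <= 0:
        status.status = "completed"
        status.stopped_at_limit = True
        return
    for tweet in tweets:
        status.tweets_harvested += 1
        yield tweet
        if status.tweets_harvested >= limit:
            # The UI could read this flag to show the harvest as stopped.
            status.status = "completed"
            status.stopped_at_limit = True
            return


def status_message(status):
    """Serialize the status for the message queue (format is an assumption)."""
    return json.dumps(asdict(status))
```

The UI side would then need to treat a message with the indicator set as the end of the harvest, rather than waiting for a user-initiated stop.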