Implement support for Twitter 2 Stream API

dolsysmith commented 2 years ago

Features

Formerly "filter"
Filtered stream rules allow for the expression of complex queries
Users with "essential" API access may use only 5 filtered stream rules concurrently; "elevated" access allows 25, and "academic research" access allows 1,000
Not all rule operators are available to "essential" and "elevated" users
Before running a stream, you must register one or more rules with the API. The twarc client's add_stream_rules method handles these calls; the client also provides methods for retrieving the currently registered rules and for deleting rules
We will need to provide users with the means of managing their "rules" (seeds) and/or prevent them from creating more rules/seeds than they are allowed per their access level
The twarc client's stream method returns one tweet per iteration (vs. paginated results)
The associated stream rule is returned in the matching_rules field, but this does NOT seem to be included when using ensure_flattened to produce a single dict per Tweet
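A minimal sketch of the quota check described above, using the per-level limits listed in this comment. The function name, the access-level keys, and the idea of counting active seeds as rules are assumptions for illustration, not existing SFM code.

```python
# Concurrent rule limits per access level, as listed above.
RULE_LIMITS = {"essential": 5, "elevated": 25, "academic research": 1000}


def check_rule_quota(access_level, active_seeds):
    """Return (ok, message) for the number of rules a collection would register.

    `access_level` and `active_seeds` are hypothetical inputs; SFM would need
    to store the access level alongside the user's credentials.
    """
    limit = RULE_LIMITS[access_level]
    count = len(active_seeds)
    if count > limit:
        return False, (
            f"{count} rules requested, but '{access_level}' access "
            f"allows only {limit}"
        )
    return True, f"{count} of {limit} rules in use"


ok, message = check_rule_quota("essential", ["cats", "dogs"])
print(ok, message)
```

Running this check in the UI before a harvest starts would let the user see the problem without waiting for an API error.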

To Do

Update twitter_harvester.py, twitter_stream_warc_iter.py and (if necessary) twitter_stream_exporter.py.
Make sure the stream rule is included as part of the Tweet JSON & CSV
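One possible way to carry the rule through to the JSON and CSV is to copy matching_rules from each stream response onto the flattened Tweets afterwards. The sketch below assumes each response has a top-level matching_rules list whose entries have "id" and "tag" keys, and that the flattened Tweets come from ensure_flattened (passed in here as plain dicts so the example stays self-contained). Both helper names are hypothetical.

```python
import copy


def attach_matching_rules(response, flattened_tweets):
    """Return copies of the flattened tweets, each carrying the response's rules."""
    rules = response.get("matching_rules", [])
    enriched = []
    for tweet in flattened_tweets:
        tweet = copy.deepcopy(tweet)  # leave the caller's data untouched
        tweet["matching_rules"] = copy.deepcopy(rules)
        enriched.append(tweet)
    return enriched


def rule_column(tweet):
    """Join rule tags (falling back to ids) into one CSV cell."""
    rules = tweet.get("matching_rules", [])
    return ";".join(r.get("tag") or r.get("id", "") for r in rules)


response = {
    "data": {"id": "1", "text": "hello"},
    "matching_rules": [{"id": "100", "tag": "cats"}, {"id": "101", "tag": ""}],
}
tweets = attach_matching_rules(response, [{"id": "1", "text": "hello"}])
print(rule_column(tweets[0]))
```

Falling back to the rule id when the tag is empty keeps the column usable for rules registered without a tag.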

dolsysmith commented 2 years ago

Per Sam Hames (one of the Twarc developers), we should invoke ensure_flattened in the export, not the harvest, workflow. Sam's comment:

"I would hope that you're only using that for analysis purposes though, for data storage I'd hope you're preserving the original format. An early version of twarc2 had an option to flatten output to a stream of tweet objects, but we removed it because it's hard to get right, and means that downstream tools don't have a consistent format to work with."

dolsysmith commented 1 year ago

@adhithyakiran started the sfm_filter_stream branch on sfm-twitter-harvester to address this ticket.

Working:

harvesting Tweets from the v2 streaming endpoint
saving WARC files

In Progress:

changes to twitter_stream_warc_iter.py to deal with the new JSON object model
count of Tweets not working

Ahead:

Users need to be able to create and manage their streaming rules in the UI.

dolsysmith commented 1 year ago

To implement

Users can add more than one active seed to a filter stream collection (logic currently prohibits this)
stream-2 exporter Docker image / changes to example.docker-compose
Harvester registers streaming rules on the basis of the active seeds
Exporter includes streaming rules in exports (as an additional column)
Harvester/UI handles errors from too many streaming rules
UI allows the user to delete rules (deleting seeds triggers deletion of rules)
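Registering rules from active seeds, and deleting rules when seeds are deleted, amounts to reconciling two sets. The sketch below assumes registered rules are dicts with "id" and "value" keys (the shape the API returns when listing rules) and that each seed holds one rule string. The function name is hypothetical, and the exact arguments expected by the twarc client's add and delete methods should be checked against twarc before wiring this in.

```python
def plan_rule_changes(seed_queries, registered_rules):
    """Work out which rules to add and which rule ids to delete.

    Returns (to_add, to_delete): a list of {"value": ...} dicts for seeds with
    no registered rule, and a list of ids for rules with no matching seed.
    """
    wanted = {q.strip() for q in seed_queries if q.strip()}
    existing = {rule["value"]: rule["id"] for rule in registered_rules}
    to_add = [{"value": value} for value in sorted(wanted - set(existing))]
    to_delete = sorted(
        rule_id for value, rule_id in existing.items() if value not in wanted
    )
    return to_add, to_delete


to_add, to_delete = plan_rule_changes(
    ["cats", "dogs", "cats"],
    [{"id": "1", "value": "dogs"}, {"id": "2", "value": "birds"}],
)
print(to_add, to_delete)
```

Because the plan is computed from sets, duplicate seeds and rules already registered on a previous run produce no extra add calls.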

To test

Restarting harvest after adding/deleting seeds
Adding more seeds than allowed by API limits
Harvest hits API limit for max Tweets per month
Adding duplicate rules
Test features in processing container with harvests from v2.

dolsysmith commented 1 year ago

Update (2/2/2023):

The v2 streaming harvest is working (minimally) in the UI. A harvest can be started and stopped, and Tweets seem to be retrieved.
Currently, any seeds that are active when the harvest is turned on are converted to streaming rules. Do we need to check for duplicates (if the same rules have already been registered on a previous run)? How about handling errors (if a user submits more rules than they are allowed)?
UI form needs adjustment to reflect the new structure of streaming queries.
We're passing a limit parameter (hard-coded at 500) to twarc.stream, but that doesn't seem to have the desired effect. Can we provide the user the ability to set an upper bound to the number of Tweets harvested?
Streaming exporter not working yet. The UI sends a "start" message, but the exporter never receives it.
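On the limit question, one client-side option is to count Tweets in the harvester and stop iterating once the bound is reached, independent of how the limit parameter behaves. The sketch below assumes the stream yields one dict per iteration and that only dicts with a "data" key hold a Tweet. The function name is hypothetical.

```python
def take_tweets(stream, limit):
    """Yield at most `limit` responses that contain a tweet, then stop."""
    if limit <= 0:
        return
    count = 0
    for response in stream:
        if "data" not in response:  # skip payloads that carry no tweet
            continue
        yield response
        count += 1
        if count >= limit:
            break
    # Release the underlying connection if the stream object supports it.
    close = getattr(stream, "close", None)
    if close is not None:
        close()


fake_stream = iter(
    [{"data": {"id": "1"}}, {}, {"data": {"id": "2"}}, {"data": {"id": "3"}}]
)
harvested = list(take_tweets(fake_stream, 2))
print([r["data"]["id"] for r in harvested])
```

The user-supplied bound would replace the hard-coded 500; how the UI learns that the harvest stopped is a separate problem.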

dolsysmith commented 1 year ago

Update (2/23/2023):

Streaming harvester & exporter are working, though further testing is needed.

There are a couple of issues that I don't think we can resolve without more significant changes to the architecture and data model:

Currently, users can set limits for the "search" and "academic search" harvests, which work to curtail the volume of Tweets retrieved per harvest. I tried to implement this functionality for the streaming harvests, but while it's possible to make the harvester shut off when the limit is reached, there's not currently a mechanism to signal to the UI that this has occurred. From the user's perspective, it would appear that the harvester is still running even after shutdown. That's obviously not desirable, so for the moment, I am not implementing any limits for the streaming harvests. (One approach to implementation -- for a future release -- might be to modify the message that the streaming harvester sends back to the UI, including an indicator that can be set by the harvester upon shutting down.)
As implemented, each seed created by the user is treated by SFM as a "streaming rule" and set accordingly. Thus, a single streaming harvest can apply multiple rules. However, treating these rules as "seeds" is problematic, since a seed, per SFM's design, is intended to represent a single query/result set, whereas multiple streaming rules can be present in a given result set. (This discrepancy impacts functionality such as the ability to select a subset of seeds for export. Such functionality doesn't work when seeds are, in fact, streaming rules, since the exporter as written has no way of parsing out individual streaming rules from a set of WARCs. Therefore, I've disabled this functionality for streaming exports.)
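The future-release idea mentioned above (an indicator in the harvester's message to the UI) could look roughly like this. Every field name here, including stopped_by_limit, is hypothetical and does not reflect SFM's actual message format; the UI would also need a handler that reads the flag and marks the harvest as stopped.

```python
import json


def build_status_message(harvest_id, tweet_count, limit=None):
    """Build a status payload with a flag the harvester sets on shutdown.

    All keys are illustrative assumptions, not SFM's real message schema.
    """
    stopped_by_limit = limit is not None and tweet_count >= limit
    return {
        "harvest_id": harvest_id,
        "tweet_count": tweet_count,
        "limit": limit,
        "stopped_by_limit": stopped_by_limit,
    }


message = build_status_message("harvest-1", tweet_count=500, limit=500)
print(json.dumps(message))
```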
