Closed dolsysmith closed 1 year ago
Per Sam Hames (one of the Twarc developers), we should invoke ensure_flattened
in the export, not the harvest, workflow. Sam's comment:
"I would hope that you're only using that for analysis purposes though; for data storage I hope you're preserving the original format. An early version of twarc2 had an option to flatten output to a stream of tweet objects, but we removed it because it's hard to get right, and means that downstream tools don't have a consistent format to work with."
@adhithyakiran started the sfm_filter_stream branch on sfm-twitter-harvester to address this ticket.
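A minimal sketch of what flattening at export time (rather than at harvest time) could look like, assuming the harvester has stored each raw API response as one JSON object per line. The names export_flattened and toy_flatten are hypothetical and exist only for illustration. In practice the flatten argument would presumably be twarc's ensure_flattened (from twarc.expansions), which is not imported here so that the sketch stays self-contained.

```python
import json
from typing import Callable, Iterable, Iterator


def export_flattened(
    raw_lines: Iterable[str],
    flatten: Callable[[dict], list],
) -> Iterator[dict]:
    """Yield one tweet dict per tweet from stored, unmodified API responses."""
    for line in raw_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        response = json.loads(line)
        # Flattening happens only here, at export time; the stored
        # records themselves are never rewritten.
        yield from flatten(response)


def toy_flatten(response: dict) -> list:
    """Hypothetical stand-in for ensure_flattened, for illustration only."""
    data = response.get("data", [])
    return data if isinstance(data, list) else [data]
```

Because the stored records are left in their original format, the export step can be rerun later with a different flattening routine without re-harvesting.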
Working:
Update (2/2/2023):
twarc.stream, but that doesn't seem to have the desired effect. Can we provide the user the ability to set an upper bound on the number of Tweets harvested?
Update (2/23/2023):
Streaming harvester & exporter are working, though further testing is needed.
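On the earlier question of capping the number of Tweets harvested, one possible approach is to wrap the stream iterator rather than rely on the client. The sketch below is an assumption-laden illustration, not the harvester's actual code: take_at_most and stop_event are hypothetical names, and it assumes the streaming client can be told to stop through a threading.Event, which should be verified against the twarc version in use.

```python
import itertools
import threading
from typing import Any, Iterable, Iterator, Optional


def take_at_most(
    stream: Iterable[Any],
    limit: int,
    stop_event: Optional[threading.Event] = None,
) -> Iterator[Any]:
    """Yield at most `limit` items from `stream`, then signal a stop."""
    try:
        # islice stops after `limit` items without pulling an extra one.
        yield from itertools.islice(stream, limit)
    finally:
        # Assumption: the code producing `stream` watches this event
        # and closes the connection once it is set.
        if stop_event is not None:
            stop_event.set()
```

With any iterable standing in for the stream, `list(take_at_most(iter(range(10)), 3))` yields the first three items and leaves the rest unread.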
There are a couple of issues that I don't think we can resolve without more significant changes to the architecture and data model:
Features
add_stream_rules method handles these calls, and also provides methods for retrieving the rules currently registered and for deleting rules)
stream method returns one tweet per iteration (vs. paginated results)
matching_rules field, but this does NOT seem to be included when using ensure_flattened to produce a single dict per Tweet
To Do
twitter_harvester.py, twitter_stream_warc_iter.py and (if necessary) twitter_stream_exporter.py.
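Regarding the matching_rules observation under Features, one possible workaround is to copy the field from the raw streaming response onto each flattened tweet at export time. This is only a sketch under stated assumptions: flatten_with_rules is a hypothetical helper, the flatten argument stands in for ensure_flattened, and the response is assumed to carry matching_rules at the top level alongside data.

```python
from typing import Callable, Iterator


def flatten_with_rules(
    response: dict,
    flatten: Callable[[dict], list],
) -> Iterator[dict]:
    """Flatten one streaming response and re-attach its matching_rules."""
    # Assumption: matching_rules sits at the top level of the response.
    rules = response.get("matching_rules", [])
    for tweet in flatten(response):
        # Work on a shallow copy so the stored response is not mutated.
        enriched = dict(tweet)
        enriched["matching_rules"] = rules
        yield enriched
```

Keeping this step in the exporter means the harvested records remain in their original format, in line with the advice quoted at the top of this ticket.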