Open kerchner opened 3 years ago
I've done some preliminary work on integrating @sebastian-nagel's harvester code side-by-side with existing Twitter v1 REST API harvests and wired some of the harvest types up to sfm-ui. See:
Assumptions include:
data.
So far, on these branches you can:
Remaining work on this includes:
Note that twitter_rest_warc_iter.py still contains `if __name__ == "__main__": TwitterRestWarcIter.main(TwitterRestWarcIter)` and is not yet set up to use the new TwitterRestWarcIter2 class. You can of course edit it to run it, but this will need to be addressed for cases where we use it on the command line.
I occasionally get 400 errors from the API if I use the exact same seed twice in a search. I haven't figured out what might be prompting that, but a new seed will run just fine. Just wanted to report that so you're aware.
Thank you, @lwrubel!
Hi @lwrubel, thanks a lot! I think we'll switch to your branch. We'll share all experiences - but for now: happy holidays!
Hi @sebastian-nagel, we're planning to work on implementing support for Twitter v. 2 this summer, probably starting in July. (We did a previous sprint this semester to identify impacts on the UI and the database models, especially as concerns the new filter stream API.)
Since you had already started on this work, we're wondering if you'd be interested in, and have the bandwidth for, collaborating with us this summer. If so, we could coordinate work on a sprint. If not, you have our gratitude for getting us started!
Best, Dolsy
Hi @dolsysmith, great to hear! I cannot promise that I can take part in the sprint, but I'm happy to have a look at the specification or implementation. In June, we plan to bring one SFM Twitter harvester using the v2 API into production, in order to harvest longer user timelines. But this would not include any new features.
Thanks, @sebastian-nagel -- totally understandable. And we'd definitely be interested to hear about your experience bringing the v2 timeline into production.
I've also tested the changes in the branch twitter-v2.
Two tests are failing
Fixed in gwu-libraries/sfm-twitter-harvester#55
Note that twitter_rest_warc_iter.py ... is not yet set up to use the new TwitterRestWarcIter2 class. You can edit it, of course, to run it, but that will need to be addressed for situations where we use it on the command-line.
Some kind of auto-detection would be helpful here.
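One possible shape for that auto-detection, as a rough sketch only: look at the decoded JSON body of each captured response and guess the API version from its top-level structure. This assumes v1.1 responses are either a bare list of tweets or a dict with a "statuses" key, and that v2 responses wrap results in "data" and/or "meta". The function names detect_api_version and iterator_name_for are hypothetical and not part of the harvester code.

```python
def detect_api_version(payload):
    """Heuristically guess the Twitter API version of a decoded response body.

    Returns 1, 2, or None when the shape is not recognized (e.g. error bodies).
    """
    if isinstance(payload, list):
        # v1.1 timeline endpoints return a bare JSON array of tweets.
        return 1
    if isinstance(payload, dict):
        if "statuses" in payload:
            # v1.1 search wraps tweets in "statuses".
            return 1
        if "data" in payload or "meta" in payload:
            # v2 wraps results in "data", with "includes" and "meta" alongside.
            return 2
        if "id_str" in payload:
            # A single v1.1 tweet object; v2 tweets carry "id" but no "id_str".
            return 1
    return None


def iterator_name_for(payload):
    """Map the detected version to the name of the WARC iterator class to use."""
    version = detect_api_version(payload)
    return {1: "TwitterRestWarcIter", 2: "TwitterRestWarcIter2"}.get(version)
```

On the command line, the first response record in a WARC file could be passed through iterator_name_for to decide which class's main() to call, falling back to an explicit flag when the result is None.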
@dolsysmith, the roadmap document isn't publicly readable. Is this intended?
We've brought the SFM Twitter v2 harvester into production:
from:username
Sorry about that, @sebastian-nagel. Our enterprise version of Google Docs doesn't allow public sharing, but I'll post it in a different format once the team here has had a chance to discuss the roadmap (probably later today).
And thanks for your update!
@sebastian-nagel A couple of questions:
twitter_rest_exporter.py, have you?
1 - yes, 2 - no (I only had a look; it seems that Twarc's json2csv.py does not yet support v2 API results)
Thanks, @sebastian-nagel. We'll be working on the exporter during this sprint; in twarc2, the JSON-to-CSV utility has been separated into a new library, but in my testing at the command line, it works quite well.
I added a condensed version (without all the working notes) of our roadmap for v. 2 to the repo's wiki.
Initial observations on testing the twitter-v2 branch with one of the PRs contributed by @sebastian-nagel:
- limit behavior is a little curious: I got 599 Tweets from the following search: "blacklivesmatter", limit = 500. This happens when using twarc2 at the command line as well.
- twarc.expansions.ensure_flattened on the JSON results from twarc2 (at the command line) yields JSON objects with a __twarc field, which includes the URL of the request. This is missing from the results of running twitter_rest_warc_iter.py on the WARC file harvested by SFM.
- ... ("mobile games" -is:nullcast) disallowed for my access level. This results in a 400 error, which is displayed in the UI, but with no information as to the specific cause.
- ... search_all endpoint without the right access. Got a 403 error ("Forbidden for URL").

> with a __twarc field

That's added by the Twarc client. But SFM captures the HTTP traffic between the client and the Twitter server. Because there's no __twarc in the JSON response, ensure_flattened does not add one to each tweet, cf. expansions.py#L211.
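If the __twarc field is wanted in the WARC-derived output, one option might be to re-attach it to each response page before flattening. The sketch below is only an illustration under assumptions: the request URL and capture time are taken to be available from the WARC record (for example its target URI and date headers), the "retrieved_at" and "version" keys are guesses at what the client writes, and add_twarc_metadata is a hypothetical helper, not part of twarc or SFM.

```python
import copy


def add_twarc_metadata(page, url, retrieved_at, version="unknown"):
    """Return a copy of a v2 response page with a __twarc block attached.

    `page` is the decoded JSON body captured in the WARC record. The input
    dict is not modified. Pages that already carry __twarc are returned as-is.
    """
    if "__twarc" in page:
        return page
    enriched = copy.copy(page)  # shallow copy: only the top level gains a key
    enriched["__twarc"] = {
        "url": url,                    # request URL, as described in the thread
        "retrieved_at": retrieved_at,  # assumed key name
        "version": version,            # assumed key name
    }
    return enriched


page = {"data": [{"id": "1", "text": "hello"}], "meta": {"result_count": 1}}
enriched = add_twarc_metadata(
    page,
    url="https://api.twitter.com/2/tweets/search/recent?query=example",
    retrieved_at="2021-07-01T00:00:00+00:00",
)
```

Passing the enriched page to ensure_flattened should then behave like the command-line case described above, though that would need to be confirmed against the twarc version in use.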
@kerchner @adhithyakiran @sebastian-nagel I pushed a new commit to t1103-exporter-v2 on the sfm-twitter-harvester repo. This version uses the twarc_csv code more efficiently than in my first attempt, so exporting to CSV/Excel/TSV seems to take time comparable to exports for v. 1.
For further discussion/testing:
... on_message method from the base class in order to invoke the twarc_csv methods downstream from the WARC iteration. The amount of duplication feels inelegant, but I couldn't think of another way to do it, short of modifying the base class itself (which seemed less than optimal). But if you can think of any improvements, feel free to try them and/or let me know.

Update on the above: the latest code in the branch now implements a MAX_DATAFRAME_ROWS constant (we may want to make this an environment variable instead) to prevent exporters from eating too much memory. I've set this to 25K, but we can tweak as we test.
Note that for CSV and TSV files, the app will respect the segment size selected by the user for creating the number of files (e.g., 250K, 1M, etc.) by doing an append operation. But the current Excel engine in use in the app doesn't allow appending, so the segment size for Excel exports is limited to the maximum DataFrame size. (We'll need to change the options in the UI.)
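To make the batching-and-append idea concrete, here is a minimal sketch that uses only the standard library instead of pandas and twarc_csv, so it illustrates the pattern rather than the branch's actual code. It reads the row cap from an environment variable (the variable name is an assumption, defaulting to the 25K mentioned above), holds at most one batch in memory, and appends each batch to the same file with the header written once. Splitting output into multiple files by segment size is left out. The names batched and export_csv are hypothetical.

```python
import csv
import os
from itertools import islice

# Default mirrors the 25K value discussed above; the env var name is hypothetical.
MAX_DATAFRAME_ROWS = int(os.environ.get("MAX_DATAFRAME_ROWS", "25000"))


def batched(rows, size):
    """Yield lists of at most `size` items from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def export_csv(rows, path, fieldnames, batch_size=MAX_DATAFRAME_ROWS):
    """Write row dicts to `path` one batch at a time and return the row count.

    The first batch creates the file and writes the header; later batches
    are appended, so only one batch is held in memory at any time.
    """
    written = 0
    for index, batch in enumerate(batched(rows, batch_size)):
        mode = "w" if index == 0 else "a"
        with open(path, mode, newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            if index == 0:
                writer.writeheader()
            writer.writerows(batch)
        written += len(batch)
    return written
```

The same loop would not work for an Excel engine that cannot append, which matches the limitation noted above: there, each batch would have to become its own file.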
Exporter performance comparison on large dataset (~1 million Tweets):
Simplified JSON: 50% longer
I suspect that this relative slowness arises either from the overhead of converting the Tweet JSON to a pandas DataFrame (twarc-csv) or from flattening the original JSON response (twarc2). We can try increasing the MAX_DATAFRAME_ROWS parameter to see if that improves performance.
FYI, some work has been done on the branch https://github.com/sebastian-nagel/sfm-twitter-harvester/tree/twarc2:
Backward-incompatible changes in the v2 API:
TODOs are
As things stand, we likely won't be able to do all the remaining upgrade work in the coming days or weeks. If the work is useful, feel free to pick whatever you want. Thanks!