Use Twitter API v2

sebastian-nagel commented 3 years ago

Fyi, some work is done on the branch https://github.com/sebastian-nagel/sfm-twitter-harvester/tree/twarc2:

successfully harvested user timelines
most unit tests pass

Backward-incompatible changes in the v2 API:

"geocode" (v2 "point_radius") only available for Academic Research and limited to a max. radius of 25 miles
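For seeds that already use the v1 "geocode" parameter, the conversion could look roughly like the sketch below. It assumes the documented v1 order "latitude,longitude,radius" and the v2 operator form point_radius:[longitude latitude radius]; the function name and the exact limit check are illustrative and not taken from the harvester code.

```python
import re

MAX_RADIUS_MI = 25.0       # limit quoted in the comment above
KM_PER_MILE = 1.609344


def geocode_to_point_radius(geocode: str) -> str:
    """Convert a v1 'lat,long,radius' string into a v2 point_radius operator.

    Hypothetical helper for illustration only.
    """
    lat_s, lon_s, radius_s = (part.strip() for part in geocode.split(","))
    # Validate the coordinates; float() raises ValueError on bad input.
    float(lat_s)
    float(lon_s)
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(mi|km)", radius_s)
    if match is None:
        raise ValueError(f"unrecognized radius: {radius_s!r}")
    value, unit = float(match.group(1)), match.group(2)
    miles = value if unit == "mi" else value / KM_PER_MILE
    if miles > MAX_RADIUS_MI:
        raise ValueError(f"radius {radius_s} exceeds {MAX_RADIUS_MI:g} miles")
    # v2 expects longitude first, the reverse of v1's latitude-first order.
    return f"point_radius:[{lon_s} {lat_s} {radius_s}]"


query = geocode_to_point_radius("37.781157,-122.398720,1mi")
```

A seed with a larger radius (for example "37.78,-122.39,30mi") raises ValueError here, so the UI could reject it before a harvest is attempted.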

TODOs are

exporters and stream WARC iterators including unit tests
things to be added (also) in sfm-ui:
- allow to use bearer token for authentication
- use academic research product track
- (optionally) make tweet fields and expansions configurable
- "mentions" as a new type?

So far, we'll likely not be able to do all the remaining upgrade work in the next days/weeks. In case the work is useful, feel free to pick whatever you want. Thanks!

lwrubel commented 2 years ago

I've done some preliminary work on integrating @sebastian-nagel's harvester code side-by-side with existing Twitter v1 REST API harvests and wired some of the harvest types up to sfm-ui. See:

Assumptions include:

We want to keep v1 and v2 harvesting running side-by-side until v1 is no longer available. This results in some duplication of code, but makes it easier to extract v1 harvesting later. We probably want to approach exporting differently, since the v1 exporters will need to remain as long as there are v1 collections in SFM.
We want to use the existing twitter_rest_harvester queue, just adding new routing keys for the v2 harvest types.
We're harvesting and storing the JSON as received from Twitter. Any later processing could then run twarc2's ensure_flattened() to move the expansions inline with the tweet data in data.
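To illustrate the last assumption: a v2 response keeps tweets under data and the expanded objects under includes, and flattening copies the latter into each tweet. The sketch below is a hand-rolled, much-reduced stand-in that only inlines authors. It is not twarc's implementation (in practice one would call twarc.expansions.ensure_flattened), and the sample response is made up.

```python
import copy


def inline_authors(response: dict) -> list:
    """Return the tweets of a v2 response with each author object inlined.

    Simplified illustration; twarc's flattening handles many more expansions.
    """
    users = {u["id"]: u for u in response.get("includes", {}).get("users", [])}
    tweets = []
    for tweet in response.get("data", []):
        tweet = copy.deepcopy(tweet)  # leave the archived response untouched
        author = users.get(tweet.get("author_id"))
        if author is not None:
            tweet["author"] = author
        tweets.append(tweet)
    return tweets


# Made-up response in the shape returned by the v2 search endpoints.
response = {
    "data": [{"id": "1", "author_id": "10", "text": "hello"}],
    "includes": {"users": [{"id": "10", "username": "example_user"}]},
}
flattened = inline_authors(response)
```

Storing the response as received and flattening only at read time keeps the WARC content identical to what Twitter sent.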

So far, on these branches you can:

Add v2 standard search and Academic Research credentials
Create collections and add/update/delete seeds for standard search (twarc2 recent_search) and Academic full archive search (twarc2's search_all), including limiting the number of results.
Run a harvest of both search types.

Remaining work on this includes:

Test coverage in sfm-ui of the new v2 code and more thorough testing, in general.
I added in the code for user_timelines_2 but haven't created corresponding sfm-ui code yet or tried those harvesters out. Two tests are failing, and this is probably because of a change I've introduced.
Looking closely at the harvested content to make sure that the tweet JSON is being recorded correctly and the way we want in order to support exporting.
Writing exporters.
Testing the app & user auth--I used bearer tokens in this work.
And of course, there are filter and sample collections (and possibly more) harvest types to add.

Note that twitter_rest_warc_iter.py still contains if __name__ == "__main__": TwitterRestWarcIter.main(TwitterRestWarcIter) and is not yet set up to use the new TwitterRestWarcIter2 class. You can edit it, of course, to run it, but that will need to be addressed for situations where we use it on the command-line.

I occasionally get 400 errors from the API if I use the exact same seed twice in a search. I haven't figured out what might be prompting that, but a new seed will run just fine. Just wanted to report that so you're aware.

kerchner commented 2 years ago

Thank you, @lwrubel !

sebastian-nagel commented 2 years ago

Hi @lwrubel, thanks a lot! I think we'll switch to your branch. We'll share all experiences - but for now: happy holidays!

dolsysmith commented 2 years ago

Hi @sebastian-nagel, we're planning to work on implementing support for Twitter v. 2 this summer, probably starting in July. (We did a previous sprint this semester to identify impacts on the UI and the database models, especially as concerns the new filter stream API.)

Since you had already started on this work, we're wondering if you'd be interested in, and have the bandwidth for, collaborating with us this summer. If so, we could coordinate work on a sprint. If not, you have our gratitude for getting us started!

Best, Dolsy

sebastian-nagel commented 2 years ago

Hi @dolsysmith, great to hear! I cannot promise that I can take part in the sprint. But I'm happy to have a look at the specification or implementation. In June, we plan to bring one SFM Twitter harvester using the v2 API into production, in order to harvest longer user timelines. But this would not include any new features.

dolsysmith commented 2 years ago

Thanks, @sebastian-nagel -- totally understandable. And we'd definitely be interested to hear about your experience bringing the v2 timeline into production.

sebastian-nagel commented 2 years ago

I've also tested the changes in the branch twitter-v2.

Incremental harvests of user timelines work. However, there is a small difference between v1 and v2: the v1 harvester fetches the latest tweet again on every run even if there are no new tweets, while the v2 harvester just archives an empty response. It looks like the v1 behavior is implemented in Twarc's v1 client.py. I don't know whether this is actually an issue.

Two tests are failing

Fixed in gwu-libraries/sfm-twitter-harvester#55

Note that twitter_rest_warc_iter.py ... is not yet set up to use the new TwitterRestWarcIter2 class. You can edit it, of course, to run it, but that will need to be addressed for situations where we use it on the command-line.

Some kind of auto-detection would be helpful here.
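A minimal sketch of what such auto-detection might look like, assuming it can be decided from the shape of each JSON response body: v1 REST responses are either a bare list of tweets (timelines) or an object with a statuses key (search), while v2 responses are objects with data and/or meta. The function name is hypothetical, and how the result would be wired to TwitterRestWarcIter vs. TwitterRestWarcIter2 is left open.

```python
import json


def detect_api_version(body: str):
    """Guess the Twitter API version (1 or 2) from a JSON response body.

    Returns None when the shape is not recognized (e.g. an error payload).
    """
    payload = json.loads(body)
    if isinstance(payload, list):
        return 1                      # v1 timeline: bare list of tweets
    if isinstance(payload, dict):
        if "statuses" in payload:
            return 1                  # v1 search response
        if "data" in payload or "meta" in payload:
            return 2                  # v2 response, possibly with no results
    return None


versions = [
    detect_api_version('[{"id_str": "1"}]'),
    detect_api_version('{"statuses": [], "search_metadata": {}}'),
    detect_api_version('{"data": [{"id": "1"}], "meta": {"result_count": 1}}'),
    detect_api_version('{"meta": {"result_count": 0}}'),
]
```

Checking only the first response record of a WARC file would probably be enough, since a single harvest does not mix API versions.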

dolsysmith commented 2 years ago

Documentation with roadmap

sebastian-nagel commented 2 years ago

@dolsysmith, the roadmap document isn't publicly readable. Is this intended?

We've brought the SFM Twitter v2 harvester into production:

UI branch twitter-v2
Twitter harvester branch twitter-v2 with some changes, see open pull requests
full user timelines, "emulated" by collection type "Twitter academic search" and the query from:username

dolsysmith commented 2 years ago

Sorry about that, @sebastian-nagel. Our enterprise version of Google Docs doesn't allow public sharing, but I'll post it in a different format once the team here has had a chance to discuss the roadmap (probably later today).

And thanks for your update!

dolsysmith commented 2 years ago

@sebastian-nagel A couple of questions:

Are you able to test with the Academic Research credentials? No one on our team was approved for that level of access, unfortunately.
Just to confirm: you haven't done any work on twitter_rest_exporter.py, have you?

sebastian-nagel commented 2 years ago

1 - yes, 2 - no (only had a look - it seems that Twarc's json2csv.py does not yet support v2 API results)

dolsysmith commented 2 years ago

Thanks, @sebastian-nagel. We'll be working on the exporter during this sprint; in twarc2, the JSON-to-CSV utility has been separated into a new library, but in my testing at the command line, it works quite well.

dolsysmith commented 2 years ago

I added a condensed version (without all the working notes) of our roadmap for v. 2 to the repo's wiki.

dolsysmith commented 2 years ago

Initial observations on testing the twitter-v2 branch with one of the PR's contributed by @sebastian-nagel:

The limit behavior is a little curious: I got 599 Tweets from the following search: "blacklivesmatter", limit = 500. This happens when using twarc2 at the command line as well.
Running twarc.expansions.ensure_flattened on the JSON results from twarc2 (at the command line) yields JSON objects with a __twarc field, which includes the URL of the request. This is missing from the results of running twitter_rest_warc_iter.py on the WARC file harvested by SFM.
Tested a search with an operator ("mobile games" -is:nullcast) disallowed for my access level. This results in a 400 error, which is displayed in the UI, but with no information as to the specific cause.
Tested a search on the search_all endpoint without the right access. Got a 403 error ("Forbidden for URL").
No differences noticed so far between using the bearer token vs. the consumer key/secret.
To Do: Remove the start/end date fields from the regular (non-academic) search seed form. Using these without Academic Research access produces a 400 error.
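On the limit behavior in the first observation: one plausible explanation, assuming the limit is only checked after each complete page of results has been kept, is that a page occasionally comes back slightly short. The simulation below uses made-up page sizes and is not twarc2's actual code; it only shows how a limit of 500 can end at 599 under that assumption.

```python
def harvest(pages, limit):
    """Consume result pages until the running tweet count reaches the limit."""
    count = 0
    for page in pages:
        count += len(page)          # the whole page is kept
        if limit and count >= limit:
            break                   # the limit is only checked between pages
    return count


# Hypothetical page sizes: the API may return fewer tweets than requested.
page_sizes = [100, 100, 100, 100, 99, 100, 100]
pages = ([None] * n for n in page_sizes)
total = harvest(pages, limit=500)   # 499 after five pages, so a sixth is fetched
```

If an exact count matters, the last page would have to be truncated before it is stored, which conflicts with archiving responses exactly as received.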

sebastian-nagel commented 2 years ago

with a __twarc field

That's added by the Twarc client. But SFM captures the HTTP traffic between the client and the Twitter server. Because there's no __twarc in the JSON response, ensure_flattened does not add one to each tweet, cf. expansions.py#L211.

dolsysmith commented 2 years ago

@kerchner @adhithyakiran @sebastian-nagel I pushed a new commit to t1103-exporter-v2 on the sfm-twitter-harvester repo. This version uses the twarc_csv code more efficiently than in my first attempt, so exporting to CSV/Excel/TSV seems to take time comparable to exports for v. 1.

For further discussion/testing:

In the exporter class for v2, I had to duplicate the on_message method from the base class in order to invoke the twarc_csv methods downstream from the WARC iteration. The amount of duplication feels inelegant, but I couldn't think of another way to do it, short of modifying the base class itself (which seemed less than optimal). But if you can think of any improvements, feel free to try them and/or let me know.
For use with larger collections: since twarc_csv uses a DataFrame as an intermediate state, there could be memory issues when producing larger files. One approach would be to disable the options in the UI (for v.2 exports) that allow the user to request more than 100K or 200K items per file (for the Excel, CSV, TSV modes). But I'll look into ways to control that behavior more programmatically. (DataFrames can append to existing CSV/Excel files, but I'm not yet sure where to incorporate that logic into the exporter code.)
The field-limited JSON export is still not working. Will explore what might work after working on the above.

dolsysmith commented 2 years ago

Update on the above: latest code in the branch now implements a MAX_DATAFRAME_ROWS constant (may want to make this an environment variable instead) to prevent exporters from eating too much memory. I've set this to 25K, but we can tweak as we test.

Note that for CSV and TSV files, the app will respect the segment size selected by the user for creating the number of files (e.g., 250K, 1M, etc.) by doing an append operation. But the current Excel engine in use in the app doesn't allow appending, so the segment size for Excel exports is limited to the maximum DataFrame size. (We'll need to change the options in the UI.)
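The append behavior for CSV/TSV can be sketched as follows. This is a standalone illustration that uses the standard csv module in place of twarc_csv and pandas; the function name, the path template, and reading MAX_DATAFRAME_ROWS from an environment variable are assumptions, not the code in the branch.

```python
import csv
import os
from itertools import islice

# Hypothetical environment variable; the branch currently uses a constant.
MAX_DATAFRAME_ROWS = int(os.environ.get("MAX_DATAFRAME_ROWS", "25000"))


def export_segments(rows, fieldnames, path_template, segment_size,
                    batch_size=MAX_DATAFRAME_ROWS):
    """Write rows to CSV segment files, holding at most batch_size rows in memory."""
    rows = iter(rows)
    paths, written = [], 0
    while True:
        # Never read more rows than still fit into the current segment.
        batch = list(islice(rows, min(batch_size, segment_size - written)))
        if not batch:
            break
        new_segment = written == 0
        if new_segment:
            paths.append(path_template.format(len(paths) + 1))
        # Start a fresh file for a new segment, otherwise append to it.
        with open(paths[-1], "w" if new_segment else "a",
                  newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            if new_segment:
                writer.writeheader()
            writer.writerows(batch)
        written += len(batch)
        if written >= segment_size:
            written = 0             # the next batch opens a new segment
    return paths
```

With segment_size=250000 and the default batch size, each file would be built from ten appends of 25K rows. An Excel writer that cannot append would instead have to cap segment_size at batch_size, as described above.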

dolsysmith commented 2 years ago

Exporter performance comparison on large dataset (~1 million Tweets):

Excel: v2 exporter took 50% longer
CSV: 50% longer
Simplified JSON: 50% longer

I suspect that this relative slowness arises from either the overhead of converting the Tweet JSON to a pandas DataFrame (twarc-csv) or from flattening the original JSON response (twarc2). We can try increasing the MAX_DATAFRAME_ROWS parameter to see if that optimizes things.

gwu-libraries / sfm-ui

Use Twitter API v2 #1075