gwu-libraries / sfm-ui

Social Feed Manager user interface application.
http://gwu-libraries.github.io/sfm-ui
MIT License

Use Twitter API v2 #1075

Open kerchner opened 3 years ago

sebastian-nagel commented 3 years ago

FYI, some work has been done on the branch https://github.com/sebastian-nagel/sfm-twitter-harvester/tree/twarc2:

Backward-incompatible changes in the v2 API:

TODOs are

For now, we will likely not be able to do all the remaining upgrade work in the coming days/weeks. In case the work is useful, feel free to pick up whatever you want. Thanks!

lwrubel commented 2 years ago

I've done some preliminary work on integrating @sebastian-nagel's harvester code side-by-side with existing Twitter v1 REST API harvests and wired some of the harvest types up to sfm-ui. See:

Assumptions include:

So far, on these branches you can:

Remaining work on this includes:

Note that twitter_rest_warc_iter.py still contains if __name__ == "__main__": TwitterRestWarcIter.main(TwitterRestWarcIter) and is not yet set up to use the new TwitterRestWarcIter2 class. You can, of course, edit it to run it, but this will need to be addressed for situations where we use it on the command line.
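
A minimal sketch (an assumption, not code on either branch) of how that __main__ block could dispatch to the new class until proper handling is added; the --api-version flag is purely illustrative, and both iterator classes are assumed to be defined in twitter_rest_warc_iter.py as described above:

```python
# Illustrative only: pick the v1 or v2 WARC iterator class from a hypothetical
# command-line flag, mirroring the existing main() call style.
import sys

if __name__ == "__main__":
    if "--api-version=2" in sys.argv:
        sys.argv.remove("--api-version=2")  # keep argv clean for main()
        TwitterRestWarcIter2.main(TwitterRestWarcIter2)
    else:
        TwitterRestWarcIter.main(TwitterRestWarcIter)
```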

I occasionally get 400 errors from the API if I use the exact same seed twice in a search. I haven't figured out what might be prompting that, but a new seed will run just fine. Just wanted to report that so you're aware.

kerchner commented 2 years ago

Thank you, @lwrubel !

sebastian-nagel commented 2 years ago

Hi @lwrubel, thanks a lot! I think we'll switch to your branch. We'll share all experiences - but for now: happy holidays!

dolsysmith commented 2 years ago

Hi @sebastian-nagel, we're planning to work on implementing support for Twitter v. 2 this summer, probably starting in July. (We did a previous sprint this semester to identify impacts on the UI and the database models, especially as concerns the new filter stream API.)

Since you had already started on this work, we're wondering if you'd be interested in, and have the bandwidth for, collaborating with us this summer. If so, we could coordinate work on a sprint. If not, you have our gratitude for getting us started!

Best, Dolsy

sebastian-nagel commented 2 years ago

Hi @dolsysmith, great to hear! I cannot promise that I can take part in the sprint, but I'm happy to have a look at the specification or implementation. In June, we plan to bring an SFM Twitter harvester using the v2 API into production in order to harvest longer user timelines, but this would not include any new features.

dolsysmith commented 2 years ago

Thanks, @sebastian-nagel -- totally understandable. And we'd definitely be interested to hear about your experience bringing the v2 timeline into production.

sebastian-nagel commented 2 years ago

I've also tested the changes in the branch twitter-v2.

Two tests are failing

Fixed in gwu-libraries/sfm-twitter-harvester#55

Note that twitter_rest_warc_iter.py ... is not yet set up to use the new TwitterRestWarcIter2 class. You can edit it, of course, to run it, but that will need to be addressed for situations where we use it on the command-line.

Some kind of auto-detection would be helpful here.
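
One possibility, sketched under assumptions (this is not existing SFM code): detect the API version from the captured JSON itself, since v2 responses wrap tweets in a top-level "data" key (with "includes"/"meta") that v1.1 payloads don't have, and pick the iterator class accordingly:

```python
# Illustrative auto-detection sketch. The import path is assumed; the check
# relies on the v2 response envelope (data/includes/meta), which v1.1 lacks.
import json

from twitter_rest_warc_iter import TwitterRestWarcIter, TwitterRestWarcIter2


def pick_warc_iter_class(response_body):
    """Return the WARC iterator class matching the payload's API version."""
    payload = json.loads(response_body)
    if isinstance(payload, dict) and "data" in payload:
        return TwitterRestWarcIter2  # v2 response envelope
    return TwitterRestWarcIter  # v1.1 response
```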

dolsysmith commented 2 years ago

Documentation with roadmap

sebastian-nagel commented 2 years ago

@dolsysmith, the roadmap document isn't publicly readable. Is this intended?

We've brought the SFM Twitter v2 harvester into production:

dolsysmith commented 2 years ago

Sorry about that, @sebastian-nagel. Our enterprise version of Google Docs doesn't allow public sharing, but I'll post it in a different format once the team here has had a chance to discuss the roadmap (probably later today).

And thanks for your update!

dolsysmith commented 2 years ago

@sebastian-nagel A couple of questions:

  1. Are you able to test with the Academic Research credentials? No one on our team was approved for that level of access, unfortunately.
  2. Just to confirm: you haven't done any work on twitter_rest_exporter.py, have you?

sebastian-nagel commented 2 years ago

1 - yes; 2 - no (I only had a look; it seems that Twarc's json2csv.py does not yet support v2 API results).

dolsysmith commented 2 years ago

Thanks, @sebastian-nagel. We'll be working on the exporter during this sprint; in twarc2, the JSON-to-CSV utility has been separated into a new library (twarc-csv), but in my testing at the command line, it works quite well.
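
For reference, a minimal sketch of using that separated library (twarc-csv) from Python rather than the command line; the DataFrameConverter class and its process() method are written here from memory of the twarc-csv docs, so treat the exact API as an assumption that may differ between versions:

```python
# Sketch only: convert a harvested v2 response page into a flat CSV using
# twarc-csv's DataFrame-based converter (names assumed; values illustrative).
from twarc_csv import DataFrameConverter

converter = DataFrameConverter()

pages = [{
    "data": [{"id": "1", "text": "hello", "author_id": "42"}],
    "includes": {"users": [{"id": "42", "username": "example"}]},
    "meta": {"result_count": 1},
}]

df = converter.process(pages)  # one row per tweet, expansions merged in
df.to_csv("tweets.csv", index=False)
```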

dolsysmith commented 2 years ago

I added a condensed version (without all the working notes) of our roadmap for v. 2 to the repo's wiki.

dolsysmith commented 2 years ago

Initial observations on testing the twitter-v2 branch with one of the PRs contributed by @sebastian-nagel:

sebastian-nagel commented 2 years ago

with a __twarc field

That's added by the Twarc client. But SFM captures the HTTP traffic between the client and the Twitter server. Because there's no __twarc in the JSON response, ensure_flattened does not add one to each tweet, cf. expansions.py#L211.
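
A small illustration of that behaviour (the response body below is a made-up stand-in for a captured v2 payload):

```python
# ensure_flattened merges the "includes" expansions into each tweet, but it only
# copies a "__twarc" key into the tweets if one is present on the response;
# that is not the case for raw HTTP captures like SFM's.
import json

from twarc.expansions import ensure_flattened

captured = json.loads("""
{
  "data": [{"id": "1", "text": "hello", "author_id": "42"}],
  "includes": {"users": [{"id": "42", "username": "example"}]},
  "meta": {"result_count": 1}
}
""")

tweets = ensure_flattened(captured)
print("author" in tweets[0])   # True: the user expansion was merged in
print("__twarc" in tweets[0])  # False: nothing to copy from the captured response
```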

dolsysmith commented 2 years ago

@kerchner @adhithyakiran @sebastian-nagel I pushed a new commit to t1103-exporter-v2 on the sfm-twitter-harvester repo. This version uses the twarc_csv code more efficiently than in my first attempt, so exporting to CSV/Excel/TSV seems to take time comparable to exports for v. 1.

For further discussion/testing:

  1. In the exporter class for v2, I had to duplicate the on_message method from the base class in order to invoke the twarc_csv methods downstream from the WARC iteration. The amount of duplication feels inelegant, but I couldn't think of another way to do it short of modifying the base class itself (which seemed less than optimal). If you can think of any improvements, feel free to try them and/or let me know; see the sketch after this list for one possibility.
  2. For use with larger collections: since twarc_csv uses a DataFrame as an intermediate state, there could be memory issues when producing larger files. One approach would be to disable the options in the UI (for v.2 exports) that allow the user to request more than 100K or 200K items per file (for the Excel, CSV, TSV modes). But I'll look into ways to control that behavior more programmatically. (DataFrames can append to existing CSV/Excel files, but I'm not yet sure where to incorporate that logic into the exporter code.)
  3. The field-limited JSON export is still not working. Will explore what might work after working on the above.
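
On point 1, a hypothetical sketch (not the actual sfm-utils or branch code; all class and method names are illustrative) of a factory-hook refactor that would keep the shared on_message flow in the base class and let the v2 exporter override only the part that differs:

```python
# Hypothetical refactor sketch: the base class keeps the shared export flow and
# exposes a small hook; each subclass supplies only its own table writer.
class BaseTwitterExporter:
    def on_message(self, export_request):
        # ... shared logic: parse the request, resolve WARC files ...
        writer = self.make_table_writer()
        # ... shared logic: iterate tweets, write segments, report status ...
        return writer

    def make_table_writer(self):
        raise NotImplementedError


class TwitterRestExporter(BaseTwitterExporter):
    def make_table_writer(self):
        return "v1-table-writer"  # placeholder for the existing v1 CSV/Excel path


class TwitterRestExporter2(BaseTwitterExporter):
    def make_table_writer(self):
        return "v2-twarc-csv-writer"  # placeholder wrapping the twarc_csv path
```
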
dolsysmith commented 2 years ago

Update on the above: the latest code in the branch now implements a MAX_DATAFRAME_ROWS constant (we may want to make this an environment variable instead) to prevent exporters from consuming too much memory. I've set this to 25K, but we can tweak it as we test.

Note that for CSV and TSV files, the app will respect the segment size selected by the user (e.g., 250K, 1M, etc.) when splitting the export into files, by appending chunks to each file. But the Excel engine currently used in the app doesn't allow appending, so the segment size for Excel exports is limited to the maximum DataFrame size. (We'll need to change the options in the UI.)
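
A minimal sketch of that chunked append behaviour, assuming a plain list of tweet dicts and a pandas DataFrame per chunk (the helper name is illustrative; the real exporter works from a WARC iterator rather than an in-memory list):

```python
# Sketch only: write at most MAX_DATAFRAME_ROWS tweets per DataFrame and append
# each chunk to the CSV segment, so memory use stays bounded.
import os

import pandas as pd

MAX_DATAFRAME_ROWS = 25_000  # current value noted above; may become an env var


def write_csv_in_chunks(tweet_dicts, csv_path):
    for start in range(0, len(tweet_dicts), MAX_DATAFRAME_ROWS):
        chunk = pd.DataFrame(tweet_dicts[start:start + MAX_DATAFRAME_ROWS])
        chunk.to_csv(
            csv_path,
            mode="a",                             # append to the current segment
            header=not os.path.exists(csv_path),  # write the header only once
            index=False,
        )
```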

dolsysmith commented 2 years ago

Exporter performance comparison on large dataset (~1 million Tweets):