stanfordio / gogettr

Public API client for GETTR, a "non-bias [sic] social network," designed for data archival and analysis.
Apache License 2.0

Output format / issues #7

Closed KonradIT closed 3 weeks ago

KonradIT commented 3 years ago

The CLI client outputs JSON objects on new lines to stdout, which might not be ideal for parsing in other programs. Ideally the software would allow a desired data format to be set, for instance TSV/CSV instead of JSON. It could also output directly to a SQLite DB or other file.

Also, the JSON objects could be output with a comma at the end to make parsing in other programs easier.

What do you all think? I could work on it and submit draft PRs.

milesmcc commented 3 years ago

Hey! To be honest, I'm not sure this is the sort of thing we would want to implement. Right now, the output is JSONL, which is a pretty standard and portable format (and in line with twarc's behavior, on which we're modeling the project). Outputting to files is pretty simple as well (just redirect stdout to a file).
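To illustrate why JSONL is easy to consume downstream, here is a minimal sketch of reading it back in Python. The helper name `read_jsonl` and the sample fields (`_id`, `txt`) are assumptions made up for the example, not part of gogettr; the `StringIO` stands in for a file created by redirecting stdout.

```python
import io
import json


def read_jsonl(stream):
    """Yield one parsed object per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in for a file produced by redirecting stdout; the fields are made up.
sample = io.StringIO(
    '{"_id": "p1", "txt": "hello"}\n'
    '{"_id": "p2", "txt": "world"}\n'
)
posts = list(read_jsonl(sample))
```

Because each line is a complete JSON document, no trailing commas or enclosing array are needed, and a consumer can process the stream one record at a time.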

I worry that adding additional output formats would complicate things unnecessarily. Is there a particular use case you're thinking of where gogettr-internal support would be necessary?

KonradIT commented 3 years ago

Ah, TIL about jsonl; the use case that I can see would be dumping new posts to a sqlite DB, or similar, which makes searching/analyzing much easier. Perhaps this could be part of a helper script and not a core function of gogettr.
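A helper script along those lines could look roughly like the sketch below. It is only an illustration: the function name `load_posts`, the table layout, and the assumption that every post carries a unique `_id` key are all hypothetical and would need checking against real gogettr output.

```python
import json
import sqlite3


def load_posts(conn, lines):
    """Store each JSONL line in a `posts` table, keyed by its assumed `_id`."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts (id TEXT PRIMARY KEY, raw TEXT NOT NULL)"
    )
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        post = json.loads(line)
        # "_id" is an assumed key; a later line with the same id replaces the row.
        conn.execute(
            "INSERT OR REPLACE INTO posts (id, raw) VALUES (?, ?)",
            (post["_id"], line),
        )
        count += 1
    conn.commit()
    return count


# Made-up sample lines; in practice these would come from sys.stdin or a file.
conn = sqlite3.connect(":memory:")
sample_lines = [
    '{"_id": "p1", "txt": "first"}',
    '{"_id": "p2", "txt": "second"}',
    '{"_id": "p1", "txt": "first, edited"}',
]
processed = load_posts(conn, sample_lines)
row_count = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
```

Keeping the raw JSON in one column sidesteps schema decisions; individual fields can be pulled out later as needed.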

lxcode commented 3 years ago

Similar to Twitter, there's not really a standard way to convert to CSV because of arrays (e.g. utgs, activity, pinpsts), some of which can have any number of elements — you either need to make a column for each array element or reformat them so they all appear in one column together.

What I typically do when I need a CSV is use visidata to open the jsonl, expand arrays if desired, and then save to CSV (or a SQLite db). We could also have a helper script like twarc's json2csv.py, if you want to try that out.
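As a rough sketch of the "all in one column" option, a converter could join each array into a single delimited cell. This is not twarc's json2csv.py; the function name `jsonl_to_csv`, the `|` separator, and the `_id` and `txt` fields in the sample are assumptions for illustration (only `utgs` is named above).

```python
import csv
import io
import json


def jsonl_to_csv(lines, columns, out, sep="|"):
    """Write selected keys to CSV, joining list values into one cell."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)
    for line in lines:
        if not line.strip():
            continue
        post = json.loads(line)
        row = []
        for col in columns:
            value = post.get(col, "")  # missing keys become empty cells
            if isinstance(value, list):
                value = sep.join(str(v) for v in value)
            row.append(value)
        writer.writerow(row)


# Made-up sample posts; the second one has no "utgs" array at all.
sample_lines = [
    '{"_id": "p1", "txt": "hello", "utgs": ["alice", "bob"]}',
    '{"_id": "p2", "txt": "no tags"}',
]
buf = io.StringIO()
jsonl_to_csv(sample_lines, ["_id", "txt", "utgs"], buf)
csv_text = buf.getvalue()
```

The trade-off is that the joined column has to be split again by whoever reads the CSV, and the separator must not occur in the values.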

KonradIT commented 3 years ago

I ran into an issue where the print(json.dumps()) call in the CLI app outputs escaped Unicode:

image

I fixed it in my branch (https://github.com/KonradIT/gogettr/commit/2fdbed4fa9a2501073c6de2e533e7225773dffc4), is this something that needs to be addressed? I can submit a PR.

lxcode commented 2 years ago

@KonradIT Can you show me before/after of the same post with ensure_ascii True/False? My impression is that it will still be ascii-encoded UTF-8 in the results.

KonradIT commented 2 years ago

image
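For reference, the difference can be reproduced with the standard library alone. The post below is made up for the example and is not taken from GETTR data.

```python
import json

# Hypothetical post containing non-ASCII characters.
post = {"txt": "café ☕"}

# Default (ensure_ascii=True): non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(post)

# ensure_ascii=False: characters are emitted as-is.
readable = json.dumps(post, ensure_ascii=False)

# Both strings are valid JSON and decode to the same object.
same = json.loads(escaped) == json.loads(readable)
```

Either form parses back to identical data; the setting only changes how readable the raw output is and how many bytes it takes.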

lxcode commented 2 years ago

Got it, thanks — in that case, just making not_ensure_ascii the default behavior and not including a CLI flag to toggle it. Any reason not to?

KonradIT commented 2 years ago

Can't see any reason.