KonradIT closed this issue 3 weeks ago
Hey! To be honest, I'm not sure this is the sort of thing we would want to implement. Right now, the output is JSONL, which is a pretty standard and portable format (and in line with twarc's behavior, on which we're modeling the project). Outputting to files is pretty simple as well (just redirect stdout to a file).
I worry that adding additional output formats would complicate things unnecessarily. Is there a particular use case you're thinking of where gogettr-internal support would be necessary?
Ah, TIL about JSONL. The use case I can see would be dumping new posts to a SQLite DB, or similar, which makes searching/analyzing much easier. Perhaps this could be part of a helper script and not a core function of gogettr.
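As a rough illustration of the helper-script idea, here is a minimal sketch that reads JSONL lines and stores each one as a row of raw JSON text in SQLite. The function name, table name, and single-column layout are all hypothetical choices, not anything gogettr provides.

```python
import json
import sqlite3


def load_jsonl_into_sqlite(lines, conn, table="posts"):
    """Store each non-empty JSONL line as one row of JSON text.

    `table` is a hypothetical name and must be a trusted identifier,
    since it is interpolated into the SQL statement.
    """
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (data TEXT NOT NULL)")
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        post = json.loads(line)  # fail early on malformed input
        conn.execute(
            f"INSERT INTO {table} (data) VALUES (?)",
            (json.dumps(post, ensure_ascii=False),),
        )
        count += 1
    conn.commit()
    return count
```

A wrapper script could pass `sys.stdin` as `lines`, so the CLI output can be piped straight into it. Promoting specific fields to their own columns would need assumptions about the post schema, so this sketch keeps the whole object in one column.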
Similar to Twitter, there's not really a standard way to convert to CSV because of arrays (e.g. utgs, activity, pinpsts), some of which can have any number of elements: you either need to make a column for each array element or reformat them so they all appear in one column together.
What I typically do when I need a CSV is use visidata to open the jsonl, expand arrays if desired, and then save to CSV (or a SQLite db). We could also have a helper script like twarc's json2csv.py, if you want to try that out.
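For the "one column together" option, a helper could look roughly like the sketch below. It is not twarc's json2csv.py, just a minimal illustration. The function name, the `;` separator, and every field name other than utgs are assumptions.

```python
import csv
import io
import json


def jsonl_to_csv(lines, out, list_sep=";"):
    """Write JSONL records as CSV, joining each array into a single cell."""
    records = [json.loads(line) for line in lines if line.strip()]
    # Columns are the union of all keys, in first-seen order.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for record in records:
        row = {}
        for key, value in record.items():
            if isinstance(value, list):
                # Arrays of any length collapse into one delimited cell.
                row[key] = list_sep.join(str(item) for item in value)
            elif isinstance(value, dict):
                # Nested objects are kept as JSON text.
                row[key] = json.dumps(value, ensure_ascii=False)
            else:
                row[key] = value
        writer.writerow(row)


# Example with made-up records ("_id" and "txt" are hypothetical fields).
buffer = io.StringIO()
jsonl_to_csv(
    ['{"_id": "p1", "utgs": ["a", "b"]}', '{"_id": "p2", "txt": "hi"}'],
    buffer,
)
```

Joining with a separator loses information if an element itself contains the separator, which is part of why there is no single standard conversion.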
I ran into an issue where the print(json.dumps()) in the CLI app is outputting escaped Unicode:
I fixed it in my branch (https://github.com/KonradIT/gogettr/commit/2fdbed4fa9a2501073c6de2e533e7225773dffc4). Is this something that needs to be addressed? I can submit a PR.
@KonradIT Can you show me before/after of the same post with ensure_ascii True/False? My impression is that it will still be ascii-encoded UTF-8 in the results.
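For reference, the difference can be reproduced with the standard library alone. This is a small sketch using a made-up post (the "txt" field and its value are hypothetical, not real output from the CLI):

```python
import json

# Hypothetical post text, not taken from real data.
post = {"txt": "café ☕"}

# Default behavior (ensure_ascii=True): non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(post)
# escaped is: {"txt": "caf\u00e9 \u2615"}

# With ensure_ascii=False the characters are written through unchanged.
readable = json.dumps(post, ensure_ascii=False)
# readable is: {"txt": "café ☕"}

# Both strings decode back to the same object.
same = json.loads(escaped) == json.loads(readable)
```

Either form is valid JSON and round-trips to the same data, so the choice mainly affects readability and file size. With ensure_ascii=False, whatever writes the string out needs to use an encoding such as UTF-8 that can represent those characters.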
Got it, thanks. In that case, how about just making not_ensure_ascii the default behavior and not including a CLI flag to toggle it? Any reason not to?
Can't see any reason.
The CLI client outputs JSON objects on new lines to stdout, which might not be ideal for parsing in other programs. Ideally the software would allow a desired data format to be set, for instance TSV/CSV instead of JSON. It could also output directly to a SQLite DB or other file.
Also, the JSON objects could be output with a comma at the end to make parsing in other programs easier.
What do you all think? I could work on it and submit draft PRs.
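On the trailing-comma idea, it may help to see how a consumer would typically read one-object-per-line output. The sketch below uses only Python's json module, and the helper name and sample lines are hypothetical. It parses each line as its own JSON document, then checks what happens when a comma is appended.

```python
import json


def parse_jsonl(lines):
    """Yield one parsed object per non-empty line."""
    for line in lines:
        if line.strip():
            yield json.loads(line)


# Made-up sample lines standing in for the CLI output.
records = list(parse_jsonl(['{"a": 1}\n', '{"a": 2}\n']))

# With a trailing comma, a line is no longer a valid JSON document on its own.
try:
    json.loads('{"a": 1},')
    trailing_comma_ok = True
except json.JSONDecodeError:
    trailing_comma_ok = False
```

In this sketch `trailing_comma_ok` ends up False, so a line-by-line parser would have to strip the comma first. Other consumers may behave differently.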