mntnr / name-your-contributors

Name your GitHub contributors; get commits, issues, and comments
MIT License

Expose a `--flat` ndjson option #30

Open RichardLitt opened 7 years ago

RichardLitt commented 7 years ago

Expose a --flat option, where:

Asked for in #16.

tgetgood commented 7 years ago

I'm not clear on what the use case is for this format. The data for a given repo all comes back at once, so there's nothing to stream. We could stream results by repo in an org, but that would be a separate return format.

Or are we talking about the output being fed into a stream? It would be simple enough to take all the extra space out of the JSON printed to the console and add a newline at the end.
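As a rough illustration of that last suggestion, here is a minimal Python sketch of printing one result as a single compact line. It is not the tool's actual code: the `result` shape and the `to_ndjson_line` helper are made up for the example.

```python
import json

def to_ndjson_line(result):
    """Serialize one result as compact, single-line JSON plus a newline."""
    # separators=(",", ":") drops the spaces json.dumps adds by default
    return json.dumps(result, separators=(",", ":")) + "\n"

# Hypothetical result shape, for illustration only
result = {"repo": "example-repo", "commitAuthors": [{"login": "alice", "count": 3}]}
line = to_ndjson_line(result)
print(line, end="")
```

Because `json.dumps` escapes any newline inside a string value, the only literal newline in the output is the terminator, which is what a line-oriented consumer relies on.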

--flat would indicate something entirely different to me, some kind of denesting of data. Maybe we want to go with --nd or something like that.

RichardLitt commented 7 years ago

Fair point. This may already solve @diasdavid's needs, then. I'll point him here and ask.

daviddias commented 7 years ago

An ndjson stream of documents would be 👌🏽 Thanks for considering adding this feature, @RichardLitt :)

RichardLitt commented 7 years ago

@diasdavid Can you clarify - Is the only difference a one-line JSON object with a newline at the end?

daviddias commented 7 years ago

ndjson is "newline-separated JSON objects", and the beauty of it is that it lets you pipe a stream of JSON objects and start parsing them as soon as they are ready. It is also a way of sharding a very large JSON blob into smaller, more manageable units.
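To make the consumer side concrete, here is a small Python sketch of reading such a stream one line at a time. The `read_ndjson` helper and the sample records are hypothetical, and `io.StringIO` merely stands in for a pipe or socket.

```python
import io
import json

def read_ndjson(stream):
    """Yield one parsed object per non-empty line of an ndjson stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

# io.StringIO stands in for a pipe or socket; the records are made up
stream = io.StringIO('{"repo":"a","contributors":2}\n{"repo":"b","contributors":5}\n')
for record in read_ndjson(stream):
    # Each record is usable as soon as its line has arrived
    print(record["repo"], record["contributors"])
```

Since `read_ndjson` is a generator, the caller can act on the first object without waiting for the rest of the stream.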

tgetgood commented 7 years ago

Thanks for clarifying, @diasdavid. I get what ndjson is; my question is more about how this tool is to be used.

Envisaged use case 1: the script is called from a batch script and the individual results are streamed out as ndjson. Something like:

# --ndjson is the flag proposed in this thread; STREAM_SOCKET is a placeholder
for REPO in $(cat repo-list.txt)
do
  name-your-contributors --user "$ME" --repo "$REPO" --ndjson > "$STREAM_SOCKET"
done

Possible use case 2: the --org option, instead of returning a single JSON array, returns an ndjson stream. This would only be for the convenience of the caller, I believe, since in my testing I've exhausted the GitHub rate limit with all query results in memory and it wasn't a problem. Maybe on a very small box, like an EC2 micro, it would be.

Use case 2 would probably be a bad idea. We're not streaming on the inside, and I don't see us rewriting the script from the ground up to do so, so the script will be the bottleneck: it won't return the first ndjson object until it's ready to return them all. So overall I think it will slow things down for the consumer, or be a wash at best.

As far as JSON documents go we're not returning much, so parsing shouldn't be a bottleneck.
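The bottleneck argument for use case 2 can be sketched in Python. This is only an illustration under assumed names: `fetch_repo` is a stand-in for a per-repo query, and the `log` list records the order of events. It does not reflect how the tool is actually structured.

```python
import json

def fetch_repo(name, log):
    """Stand-in for a per-repo query; records when each fetch happens."""
    log.append(f"fetch {name}")
    return {"repo": name}

def emit_buffered(repos, log):
    """Collect every result first, then write all lines at the end."""
    results = [fetch_repo(r, log) for r in repos]
    for result in results:
        log.append("emit " + json.dumps(result))

def emit_streaming(repos, log):
    """Write each line as soon as its result is available."""
    for r in repos:
        log.append("emit " + json.dumps(fetch_repo(r, log)))

buffered, streaming = [], []
emit_buffered(["a", "b"], buffered)
emit_streaming(["a", "b"], streaming)
print(buffered)
print(streaming)
```

In the buffered version the first "emit" only appears after every "fetch", so a consumer of ndjson output gains nothing over a single JSON array. Only the streaming version, which would require the internal rewrite mentioned above, lets the consumer start early.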