Line breaks - Githubissues

bitdivine commented 10 years ago

May I suggest that '\r' also be supported as a line break? I have been using JSON Lines for a while now, and Mac users do sometimes give me files where lines are terminated by '\r'. It is as natural for them as '\n' is for Linux programmers.

I do believe that JSON Lines is a very convenient format and hope to keep it that way for all users.

wardi commented 10 years ago

I thought Macs have been producing Unix-style text files for a while now, no?

wardi commented 10 years ago

I'm resisting this suggestion because it seems like an easy thing to fix on the producing side, but it's a tricky thing to implement on the receiving side. You can't just accept '\r' as an additional separator, or a file from a Windows user would produce a blank (invalid) JSON line inside each '\r\n' pair.
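
To illustrate the '\r\n' problem described above, here is a minimal sketch. The function names `splitNaive` and `splitLines` are hypothetical and not part of any JSON Lines tool; the second one shows one possible way a reader could accept '\r', '\n' and '\r\n' without producing a blank entry inside each Windows line ending, assuming the whole text is already in memory.

```javascript
// Hypothetical helpers, named here for illustration only.

// Naive approach: treat '\n' and '\r' each as a separator on its own.
function splitNaive(text) {
  return text.split(/[\n\r]/);
}

// Alternative: match '\r\n' first so a Windows line ending counts once.
function splitLines(text) {
  return text.split(/\r\n|\n|\r/);
}

const windowsFile = '{"a":1}\r\n{"b":2}\r\n';

console.log(splitNaive(windowsFile)); // [ '{"a":1}', '', '{"b":2}', '', '' ]
console.log(splitLines(windowsFile)); // [ '{"a":1}', '{"b":2}', '' ]
```

The final empty string in the second result is only what remains after the last terminator. A streaming reader would be harder to write, because a '\r' at the end of one chunk may or may not be followed by '\n' in the next.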

bitdivine commented 10 years ago

I do understand. People upload to one of my sites using JSON Lines or CSV, and I do still get \r terminators, which I just sort out then and there. I haven't bothered to check who those people are. I suspect that some people have kept editor settings from older machines. I'm not certain whether my code everywhere else treats \r correctly. I usually use something like:

"a\n\r  b".split(/[\n\r]/).map(function(s){return s.trim()}).filter(function(s){return s.length > 0;}).whatever

You will note that this deals with empty lines from Windows and/or from the end of a file or pipe. Perhaps breaking on '\r' can be considered inessential good practice.

I bumped into your page only today and it prompted me to start putting some of my command-line tools on GitHub. They will need tidying up. They're simple but very useful. I'm not sure that the stats algorithms I usually pipe the data into would have a wide appeal, but the basic manipulators might.

wardi commented 10 years ago

For sure, if you put something up I'd be happy to link to it.

Removing blank lines is also something I'm not keen on because it affects the line count when using text editors or tools like sed.

bitdivine commented 10 years ago

Thanks. I have rewritten the most useful functions and put them here: https://github.com/bitdivine/jline. There are quite a few more to come, but this is a start.

As you are an avid sedder, I have provided the original line number in the underlying code, so that if empty lines, comments and junk are removed you can still track back to the original. Apropos comments, one very important thing in my life is providing proof that numbers are correct. A forensics investigation would collapse if there weren't a very solid trail showing the derivation of the stats. Nothing is worse than discovering a file and not remembering exactly where it came from. So comments are very important. When processing, I prepend a comment at each stage, so I get:

# Annotate records with zzz == step 3
# Filter out XXX == step 2
# Collected from <somehost> on <date> == step 1
{...}
{...}

If there is ever a question of whether to add comments to the formal spec, and the question does invariably come up, can I put my vote in for an initial #? It's simple, and I have masses of such files. For the coder I've provided levels of pedantry, so a given user can decide how strict they want to be about allowing just pure JSON. It would be a shame if JSON Lines went the way of XML, with lots of tools but all too pedantic for real-world use.
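
As a rough sketch of the ideas above (leading `#` comments, a strictness switch, and original line numbers kept alongside each record), something like the following could work. This is not the jline implementation and not part of the JSON Lines spec: the function name `parseJsonLines`, the `strict` option and the returned shape are all assumptions made for illustration.

```javascript
// Hypothetical helper; the name, options and return shape are assumptions for
// illustration and are not taken from jline or from the JSON Lines spec.
function parseJsonLines(text, options) {
  const strict = Boolean(options && options.strict);
  const records = [];
  const comments = [];
  const lines = text.split(/\r\n|\n|\r/);

  lines.forEach(function (raw, index) {
    const lineNumber = index + 1; // 1-based, as a text editor or sed counts
    const line = raw.trim();

    if (line.length === 0) {
      // The empty piece after the final terminator is always tolerated.
      if (strict && index < lines.length - 1) {
        throw new Error("Blank line at line " + lineNumber);
      }
      return;
    }
    if (line[0] === "#") {
      if (strict) throw new Error("Comment at line " + lineNumber);
      comments.push({ lineNumber: lineNumber, text: line.slice(1).trim() });
      return;
    }
    // Keep the original line number so a record can be traced back
    // even though comments and blank lines were skipped.
    records.push({ lineNumber: lineNumber, value: JSON.parse(line) });
  });

  return { records: records, comments: comments };
}

const annotated = '# Filter out XXX\n# Collected from somehost\n{"x":1}\n\n{"x":2}\n';
const parsed = parseJsonLines(annotated);
console.log(parsed.records.map(function (r) { return r.lineNumber; })); // [ 3, 5 ]
```

Calling `parseJsonLines(annotated, { strict: true })` throws on the first comment, which is one way to model the "pure JSON only" end of the pedantry scale.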

bitdivine commented 10 years ago

There were a few bugs after my rewrite but the code is fairly stable now. Have you played with it? Were there any issues or were there any "wouldn't it be nice if"s?

wardi / jsonlines

Line breaks #3