Proposal: treat empty string as a special case (for the possible US-ASCII encoding)

monkeyWzr commented 5 months ago

Hi there @gjtorikian ! Thanks for this awesome gem.

When joining an empty array, it returns an empty string with US-ASCII encoding.

I'm wondering if Commonmarker can handle this case instead of aborting. I am happy to open a pr for that.

''.encoding # => UTF-8
[].join.encoding # => US-ASCII
Commonmarker.to_html([].join) # =>  text must be UTF-8 encoded; got US-ASCII! (TypeError)

Thanks!

kivikakk commented 5 months ago

This feels like it doesn't quite belong in Commonmarker itself; I'd be more inclined to say it's either for user code to fix (i.e. add a .encode("UTF-8") to your 'ary.join' so your arguments to Commonmark are consistent), but even more than that I think the better fix would be in Ruby itself.

This behaviour was made so in 2011. I'm not really sure why that was chosen, and I literally cannot find an archive of the (Japanese-language) ruby-list mailing list post attributed in that commit's commit message anywhere on the internet. Like, I'm sure it must exist, and I've googled a bunch in English and Japanese, and nup. There's the official archive, but it only goes back two years.

So I don't see why not, really, it's a nice compact QoL bump; either everyone who hits this possibility (empty array join) has to call .encode (and possibly only find out about this quirk after getting to production!), or we can detect the empty string explicitly to avoid raising here.

gjtorikian commented 5 months ago

I completely agree this seems like a Ruby bug.

I also have an unhealthy obsession with digital disappearances: here’s the post where the change was discussed. (To find it, I had to first find a reliable place where I could find archives; this website pointed me somewhere; the archive site has vanished, but not from the Wayback Machine.)

My (extremely rough) understanding is that this was done for no real reason. The second paragraph reads (machine translated):

Array#join typically returns a string of characters, so it is used mostly in US-ASCII. ASCII-8BIT seems good, but why ASCII-8BIT?

As the Ruby docs say:

Encoding::ASCII_8BIT is a special encoding that is usually used for a byte string, not a character string.

So if I had to put myself in the mind of a developer working on Ruby 1.9.2 in 2011, I would’ve probably just reached for “the ASCII encoding that makes sense” as the encoding result of [].join. Someone probably picked ASCII-8BIT—“the weird one”—and this commit changed it to a more common ASCII encoding. All of which is to say I don’t see a reason this shouldn’t be UTF-8.

(I’m going to close this as it’s not a Commonmarker bug, but feel free to keep discussing. I’ll follow up on one of the Ruby mailing lists as to what the Ruby behavior here should be.)

gjtorikian commented 5 months ago

Actually, it seems like someone’s jimmies were rustled around this earlier: https://bugs.ruby-lang.org/issues/14863

Given that Ruby authors want this behavior intact, I think users of Commonmarker will just need to remember to pass in UTF-8 strings as always, taking into consideration what the language does to their data. 🤷‍♂️

mvz commented 1 week ago

Maybe since we're now at Ruby 3.3 and that ticket was rejected when Ruby was at 2.4, it can be revisited?

dbussink commented 5 days ago

I think this check is incorrect:

https://github.com/gjtorikian/commonmarker/blob/123da5f617b4ae9ad792072d20fa36e8d5df5565/lib/commonmarker.rb#L38

UTF-8 is a superset of US-ASCII so you can always treat US-ASCII as UTF-8 and that is safe.

gjtorikian / commonmarker

Proposal: treat empty string as a special case (for the possible US-ASCII encoding) #277