linebender / norad

Rust crate for working with Unified Font Object files
Apache License 2.0
44 stars 12 forks source link

Cross-platform line ending serialization approach #162

Closed chrissimpkins closed 3 years ago

chrissimpkins commented 3 years ago

Thoughts on how to approach cross-platform line endings? Should norad write consistent line endings on all platforms (limits diff when norad serialization is used x-platform)? Keep line ending format that it reads in on subsequent serialization (maybe limits diff when norad serialization is used with other serializers)? Allow custom line ending support (flexible and maybe limits diffs across serializers and platforms)?

@belluzj mentioned today that he likes the idea of a runebender config panel with Unix vs. Win line ending style options on UFO serialization. I think that we can expand the approaches used for spacing and XML declaration quote customization to address it if this is desirable. All of these serialization format options might be useful in such a config panel.

Originally posted by @chrissimpkins in https://github.com/linebender/norad/issues/161#issuecomment-899638531

chrissimpkins commented 3 years ago

From Colin in https://github.com/linebender/norad/pull/161#issuecomment-899644633

I'd be most interested in knowing if there is an accepted convention for this. In the general case I would expect a tool to use the correct line endings for the given platform, unless there is an established convention to do otherwise?

chrissimpkins commented 3 years ago

I am not familiar with the ufonormalizer source so my quick read through may have missed something. It looks like it may use Unix style line endings across platforms?

Impl:

https://github.com/unified-font-object/ufoNormalizer/blob/6d2e69ce056c24d60d397881d56f018cad2f2475/src/ufonormalizer/__init__.py#L1502-L1505

Test:

https://github.com/unified-font-object/ufoNormalizer/blob/6d2e69ce056c24d60d397881d56f018cad2f2475/tests/test_ufonormalizer.py#L1921-L1927

madig commented 3 years ago

I think so, too. So, one more writer option to add :grimacing:

chrissimpkins commented 3 years ago

What is the optimal default approach?

madig commented 3 years ago

By default, use platform line ending. By request, use LF.

chrissimpkins commented 3 years ago

By default, use platform line ending. By request, use LF.

@cmyr this is the current consensus opinion on platform-specific line endings. No one else has chimed in to date.

cmyr commented 3 years ago

okay, great. Then the next step is to see what the current behaviour is. Are we writing out CRLF on windows, anywhere? (.glif, .plist?) we'll also need to audit anywhere where we currently write \n directly, and we'll need custom rewriting for features.fea.

chrissimpkins commented 3 years ago

I can confirm that glif files are serialized with platform-specific line endings based on the unit tests in https://github.com/source-foundry/ufofmt/pull/25

chrissimpkins commented 3 years ago

It looks like plist files serialize with platform-specific line endings too.

Win test fail on glif XML file with expected '\n' line ending: https://github.com/source-foundry/ufofmt/runs/3347607373#step:4:147

Win test fail on fontinfo.plist file with expected '\n' line ending: https://github.com/source-foundry/ufofmt/pull/25/checks?check_run_id=3373761003#step:4:147

This is based on norad serialization with the lib @ rev 1a20be6

chrissimpkins commented 3 years ago

I was tinkering with this last night and haven't landed on a good solution yet. The glif and plist data appear to be byte arrays at the stage where we are manipulating the data in the write.rs module. We could just brute force search every char and replace the entire byte array with a new u8 collection type (I suppose allocation of a Vec per UFO file is required?) that has all \r char filtered out.

I think that there should be a more optimal approach. For starters, I think that we can assume that the first line of the file (defined as everything before the first \n across platforms) is all that we need to test to understand if \r\n line endings are used. If there is a mix of line endings in a file, the user's serializer is bugged and I would consider those invalid UFO files in need of repair. So there can be a single test of "is Win style line ending present?" per file and further find/replace or filter in the file is avoided if the test is negative.

Once we know that \r\n is used, what is the most optimal approach to remove the \r line ending bytes throughout the file? I'm not aware of an approach that does not traverse the full file char set through splits/re-joins or indexed finds with filters and allocation of a new mutable data collection type. Is it possible to create an iterator over the byte array that filters on a (|b| b != '\r') test and pass that to the XML serializers?

madig commented 3 years ago

I found this piece of code: https://github.com/derekdreery/normalize-line-endings/blob/master/src/lib.rs

chrissimpkins commented 3 years ago

I found this piece of code: https://github.com/derekdreery/normalize-line-endings/blob/master/src/lib.rs

This source was perfect Nikolaus. I refactored it to use byte arrays and worked it into the File writer / serialization options here. Since we will make Win-platform only \r\n -> \n line ending changes, it is only executed on the Win platform. I'll push a PR so that we can have a look on the Win test runner. I don't have a local Win + Rust system to test. But it looks like it will work and compiles. (Famous last words...)

chrissimpkins commented 3 years ago

170

chrissimpkins commented 3 years ago

The work in #170 identified inconsistent plist and glif line endings on Win. Looking into a way to make Win serialize with \r\n and non-Win to serialize with \n across all UFO file serialization approaches.

khaledhosny commented 3 years ago

By default, use platform line ending. By request, use LF.

Please please don’t do this. It is 2021 already, use LF everywhere and be done with it. It is not like Windows applications will crash if you feed it to them (heck, even notepad supports LF now).

chrissimpkins commented 3 years ago

I don't feel strongly one way or the other on this issue but I think that we should use a consistent approach across all files. Right now there is a mixed approach across plist (\r\n on Win, \n everywhere else), glif (\n everywhere), and fea (defined by whatever was used to write the fea file) files. I see this as less of an issue with Windows reads than one of consistency with other serializers in use out there. FWIW, whatever the approach, we are working towards a simple way to customize the source serialization format so that you can minimize unnecessary XML declaration quote style, indent style, and line ending style diffs when norad is used with other UFO serializers.

khaledhosny commented 3 years ago

I think LF should always be used, I wouldn’t even provide an option to save in a different line ending, but I don’t feel strongly about it as long as it is not the default. Otherwise, one hit dependencies, mysteriously failing CI builds and other stuff that is unnecessary since pretty much anything that one would ever care about will handle LF just fine.

chrissimpkins commented 3 years ago

It doesn't work around those issues. We would need to modify plist serialization (instead of glif serialization) to follow that format. The libs used to serialize have different opinions on the matter. :)

chrissimpkins commented 3 years ago

@khaledhosny do you know if ufonormalizer uses line feeds across platforms?

madig commented 3 years ago

I'm ok with forcing LF everywhere. Feature files would need to be read and written through the LF normalizer iterator, I guess plist would need to be changed?

chrissimpkins commented 3 years ago

Yeah, we can use the approach in #170 and swap the replacement strategy. Let's decide on whether it is ok to introduce the new Unicode-free regex and lazy_static crate deps to achieve it. We should be able to do it slower with the iterator approach and no additional dependencies as a first pass. When someone has the time we can dig into how the regex crate does it in the bytes module and transition to that approach if performance without new dependencies are goals here.

chrissimpkins commented 3 years ago

Closed #170 Opened https://github.com/linebender/norad/pull/172