typelevel / Laika

Site and E-book Generator and Customizable Text Markup Transformer for sbt, Scala and Scala.js
https://typelevel.org/Laika/
Apache License 2.0
410 stars 44 forks source link

Invalid Parsing of Filenames with multiple period characters in 0.19.3 #517

Closed odenzo closed 1 year ago

odenzo commented 1 year ago

What Happens: When using a filename with multiple . the parsing for file extension starts with the first . instead of the last. Expected: .0.1.md correctly parses extension as md Version: 0.19.3

    at laika.io.runtime.ParserRuntime$.$anonfun$run$21(ParserRuntime.scala:108)
    at cats.syntax.EitherOps$.leftMap$extension(either.scala:193)
    at laika.io.runtime.ParserRuntime$.parseAll$1(ParserRuntime.scala:108)
    at laika.io.runtime.ParserRuntime$.$anonfun$run$36(ParserRuntime.scala:167)
    at flatMap @ laika.io.runtime.ParserRuntime$.parseAll$1(ParserRuntime.scala:152)
    at map @ laika.io.runtime.ParserRuntime$.$anonfun$run$36(ParserRuntime.scala:167)
    at map @ laika.io.runtime.ParserRuntime$.$anonfun$run$2(ParserRuntime.scala:59)
    at flatMap @ laika.io.runtime.ParserRuntime$.mergeInputs$1(ParserRuntime.scala:58)
    at flatMap @ laika.io.runtime.ParserRuntime$.$anonfun$run$35(ParserRuntime.scala:166)
    at delay @ fs2.io.file.FilesCompanionPlatform$AsyncFiles.isDirectory(FilesPlatform.scala:262)
    at delay @ laika.sbt.Settings$$anon$1.filter(Settings.scala:46)
    at delay @ fs2.io.file.FilesCompanionPlatform$AsyncFiles.isDirectory(FilesPlatform.scala:262)
    at delay @ laika.sbt.Settings$$anon$1.filter(Settings.scala:46)
    at traverse @ laika.helium.generate.MergedCSSGenerator$.merge(MergedCSSGenerator.scala:28)
odenzo commented 1 year ago

Also occurs in 1.0.0-M3

jenshalm commented 1 year ago

Hello and thanks for reporting.

My main question is: what is your main concern? Is it how the Path API works (e.g. what Path.suffix is reporting) or is your main concern simply the consequence of this behaviour, e.g. that some .md files are not included in the transformation?

The reason I'm asking is because the current behaviour is semi-intended (the way Path.suffix works is intentional, but some .md files being excluded and causing the transformation to fail is not). The API was designed to be working with compound suffixes, e.g. .tar.gz without additional parsing, partially because compound suffixes are heavily used internally (e.g. using .epub.css for styles which are only intended for EPUB output). And for those types of file extensions reporting baseName as the name until the last dot is not really accurate either, so while the behaviour is certainly different than in most other APIs I'm not sure it is really "invalid" given the existence of compound suffixes. Therefore, my preferred way of fixing this would be to prevent the observed issues without changing suffix parsing to "last dot only".

An API-breaking way to do this would be to turn the Path.suffix type from String to a richer type Suffix, which could come with a .last property to give you the expected (more traditional) file extension. Which means this would be possible for an upcoming 1.0 milestone, since we break some APIs anyway.

For 0.19.x what Path.suffix reports has to remain, but the logic around it (meaning how file types are determined), can be fixed by examining the reported suffix more closely.

jenshalm commented 1 year ago

Another option that is somewhat of a hack, but otherwise a pragmatic, low-risk improvement for 0.19.x at least would be to only parse compound suffixes known internally (e.g. .template.html or epub.css) and otherwise fall back to single-segment suffixes. This would not mess as much with user expectations and would fix the issues you ran into.

There'd still be room then to come up with something more polished for 1.x (which is still a few weeks away).

jenshalm commented 1 year ago

Deferring this fix to 1.0.0-M5 as the changes required to address this introduce a high risk for regressions for 0.19.x which will see its final (planned) release in a few days - and 1.0.0-M4 is also nearly ready for release.

I also cannot reproduce anything that produces a stacktrace like the one you posted, do you have any additional detail on how you got that? A transformation with markdown documents containing multiple dots does succeed, the only issue is that the generated output names strip part of the name (e.g. name.1.2.3.md becomes name.html in the output).

There are clean ways to address this in 1.x, partly because we can reduce the need for compound suffixes and then safely change the behaviour of the path parser afterwards. It's also not highest priority, as using names with dots appears to be extremely rare (this is the first report in 11 years).