mquinson / po4a

Maintain the translations of your documentation with ease (PO for anything)
http://po4a.org/
GNU General Public License v2.0
121 stars 58 forks source link

Text.pm regression since v0.58: YAML::Tiny failed to classify line ... #365

Closed jonassmedegaard closed 2 years ago

jonassmedegaard commented 2 years ago

Text.pm chokes on a text containing a markdown horisontal ruler at the very top of the file.

This file was processed correctly until and including po4a v0.57 but fails with v0.58 onwards.

Problem is the introduction of the YAML Front Matter parser: Code wrongly treats leading triple-dash as always meaning beginning of a YAML header, but YAML metadata is optional for markdown text and the proper is therefore to only try parse as YAML but gracefully fallback to parsing as markdown in case of failure.

For now I have applied the workaround of using four dashes, but I am confident that the Text.pm module should be fixed.

mquinson commented 2 years ago

Hello, thanks for reporting.

Is it really an issue? The only other possibility would be to allow po4a to silently fail on malformated YAML headers, and I feel bad about it. The experience show that people only look at po4a error messages when the build fails. I'm afraid of people not noticing important breakages if we allow the YAML header to fail.

If you insist that your file should be allowed to have --- in the header (please tell me why you need it), then I'll add an option to allow you to explicitly ignore the --- in the header and not even try to parse the YAML after it.

Thanks

jonassmedegaard commented 2 years ago

I understand that this is a cornercase: I can only imagine it being rare to have markdown content that begins with a horisontal ruler - and additionally expressing that using exactly three dashes.

I do think, however, that it is not totally weird to have scenarios like mine: a markdown text being a footer, where that footer should have a horisontal ruler as its very first "word". Also, I do think it is not uncommon for users of markdown to use YAML optionally - i.e. include YAML headers in some files but not others. Which means dsabling YAML parsing would be less ideal.

How about by default to parse YAML loosely, and have a strict option that makes it fail instead of gracefully processing as markdown?

I mean, fundamentally markdown was intended as failing gracefully, deliberately choosing markup that would be reasonably readable if not understood or mistyped. (that's the reason I propose loose YAML parsing to be default, not as you implemented till now having YAML parsing be strict)

mquinson commented 2 years ago

I understand your point, but I'm still not convinced. Maintainers don't look at po4a output, even less at the files produced by po4a. I don't think it's safe to go for the loose default. I will extend the error message that you get when the file begins with ---. Take this as a proposal for now, and keep giving your opinion please.

jonassmedegaard commented 2 years ago

What you propose is to continue letting po4a dictate how to write content. I find that ugly - sure, only for a cornercase, but still.

Please if you do not want to implement loose-by-default then at least implement loose-as-option and have the error message mention that option as an alternative.

mquinson commented 2 years ago

I'm 100% OK with the implementation of loose-as-option, and won't close that bug until it's done. But it's not completely trivial either. I guess that one would have to save the lines and references that you get from $self->shiftline() in Text.pm::parse_markdown_yaml_front_matter() and then reinject them with $self->unshiftline() if that's a failure.

But I'm still wondering what kind of data structure I should use to save them. Either a list of vectors (ctn,ref), or a plain list with an alternance of both elements. I'm also wondering whether I should unshift starting from the oldest shifted element, or in reverse order. I guess reverse order.

If "someone" comes up with a patch implementing the loose-by-default (ie, just continuing when the YAML does not parse correctly), I'll play with the po4a option things to implement the explicit-loose (ie, activate the provided behaviour only if explicitly asked by the user).

Does it sound like a plan? ;)

jonassmedegaard commented 2 years ago

Sounds like a plan to me.

I wonder (without volunteering to dive in, however) if it really needs to be this complicated to solve:

Current code does something like this:

  1. Unconditionally call function parse_markdown_yaml_front_matter()
  2. Process any top lines hinted as if maybe perhaps being YAML
  3. Call YAML::Tiny (which fails if in fact it wasn't YAML after all)

Seems to me a simple solution would be to call YAML::Tiny twice, like this:

  1. Try call YAML::Tiny, and if it succeeds then call function parse_markdown_yaml_front_matter()
  2. Process any top lines hinted as if maybe perhaps being YAML (but known ahead that in fact it is)
  3. Call YAML::Tiny (which is known to not fail)

More elegant might be to only call YAML::Tiny once, passing its result to the function - or get rid of the separate function altogether: It is only called once ;-)

mquinson commented 2 years ago

One difficulty is to get the input from where it is (using TransTractor.pm::shiftline()), try to use YAML::Tiny on that content, and if it fails, reinject the data within the TransTractor so that the classical parser gets it.

I was only describing this part "pop the data, but make sure you can reinject it in case of problem" earlier. This is perfectly complementary with what you describe here.

jonassmedegaard commented 2 years ago

Oh, I understand the complication now.

Sorry, if core code has no way of "rolling back" if hitting a dead end, then indeed my "simple approach" is not so simple after all. :-(

mquinson commented 2 years ago

the core code (TransTractor) has a rollback solution, but you have to unshift content+reference in the right order. Not necessarily difficult, but it has to be taken into account.

jonassmedegaard commented 2 years ago

Right. I clearly didn't account for that (I have "only" juggled Perl for ~25 years - haven't familiarized myself with iterators yet).