Feature request: md2po argument to exclude lines based on character matching

mondeja / mdpo

Markdown files translation using GNU PO files

https://mondeja.github.io/mdpo/

BSD 3-Clause "New" or "Revised" License


Feature request: md2po argument to exclude lines based on character matching #227

Closed joelnitta closed 2 years ago

joelnitta commented 2 years ago

Instead of disabling extraction by using comments in the md file, I would like to exclude lines by providing a pattern to md2po (probably some flavor of grep, but at least literal matching).

Example use case: creating a PO file for an md file with pandoc fenced divs, such as this. I would like to be able to exclude all lines starting with three colons (:::).

mondeja commented 2 years ago

I think that a regex would be the pattern used, but I need to think about that because I'm not convinced about parsing the content. It seems a bit overkill for mdpo.

The current workaround is to use events.

Given the file foo.md:

:::::::::::::: foo

Foo

::::::::::::::::

...and a file ignorer.py:

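def on_text(md2po, block, text):
    # Pandoc fenced-div markers start with ':::'; flag the block
    # so that it is not extracted into the PO file.
    if text.startswith(':::'):
        md2po.disable_next_block = True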

If you execute md2po foo.md -e text:ignorer.py::on_text, you'll get as output:

#
msgid ""
msgstr ""

#: foo.md:block 2 (paragraph)
msgid "Foo"
msgstr ""

See the reference for the API exposed by the md2po instance that is passed as the first parameter of the event.
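
The same event can be driven by a regular expression instead of a literal prefix. This is a minimal sketch, assuming the event signature and the disable_next_block attribute behave as in ignorer.py above. The pattern and the file name regex_ignorer.py are illustrative only.

```python
import re

# Illustrative pattern: text that starts with three or more colons.
EXCLUDE_PATTERN = re.compile(r"^:{3,}")


def on_text(md2po, block, text):
    # Same mechanism as ignorer.py, but the decision is made by a regex.
    if EXCLUDE_PATTERN.search(text):
        md2po.disable_next_block = True
```

It would be passed in the same way, for example md2po foo.md -e text:regex_ignorer.py::on_text.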

joelnitta commented 2 years ago

Thanks for the quick reply! At the moment, my workaround for both this ~~and #228~~ is to pre-process the MD file, then generate the PO file from that. The pre-processing step excludes ~~the YAML header and~~ pandoc fenced divs by automatically adding  etc before the relevant lines.

Eventually, it would be preferable for this to become part of md2po so that I don't need an extra file (either ignorer.py or the pre-processed MD).

(edit: sorry, I realized this didn't apply to the YAML header; in that case I post-process the resulting MD file to fix the YAML header)
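
The pre-processing step described in this comment could look roughly like the sketch below. It is not the script actually used. The function name preprocess is hypothetical, and the marker text is an assumption that should be checked against the mdpo commands documentation.

```python
import re

# Assumption: this is the comment that tells mdpo to skip what follows;
# check the mdpo commands documentation for the exact spelling.
MARKER = "<!-- mdpo-disable-next-line -->"

# A pandoc fenced-div line starts with three or more colons.
FENCED_DIV = re.compile(r"^:{3,}")


def preprocess(markdown, marker=MARKER):
    """Return a copy of markdown with marker inserted before each fenced-div line."""
    out = []
    for line in markdown.splitlines():
        if FENCED_DIV.match(line):
            out.append(marker)
        out.append(line)
    return "\n".join(out) + "\n"
```

The PO file would then be generated from the pre-processed text rather than from the original file.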

mondeja commented 2 years ago

Could you clarify what the exact behaviour of this would be? Would the pattern be a matcher for an entire Markdown block or just for part of a block? I'm not really sure what you're asking for. For example:

:::::::::::::: foo
Foo
::::::::::::::::

Should this hypothetical new option ignore the ::: parts of the paragraph? Or only when they are defined in separate paragraphs?

:::::::::::::: foo

Foo

::::::::::::::::

Would a user try to include Markdown syntax in the matcher, for example the - of list items? Because that is impossible to accomplish with the current parser, MD4C:

- foo
- ignore this
- bar

Should I define the value to match with ignore this or with - ignore this?

I see this request as too tied to your use case, and the implementation is not clearly defined. If you can solve the problems stated, I can consider it. Of course, you're always free to open a PR.

joelnitta commented 2 years ago

Sorry if it wasn't sufficiently clear...

I would not try to implement it on the paragraph level, but rather at the level of lines: If a line contains the matched text, it would be excluded from the PO file.

So in that case both

:::::::::::::: foo
Foo
::::::::::::::::

and

:::::::::::::: foo

Foo

::::::::::::::::

would only present Foo to the PO file.

For your second example, let me explain with pseudo-code. I imagine something like this:

md2po input.md --exclude_lines -

And it would exclude all of the lines in

- foo
- ignore this
- bar

because each contains -.
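
The line-level behaviour described in this comment can be sketched in plain Python, outside of mdpo. The function exclude_lines and its regex flag are hypothetical names that mirror the proposed --exclude_lines option. It works on raw text before any Markdown parsing.

```python
import re


def exclude_lines(markdown, pattern, regex=False):
    """Drop every line that contains pattern (literal match by default)."""
    if regex:
        matches = re.compile(pattern).search
    else:
        matches = lambda line: pattern in line
    kept = [line for line in markdown.splitlines() if not matches(line)]
    return "\n".join(kept)


fenced = ":::::::::::::: foo\nFoo\n::::::::::::::::"
print(exclude_lines(fenced, ":::"))  # Foo

items = "- foo\n- ignore this\n- bar"
print(repr(exclude_lines(items, "-")))  # ''
```

Every list item contains the matched character, so nothing from the list would reach the PO file.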

mondeja commented 2 years ago

MD4C (the parser that mdpo uses, based on the CommonMark spec) does not parse line by line but block by block, so I can't implement this, and I have no motivation to create a low-level, line-by-line Markdown parser that maintains the necessary speed.

There is an open PR in MD4C to implement the syntax part of the parsing, but its author doesn't have the motivation to finish and maintain it. As always, PRs are welcome in MD4C, PyMD4C and mdpo.

joelnitta commented 2 years ago

I see, thanks for explaining.

I was under the impression that exclusion / inclusion could be controlled line-by-line because of the existence of both

 or 

and

 or 

Are each of those pairs aliases? In other words, do they all only exclude/enable by block, not by line?

mondeja commented 2 years ago

Yes,  is just syntactic sugar for . See #211

patrickbard commented 2 years ago

> MD4C (the parser that mdpo uses, based on the CommonMark spec) does not parse line by line but block by block [...]

Does that mean if I had a similar request, for block exclusion, it would be a possibility?

I have some files that I want to include only paragraphs and ignore all headers. Is there a better way to do it than adding  before all headers?

mondeja commented 2 years ago

> Does that mean if I had a similar request, for block exclusion, it would be a possibility?

Sure, PRs welcome.
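
Until such a feature exists, the events workaround shown earlier in this thread might be adapted to skip headers. This is only a sketch under two assumptions: that the enter_block event receives the block type as an enum member named H for headers (as in PyMD4C's BlockType), and that disable_next_block behaves as in the ignorer.py example. The file name skip_headers.py is hypothetical.

```python
def on_enter_block(md2po, block, details):
    # Assumption: block is an enum member whose name is "H" for headers,
    # as in PyMD4C's BlockType.
    if getattr(block, "name", None) == "H":
        # Same attribute used in ignorer.py to keep a block out of the PO file.
        md2po.disable_next_block = True
```

Following the invocation shown earlier, this would be run as md2po foo.md -e enter_block:skip_headers.py::on_enter_block.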