Provide content options for HTML attribute extractor

arm1n commented 2 years ago

Hi Lukas,

First of all, thanks for your great piece of software; it works like a charm. I'd like to suggest one enhancement that would avoid the need for custom extractors if it were built into the current implementation, especially since all the required tools are already there: they are used by HtmlExtractors.elementContent.

Would it be possible to perform content normalization on extracted attribute values as well? Even if I work around the missing support in HtmlExtractors.elementAttribute by using HtmlExtractors.elementContent, I still lack the content options when dealing with the textPlural attribute. For this reason it would be great to have these options there as well.

Furthermore, some way to sanitize content would be useful as well. I had a case where JSX expressions in attributes ended up as {'Message'} in the POT string. Offering some kind of callback in the extractor options would provide the flexibility to handle such cases.

What's your take on that? Thanks in advance!

lukasgeiter commented 2 years ago

Hey Armin, could you provide some code examples for the two cases that you mention? Thanks!

arm1n commented 2 years ago

Of course - so I'm using the HTML parsers to grab messages from views via a dedicated <FormattedMessage> component:

<FormattedMessage
  message="
    {{count}}
    multiline
    plural
    message.
  "
  plural="
    {{count}}
    multiline
    plural
    messages.
  "
  count={2}
/>;

Even if I could use HtmlExtractors.elementContent for the singular message and get whitespace, trim and indentation control, I'd be missing it for the plural message, which always has to be an attribute. Since such indentation isn't rendered, devs in my company tend to write long messages with these line breaks, but I need to make sure that they end up in our translation tool without indentation and line breaks.
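
To make the requested normalization concrete, here is a minimal sketch of what it could do to a multi-line attribute value. It is not gettext-extractor's implementation: the helper normalizeAttributeValue and the NormalizeOptions interface are hypothetical, and the option names are assumptions modeled on the whitespace, trim and indentation controls mentioned above.

```typescript
// Hypothetical options, named after the content controls discussed in this thread.
interface NormalizeOptions {
  trimWhiteSpace: boolean;         // strip leading/trailing whitespace of the whole value
  preserveIndentation: boolean;    // keep leading spaces/tabs of each line
  replaceNewLines: false | string; // false keeps line breaks, a string replaces them
}

// Hypothetical helper, not part of any library API.
function normalizeAttributeValue(value: string, options: NormalizeOptions): string {
  let lines = value.split(/\r?\n/);

  if (!options.preserveIndentation) {
    // Remove leading spaces and tabs from every line.
    lines = lines.map((line) => line.replace(/^[ \t]+/, ""));
  }

  const separator = options.replaceNewLines === false ? "\n" : options.replaceNewLines;
  let result = lines.join(separator);

  if (options.trimWhiteSpace) {
    result = result.trim();
  }

  return result;
}

// The plural attribute value from the example above.
const pluralValue = "\n    {{count}}\n    multiline\n    plural\n    messages.\n  ";

const normalized = normalizeAttributeValue(pluralValue, {
  trimWhiteSpace: true,
  preserveIndentation: false,
  replaceNewLines: " ",
});
// normalized is "{{count}} multiline plural messages."
```

With these settings the translation tool would receive a single-line msgid_plural, while the source markup can stay indented.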

The second thing is also somewhat annoying, as some devs write something like:

 <FormattedMessage message={'Message'} />
 <FormattedMessage message={'Message: with whitespace'} />

Such markup, which is valid JSX, ends up in the PO files as msgid "{'Message'}" and is even broken if it contains whitespace: msgid "{'Message".

Now, I know that it's an HTML parser; for this reason, a sanitization callback as an extractor option would be helpful for doing some RegExp work on the value, maybe something like { processValue: (value: string) => value } as a content option. What do you think?
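
As an illustration of the kind of callback meant here, the following sketch unwraps a JSX string-literal expression. The name processValue is only the option proposed in this comment, not an existing gettext-extractor option, and unwrapJsxStringLiteral is a hypothetical helper.

```typescript
// Signature of the proposed (hypothetical) callback option.
type ProcessValue = (value: string) => string;

// Hypothetical sanitizer: turns {'Message'} or {"Message"} into Message
// and leaves every other value unchanged.
const unwrapJsxStringLiteral: ProcessValue = (value) => {
  // Opening brace, a quote, the text, the same quote, closing brace.
  const match = /^\{\s*(['"])([\s\S]*)\1\s*\}$/.exec(value.trim());
  return match ? match[2] : value;
};

const unwrapped = unwrapJsxStringLiteral("{'Message'}");
const withSpaces = unwrapJsxStringLiteral("{'Message: with whitespace'}");
const plain = unwrapJsxStringLiteral("Plain message");
```

A callback like this presumably cannot repair the truncated {'Message case, since an unquoted HTML attribute value ends at the first whitespace and the callback would only ever see the part before it.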

arm1n commented 2 years ago

@lukasgeiter ping :) I don't want to stress you, I just want to know how to proceed: if you're willing to accept changes, I'd try to work on a PR; if not, I'll go with a custom extractor. Thanks for your response!

lukasgeiter commented 2 years ago

Thanks for the details and apologies for my late response.

I agree that it would make sense to have content options for attributes as well. Feel free to implement this and open a PR 🙂

The second issue I would rather address by properly supporting JSX. Using the HTML parser for JSX is always going to cause issues like this and if you want to work around them I suggest you do so using a custom extractor.

arm1n commented 2 years ago

Thanks for the quick reply - okay, then I'll try to incorporate the content options into HtmlExtractors.elementAttribute and also HtmlExtractors.elementContent (plural) and draft a PR.

I totally agree that it would be best to have a dedicated JSX extractor to avoid such problems. So if I understand you correctly, you plan to add one in the future? I can circumvent the problem above with an ESLint rule that disallows expressions containing only a string literal; that way I don't have to go the custom extractor route, especially since you plan to have a JSX extractor in the future :)
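
Such a lint rule could look roughly like this, assuming eslint-plugin-react is in use (the thread does not name a plugin). Its react/jsx-curly-brace-presence rule reports props written as {'text'} where a plain string attribute would do. The constant name jsxLintRules is only illustrative.

```typescript
// Assumption: eslint-plugin-react is installed and registered in the ESLint config.
// This object would go into the "rules" section of that config.
const jsxLintRules = {
  // Report message={'Message'} and require message="Message" instead;
  // children are left alone because only attributes matter here.
  "react/jsx-curly-brace-presence": ["error", { props: "never", children: "ignore" }],
} as const;
```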

lukasgeiter commented 2 years ago

I would like to implement a JSX extractor at some point (along with many other improvements). That said, I'm currently pretty busy with other things in my life so I wouldn't hold my breath.

lukasgeiter commented 2 years ago

Released with v3.6.0

lukasgeiter / gettext-extractor

Provide content options for HTML attribute extractor #56