j0k3r / graby

Graby helps you extract article content from web pages
MIT License

Question for wrap_in() #340

Open HolgerAusB opened 7 months ago

HolgerAusB commented 7 months ago

A question about wrap_in(), which was introduced here: https://github.com/j0k3r/graby/pull/262

I love it! @Kdecherf, is there a reason why wrapping is only allowed with blockquote, p and div? Or did I just misread the code? It could also be helpful for any other tag pair, like b, strong, em, h2 or even foobar. If the wrapping is done before most other steps, I could then strip or string_replace 'foobar', which is currently sometimes difficult because wallabag automatically strips span tags.

@fivefilters could we please have this COOOOL feature for FTR and P2K, too?

I used it here:

Kdecherf commented 7 months ago

Hello @HolgerAusB,

I added an allow-list in order to respect HTML semantics: blockquote, p and div accept flow content, whereas b, strong and similar tags only accept phrasing content, see:

HolgerAusB commented 7 months ago

@Kdecherf Understood. We can't set attributes for the wrap, can we? Like class, id, style... ok, last one would be nasty, I know.

HolgerAusB commented 7 months ago

@Kdecherf Another idea would be a new directive, wrap_text_in(<tag>): <xpath>, which would wrap only the text() nodes found within the matching XPath in <strong>, <em>, <h2> or many other tags.

Example

source:

<div class="foobar bold column stress">
    <p>
        Lorem ipsum dolor sit amet, consectetur adipisici elit...
    </p>
</div>

site-config:

wrap_text_in(em): //div[contains(@class, 'stress')]
wrap_text_in(strong): //div[contains(@class, 'bold')]

results in:

<div class="foobar bold column stress">
    <p>
        <em>
            <strong>
                Lorem ipsum dolor sit amet, consectetur adipisici elit...
            </strong>
        </em>
    </p>
</div>
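
To make the proposal above more concrete, here is a minimal Python sketch of what such a directive might do. It is not Graby code (Graby is written in PHP) and wrap_text_in does not exist in Graby; the function names `wrap_text_in` and `div_class_contains` are hypothetical. As a further assumption, the XPath `//div[contains(@class, '...')]` is approximated by a plain substring check on the class attribute of div elements, because the standard-library ElementTree module does not support `contains()`.

```python
import xml.etree.ElementTree as ET


def div_class_contains(needle):
    """Hypothetical stand-in for //div[contains(@class, needle)]."""
    return lambda el: el.tag == "div" and needle in el.get("class", "")


def wrap_text_in(root, tag, matches):
    """Wrap each non-blank text node inside matching elements in <tag>."""
    targets = [el for el in root.iter() if matches(el)]
    for target in targets:
        # Snapshot the subtree first so wrappers created below are not revisited.
        for el in list(target.iter()):
            # Text that directly follows el's opening tag.
            if el.text and el.text.strip():
                wrapper = ET.Element(tag)
                wrapper.text, el.text = el.text, None
                el.insert(0, wrapper)
            # Text that follows a child's closing tag (ElementTree calls it "tail").
            for child in list(el):
                if child.tail and child.tail.strip():
                    wrapper = ET.Element(tag)
                    wrapper.text, child.tail = child.tail, None
                    el.insert(list(el).index(child) + 1, wrapper)
    return root


# The example from this comment, with whitespace removed for readability.
source = '<div class="foobar bold column stress"><p>Lorem ipsum</p></div>'
root = ET.fromstring(source)
wrap_text_in(root, "em", div_class_contains("stress"))
wrap_text_in(root, "strong", div_class_contains("bold"))
print(ET.tostring(root, encoding="unicode"))
# <div class="foobar bold column stress"><p><em><strong>Lorem ipsum</strong></em></p></div>
```

Applying the two rules in order reproduces the nesting shown above: the second rule wraps the text that now sits inside em. One limitation of this sketch: if a matching element is nested inside another matching element, its text would be wrapped twice within a single call.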