j0k3r / graby

Graby helps you extract article content from web pages
MIT License

Strip XPath not working for some reason #288

Closed. bwa- closed this issue 2 years ago

bwa- commented 2 years ago

I cannot get stripping a simple element from the DOM to work on this site:

https://www.heise.de/news/Minijobs-Co-Regierung-verzichtet-vorerst-auf-digitale-Arbeitszeiterfassung-6522113.html?view=print

The element I'm trying to remove is the one with the XPath //a-collapse. For some reason though, it is never matched.

I noticed that this element is actually stripped in the default config for that site: https://github.com/j0k3r/graby-site-config/blob/master/heise.de.txt#L73

That does not work, however. I've looked at the ContentExtractor code and could not find any problem with it, at least as far as I understand it. I've tried changing the XPath to all of the following, none of which worked:

Finally, I tried replacing the dash character like this: replace_string(a-collapse): acollapse. That replacement does work (I changed the extractor to log the $html variable to check), but it does not change the outcome: the element is still not stripped.

Kdecherf commented 2 years ago

Are you sure that a-collapse is still in the output? As this element is not a standard tag, it should be stripped by Readability before reaching Graby's stripping phase.
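
To make the ordering concrete, here is a minimal Python sketch of a two-stage pipeline. It is not Graby's or Readability's actual code: STANDARD_TAGS, drop_nonstandard, strip_xpath and the sample PAGE are all hypothetical, and the sketch assumes the clean-up stage removes unknown elements together with their children. It only shows why a strip rule that matches the raw markup can match nothing once an earlier stage has already dropped the element.

```python
import xml.etree.ElementTree as ET

# Hypothetical whitelist standing in for "standard" tags.
# This is NOT Readability's real list, only an assumption for the sketch.
STANDARD_TAGS = {"article", "div", "p", "a", "h1"}

# Made-up page fragment containing a custom element.
PAGE = (
    "<article><h1>Title</h1>"
    "<a-collapse><p>Related links</p></a-collapse>"
    "<p>Body text</p></article>"
)


def drop_nonstandard(root):
    """Stage 1 (assumed): remove every element whose tag is not whitelisted."""
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag not in STANDARD_TAGS:
                parent.remove(child)


def strip_xpath(root, xpath):
    """Stage 2: remove elements matching the strip rule; return the match count."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    matches = root.findall(xpath)
    for element in matches:
        parent_of[element].remove(element)
    return len(matches)


# ElementTree needs the relative form ".//a-collapse" rather than "//a-collapse".
raw = ET.fromstring(PAGE)
print(strip_xpath(raw, ".//a-collapse"))      # rule applied to the raw markup

cleaned = ET.fromstring(PAGE)
drop_nonstandard(cleaned)
print(strip_xpath(cleaned, ".//a-collapse"))  # same rule after the clean-up stage
```

Under these assumptions the first call finds one match and the second finds none, which is the symptom described in this issue: the rule itself is fine, but the element is gone before the rule runs.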

bwa- commented 2 years ago

Ah, interesting. I wasn't aware that non-standard elements are filtered out before Graby's own stripping even runs. I changed the config to target a tag that is actually present in the article as it appears in wallabag itself, and that did it.
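
For anyone hitting the same problem, one way to check which tags are actually left to strip is to list the element names in the extracted content before writing a rule. The sketch below uses only Python's standard library; TagCollector, tags_in and the sample extracted string are hypothetical, and how you obtain the extracted HTML (for example by copying it out of wallabag) is left open.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the name of every start tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)


def tags_in(html):
    """Return the sorted tag names present in an HTML fragment."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return sorted(collector.tags)


# Made-up example of already extracted article content.
extracted = '<div><h1>Title</h1><p>Body <a href="#">link</a></p></div>'

present = tags_in(extracted)
print(present)
print("a-collapse" in present)  # a strip rule for an absent tag cannot match
```

If the tag you want to strip does not show up in that list, the rule has to target something that does.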

I've tried to find out if the docs (https://doc.wallabag.org/en/user/errors_during_fetching.html) make mention of this processing order and couldn't find anything. Do you happen to know if the docs are hosted on github as well? If so, I'd like to make a pull request with a small addition to the docs.

Thanks! :+1:

Kdecherf commented 2 years ago

Do you happen to know if the docs are hosted on github as well? If so, I'd like to make a pull request with a small addition to the docs.

Sure, PRs to the wallabag docs are welcome here: https://github.com/wallabag/doc