[Bug Report] Unable to use XPath functions in scene scraper

KlfJoat commented 2 years ago

Describe the bug

I'm trying to write a scraper for my favorite niche site and I'm running into a few problems.

I am unable to use XPath functions like concat(), string(), normalize-space(), etc., in an xPathScrapers nodeSelector. When I do, none of the XPath even shows up in the Trace logs--not as an error or partial. It's just silently dropped.

1. I want to concat() two XPath expressions together for the scene Details--a slugline/tagline + \n + the full scene summary.
2. I want to string() the output of an XPath expression to get the full string-value inside the element. The scene summaries at this site sometimes link to other related scenes or use arbitrary xhtml like <cite>, and text() chokes at the first opening tag, cutting off the description.
3. I want to normalize-space() another element (the one with the Date and Duration) that has wild tabs and whitespace in it for no discernible reason.
4. I want to be able to use these in combination. For example, I want to combine #1 and #2 above.

To Reproduce

Steps to reproduce the behavior:

1. Given the following html

<div id="description">
<p>Sam is the best friend of Jolie (from <cite>The Pain</cite>). She gives her view on the world.</p>
<p id="data">31 May 2022 •
            640x
                480 •  26 minutes
                            • DOWNLOAD: 350.0MB
</p>
</div>

2. Start with a working XPath for Details in the scraper
```
Details: //div[@id='description']/p[1]/text()
```
3. Update & run scraper. The XPath returns
```
Sam is the best friend of Jolie (from
```

4. Wrap the XPath in string()

Details: string(//div[@id='description']/p[1])

5. Update & re-run scraper
6. Look at scraper pop-up and see no Details. Look at Trace logs and see no XPath for Details.

Expected behavior

At step 6 above, I expect to see Details with the contents

Sam is the best friend of Jolie (from The Pain). She gives her view on the world.

Obviously that's one example. Let me know if you want more.

Stash Version: (from Settings -> About): v0.16.1, build hash 8e222ae3

MrX292 commented 2 years ago

@KlfJoat

      Details:
        selector:  //div[@id='description']/p[1]/text()
        concat: " - "

should work if you want to use concat. And what is the site?

bnkai commented 2 years ago

From a quick look, it seems that the library we use for XPath parsing may have an issue with the string function. For your specific case, though, I would recommend using Stash's post-processing functions instead, as they are easier (using the XPath union operator | together with Stash's concat post-process action should do what you need). If you haven't looked, the in-app manual lists all available functions and post-process actions at https://github.com/stashapp/stash/blob/develop/ui/v2.5/src/docs/en/ScraperDevelopment.md and the community repo has a lot of scrapers that you can have a look at. For more info, either paste the scraper you are working on or join our Discord server and ask in #the-scraping-initiative

KlfJoat commented 2 years ago

(FYI, before anyone mentions it, I have limitations that prevent me from using Python on the host where I'm running Stash, so using a script to overcome the lack of proper XPath availability isn't an option.)

Thanks for the replies.

> should work if you want to use concat

> a use of the xpath union operator | and stash's concat post process action should do what you need

When I set Details to...

Details: 
  selector: (//div[@id='tagline']/text() | //div[@id='description']/p[1]/text())
  concat: "\n"

I get this output.

Her friend thinks everything is fine.
Sam is the best friend of Jolie (from 
). She gives her view on the world.

This solves my #1 issue above. But it complicates my #2 issue by adding in a newline where the <cite></cite> tag would be. By the way, if I had #3 available to me, then I could run it through normalize-space() to fix the extra \n (though it wouldn't give me the contents of the <cite>).

> and what is the site?

I'd rather not post the name publicly. It is a niche site that doesn't appear on any of the free or "free" sites bc it's so niche and bc they aggressively police their copyright. And while I've paid for literally every single video on their site, I'm worried that they may see my use of Stash as somehow a threat and close my account despite the fact that everything I'm doing is ethical. I realize that can make it hard for you to help me, but I might show up in the Discord channel and request help one day.

> it seems that the library we use for xpath parsing may have an issue with the string function

@bnkai I'm not a Golang person. Could you please link me to the file or module or library where you call the XPath parser, and the parser itself? If I know what's possible or not, I may be able to work around it!

> If you haven't looked, the in-app manual lists all available functions/pp actions ... and the community repo has a lot of scrapers that you can have a look at. For more info either paste the scraper you are working on or join our discord server and ask in #the-scraping-initiative

Thanks. I've looked at the in-app manual, the CommunityScrapers, and other sources to get me this far. Aside from the issues I opened this bug for, my draft scraper works on like 95% of the vids (the remaining 5% are early videos whose pages don't match the current framework in ways that even break on their own web site--no trailer thumbnail???). As mentioned above, I'm likely to dip into the Discord for help one of these days when I can.

scruffynerf commented 2 years ago

  selector: (//div[@id='tagline']/text() | //div[@id='description']/p[1]/text())
  concat: "\n"

yields this output:

Her friend thinks everything is fine.
Sam is the best friend of Jolie (from 
). She gives her view on the world.

So...

1) Why not concat with something other than \n, say "%%", so that you first concat everything into one block of text? You can fix it up with a regex later.
2) The cite is lost because text() sees the <cite> as not part of the p, so loosen your selector a bit. Hopefully the cite will then be included, or you can just include the cite in your selectors.
3) Now replace the %% with a \n. You can make the regex grab doubled separators too (%%%% via (%%)+) and replace each run with a single \n, so you don't end up with a doubled \n\n.

selector: (//div[@id='tagline']/text() | //div[@id='description']/p[1]/text()| //div[@id='description']/cite/text())
concat: "%%"
postProcess:
          - replace:
              - regex: "(%%+)"
                with: "\n"

KlfJoat commented 2 years ago

> why not concat with a non \n (you can do a regex later and fix this), say "%%", first off, so you concat a block of text.

> the cite is lost because the text() sees the <cite> as non p, so loosen your selector a bit... hopefully the cite will be included, or you can just include the cite in your selectors

> Now replace the %% with a \n and you can make it grab doubled %%%% (%%)+ and replace with \n now, and not have a doubled \n\n
selector: (//div[@id='tagline']/text() | //div[@id='description']/p[1]/text()| //div[@id='description']/cite/text())
concat: "%%"
postProcess:
          - replace:
              - regex: "(%%+)"
                with: "\n"

Doing that returns no Details per the Debug log.

Loosening the Selector's p[1] to p as you sort of suggest in your #2...

selector: (//div[@id='slugline']/text() | //div[@id='description']/p/text() | //div[@id='description']/p/cite/text())

Returns way too much of the wrong stuff in the wrong order.

Her friend thinks everything is fine.
Sam is the best friend of Jolie (from 
). She gives her view on the world.
31 May 2022 • 640x 480 •  26 minutes • DOWNLOAD: 350.0MB
The Pain

Remember, there are TWO p blocks inside of <div id="description">. If I do not specify p[1], then I get the <p id="data"> tag as well, which is not relevant here.

I appreciate that you're trying to work around not having a fully working XPath parser. But string(), concat(), and normalize-space() are the ways to handle the problems I'm having. They are how I handle similar situations in my day job. It's possible that with enough time and effort I can figure out a string of concat: and postProcess: that will replicate those functions. But I don't have the time to investigate that right now.
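
For reference, the whitespace cleanup that normalize-space() would do on the Date and Duration paragraph can probably be approximated with Stash's replace and parseDate post-process actions. The fragment below is an untested sketch, not a confirmed fix: it assumes the `<p id="data">` markup from the bug report, assumes the date always comes first in that paragraph, and sits under the scene mapping of an xPathScrapers entry.

```yaml
# Hypothetical fragment; goes under `scene:` in an xPathScrapers entry.
Date:
  # Select the element itself rather than its text() children.
  selector: //div[@id='description']/p[@id='data']
  postProcess:
    - replace:
        # Collapse every run of tabs, newlines and spaces into one space,
        # which is the main effect of normalize-space().
        - regex: '\s+'
          with: " "
        # Keep only the leading "31 May 2022" style date (assumed format).
        - regex: '^\s*(\d{1,2} \w+ \d{4}).*$'
          with: "$1"
    # Go-style reference layout matching "31 May 2022".
    - parseDate: 2 January 2006
```

The regexes are single-quoted so that YAML passes the backslashes through unchanged. The replace entries run in the order listed, so the date extraction sees the already-collapsed string.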

KlfJoat commented 1 year ago

Checking in on this. I know I've been absent for a while. I re-checked with Stash v0.18.0 Build hash: 8a649f02 and the string() XPath still doesn't work for me.

Should I try asking in the Discord? Or is there a bounty mechanism? I'm sure this would make life easier for more scrapers than just mine.

DogmaDragon commented 1 year ago

> Should I try asking in the Discord?

You can try and ask in #scrapers.

> Or is there a bounty mechanism? I'm sure this would make life easier for more scrapers than just mine.

Bounties are handled through Open Collective.

KlfJoat commented 1 year ago

For posterity...

I went into the Discord and @bnkai informed me...

> Use //div[@id='description']/p[1]. Stash by default gets the inner text from elements.

Running a quick check, that gives me what I want. No need to use string(). And I should be able to use XPath | to concat that with the slugline/tagline.
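
For anyone landing here later, the resolution might look roughly like this in a scraper file. This is a sketch only and has not been run against the real site: the scraper name `sceneScraper` is made up for illustration, and the `tagline` id is taken from the selector posted earlier in this thread.

```yaml
xPathScrapers:
  sceneScraper:            # hypothetical scraper name
    scene:
      Details:
        # No text() step: select the elements themselves so that Stash
        # returns their inner text, including the text inside <cite>.
        selector: //div[@id='tagline'] | //div[@id='description']/p[1]
        # Join the two matches with a newline.
        concat: "\n"
```

Because the selector keeps `p[1]`, the `<p id="data">` paragraph stays out of Details. The output posted earlier in the thread suggests union results do not always come back in document order, so it is worth checking that the tagline really lands first.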

stashapp / stash

[Bug Report] Unable to use XPath functions in scene scraper #2818