pierre-delecto / stash_theporndb_scraper

A python script to scrape Stash data from thePornDB
MIT License
50 stars, 31 forks

More detail when trying to scrape ambiguous scenes #3

Closed: tarzanboy76 closed this issue 4 years ago

tarzanboy76 commented 4 years ago

At the moment, the interface displays only the title of the scene currently being searched. If the title is fairly generic (e.g., "Hardcore"), it can be difficult to match it against the options returned from TPDB. It would help to also display the site/studio name, the date and, ideally, the performers.

Similarly, the list returned from TPDB doesn't include the performers. Including them might make matching easier.

pierre-delecto commented 4 years ago

Current behaviour is to display the current query (as passed to TPDB) against the site, date, and titles returned from TPDB. We can't always add site/date/title to the query, because we can't assume those are set within Stash. If those fields are set, I recommend you disable the "scrape_with_filenames" flag so that the query includes the date/site/title. Alternatively, if you have scenes sorted into folders, try the "test" branch build, which supports the "dirs_in_query" flag to add parent folder names to the query.

Including performers in the query (for non-filename scraping) and in the ambiguous results list is possible, but a bit more tricky. I haven't seen a use case that needs it yet, but let me know if the above doesn't resolve your issue.
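For illustration only, here is a rough sketch of what a results line with performers could look like. The result shape (a dict with "site", "date", "title" and a "performers" list of names) and the `format_candidate` helper are assumptions made for this example, not the script's code or TPDB's actual response format.

```python
def format_candidate(index, result):
    """Render one ambiguous match as a single line for a selection list.

    `result` is a hypothetical dict; any field may be missing or None.
    """
    performers = ", ".join(result.get("performers") or []) or "unknown performers"
    return "{}: {} | {} | {} | {}".format(
        index,
        result.get("site") or "?",
        result.get("date") or "?",
        result.get("title") or "?",
        performers,
    )


if __name__ == "__main__":
    sample = {
        "site": "Some Site",
        "date": "2019-01-01",
        "title": "Hardcore",
        "performers": ["Performer A", "Performer B"],
    }
    print(format_candidate(1, sample))
```

With generic titles, the performer names are often the only field that tells two candidates apart, which is why they are placed on the same line as the title.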

tarzanboy76 commented 4 years ago

Most of my files have pretty good quality titles / studios / dates... I'm mostly scraping to confirm things, plus add descriptions and tags. So I did disable scrape_with_filenames. The problem is that searching by title when it is something generic like 'Hardcore' makes it difficult to know what entry was actually being searched... so when you get a response back from TPDB, I can't confirm which is the correct match.

I'm not a Python coder, but I made the following tweak on a local copy to help me:

     if parse_with_filename:
         try:
             if re.search(r'^[A-Z]:\\', scene['path']):  #If we have Windows-like paths
                 file_name = re.search(r'^[A-Z]:\\(.+\\)*(.+)\.(.+)$', scene['path']).group(2)
             else:  #Else assume Unix-like paths
                 file_name = re.search(r'^\/(.+\/)*(.+)\.(.+)$', scene['path']).group(2)
         except Exception:
             print("Error when parsing filename: "+scene['path'])
             return
         if clean_filename:
             file_name = scrubFileName(file_name)

         scrape_query = file_name
         print("Grabbing Data For: "+scrape_query)

     else:
         scrape_query = scene_data['title']
         print("Grabbing Data For: " + scene['title'] +' ['+scene['studio']['name']+'] ('+scene['date']+')')

     scraped_data = scrapeMetadataAPI(scrape_query)

This definitely helped.
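One caveat with the tweak above: as noted earlier in the thread, studio and date are not always set in Stash, so the string concatenation could fail on a scene where either is missing. A minimal sketch of a more defensive label, assuming `scene` is a dict shaped like the one in the snippet (the `describe_scene` helper is hypothetical and not part of the script):

```python
def describe_scene(scene):
    """Build a label such as 'Hardcore [Some Studio] (2019-01-01)'.

    Studio and date are treated as optional, since Stash may not have them.
    """
    label = scene.get("title") or "(untitled)"
    studio = scene.get("studio") or {}  # studio may be None in this sketch
    if studio.get("name"):
        label += " [" + studio["name"] + "]"
    if scene.get("date"):
        label += " (" + scene["date"] + ")"
    return label


if __name__ == "__main__":
    print("Grabbing Data For: " + describe_scene(
        {"title": "Hardcore", "studio": {"name": "Some Studio"}, "date": "2019-01-01"}
    ))
```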

pierre-delecto commented 4 years ago

Ah! I see the issue now. The "Grabbing Data For" line is printed before the built-in disambiguation for titles (where the script adds the studio and date to the query to try to disambiguate). It should be printed after, so that it reflects the updated scrape_query. The end result should be the same as your edit, just a bit cleaner code-wise.

I will edit to reflect this. Thanks for the feedback!
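A minimal sketch of that ordering, under stated assumptions: `build_scrape_query` and `scrape_scene` are hypothetical names, and the way studio and date are appended here is illustrative rather than the script's actual disambiguation logic. The only point is that the message is printed once the query is final.

```python
def build_scrape_query(scene):
    """Hypothetical disambiguation: title plus studio name and date when set."""
    parts = [scene.get("title") or ""]
    studio = scene.get("studio") or {}
    if studio.get("name"):
        parts.append(studio["name"])
    if scene.get("date"):
        parts.append(scene["date"])
    return " ".join(part for part in parts if part)


def scrape_scene(scene, scrape=lambda query: None):
    """Build the final query first, then log it, then run the lookup.

    `scrape` stands in for the real API call and is an assumption here.
    """
    scrape_query = build_scrape_query(scene)
    # Printed after disambiguation, so the log matches what is actually sent.
    print("Grabbing Data For: " + scrape_query)
    return scrape(scrape_query)


if __name__ == "__main__":
    scrape_scene({"title": "Hardcore", "studio": {"name": "Some Studio"}, "date": "2019-01-01"})
```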