stashapp / metadata-api-discuss

This repo is the laziest possible way we can have threaded conversations about metadata collection and curation for StashApp
MIT License

Aggregating adult performers metadata - authority file / schema discussion #10

Open laurus-lx opened 2 years ago

laurus-lx commented 2 years ago

Currently stashbox supports only a single "source of truth" for scenes/performers/studios, whereas performer data aggregated from various sources (index sites, tubes, social media, studios) may differ, with varying degrees of confidence.

This is a proposal to create an authority file that will:

  1. Have a list of data sources (sites)
  2. Have a regularly updated scrape of scenes/performers metadata
  3. Keep track of metadata as it changes over time
  4. Normalize metadata (birthdays/locations/scene dates and titles/performer physical attributes)
  5. Generate periodic snapshots:
     a. Assign confidence value to performer matches across sources - link and de-dup performers
     b. Assign confidence value to metadata and de-dup
     c. Generate output scenes/performers/studios dump
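To make steps 3 and 5b more concrete, here is a minimal sketch of one way to record metadata over time and score a field across sources. Everything in it is an assumption for discussion, not an agreed schema: the `Observation` shape, the source names, and the weighted-vote confidence are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Observation:
    """One field value as seen on one source at one scrape (hypothetical shape)."""
    source: str         # e.g. "site_a"; source names here are made up
    performer_key: str  # the source's own identifier for the performer
    field: str          # e.g. "birthdate"
    value: str          # already-normalized value
    scraped_on: date    # keeps the history of changes over time


def latest_per_source(observations):
    """Keep only the newest observation per (source, field).

    Assumes all observations belong to one already-linked performer.
    """
    newest = {}
    for obs in observations:
        key = (obs.source, obs.field)
        if key not in newest or obs.scraped_on > newest[key].scraped_on:
            newest[key] = obs
    return list(newest.values())


def score_field(observations, field, source_weight):
    """Return (value, confidence) for a field.

    Confidence is the share of total source weight that agrees with the
    winning value; unknown sources default to weight 1.0.
    """
    votes = defaultdict(float)
    for obs in latest_per_source(observations):
        if obs.field == field:
            votes[obs.value] += source_weight.get(obs.source, 1.0)
    if not votes:
        return None, 0.0
    total = sum(votes.values())
    value, weight = max(votes.items(), key=lambda kv: kv[1])
    return value, weight / total
```

Older observations are kept in storage and only filtered out at snapshot time, so a value that changes on a source is still visible in the history.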


There is a discussion about adding that functionality to stash-box itself: https://discord.com/channels/559159668438728723/798641040029777980/894662081830322206

Whether this is integrated into stashbox or kept separate, we need to come up with a schema, so I wanted to start this discussion.

laurus-lx commented 2 years ago

Query for pulling external identifiers from wikipedia / wikidata (credit Tweeticoats - discord):

https://query.wikidata.org/#SELECT%20%3Fpornographic_actor%20%3Fpornographic_actorLabel%20%3Fdate_of_birth%20%3Fmass%20%3Fheight%20%3Feye_color%20%3Feye_colorLabel%20%3Fhair_color%20%3Fhair_colorLabel%20%20%3Fsex_or_gender%20%3Fsex_or_genderLabel%20%3Fplace_of_birth%20%3Fwork_period_start%20%3FTwitter_username%20%3FInstagram_username%20%3FPornhub_ID%20%3FFacebook_ID%20%3FIMDb_ID%20%3FIAFD_female_performer_ID%20%3FIAFD_male_performer_ID%20%3FAdult_Film_Database_actor_ID%20%3Fyouporn_ID%20%3FRedTube_ID%20%3FAVN_performer_ID%20%3FAWMDB_performer_ID%20%3FOnlyFans_ID%20%3FEGAFD_ID%20%3FxHamster_performer_ID%20%3FTMDb_person_ID%20%3FXXXBios_female_performer_ID%20%3FXXXBios_transgender_performer_ID%20%3FModelhub_ID%20%3Fofficial_website%20%3FVIAF_ID%20%3FPenthouse_ID%20%3FSnapchat_username%20%3FTwitch_channel_ID%20%20%3Fimage%20%3FCommons_category%20WHERE%20%7B%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%22.%20%7D%0A%20%20%3Fpornographic_actor%20wdt%3AP106%20wd%3AQ488111%20.%0A%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP569%20%3Fdate_of_birth.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP2067%20%3Fmass.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP2048%20%3Fheight.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP1340%20%3Feye_color.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP1884%20%3Fhair_color.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP21%20%3Fsex_or_gender.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP19%20%3Fplace_of_birth.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP2031%20%3Fwork_period_start.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP2002%20%3FTwitter_username.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP8718%20%3FFacebook_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP2003%20%3FInstagram_username.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP5246%20%3FPornhub_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP3869%20%3FIAFD_female_performer_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP4505%20%3FIAFD_male_performer_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP3351%20%3FAdult_Film_Database_actor_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP4267%20%3Fyouporn_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP5540%20%3FRedTube_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP8718%20%3FAVN_performer_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP8721%20%3FAWMDB_performer_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP8604%20%3FOnlyFans_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP8767%20%3FEGAFD_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP8720%20%3FxHamster_performer_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP345%20%3FIMDb_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP4985%20%3FTMDb_person_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP9233%20%3FXXXBios_female_performer_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP9174%20%3FXXXBios_transgender_performer_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP8280%20%3FModelhub_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP856%20%3Fofficial_website.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP214%20%3FVIAF_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP6290%20%3FPenthouse_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP2984%20%3FSnapchat_username.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP5797%20%3FTwitch_channel_ID.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP18%20%3Fimage.%20%7D%0A%20%20OPTIONAL%20%7B%20%3Fpornographic_actor%20wdt%3AP373%20%3FCommons_category.%20%7D%0A%0A%7D%0A
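The encoded link is hard to read and edit, so here is a rough sketch of generating the same kind of query from a property map. It only covers a handful of the properties from the link above; `PROPERTIES`, `build_query`, and `query_url` are names made up for this example, and the variable is shortened to `?performer`.

```python
from urllib.parse import quote

# Subset of the properties used in the link above (variable name -> property).
PROPERTIES = {
    "date_of_birth": "P569",
    "height": "P2048",
    "Twitter_username": "P2002",
    "IMDb_ID": "P345",
}


def build_query(properties, occupation="Q488111"):
    """Build a SPARQL query for items whose occupation (P106) matches.

    Q488111 is the occupation item used in the original query.
    """
    select = " ".join(f"?{name}" for name in properties)
    optionals = "\n".join(
        f"  OPTIONAL {{ ?performer wdt:{pid} ?{name}. }}"
        for name, pid in properties.items()
    )
    return (
        f"SELECT ?performer ?performerLabel {select} WHERE {{\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }\n'
        f"  ?performer wdt:P106 wd:{occupation} .\n"
        f"{optionals}\n"
        "}"
    )


def query_url(query):
    """Return a link that opens the query in the query.wikidata.org editor."""
    return "https://query.wikidata.org/#" + quote(query, safe="")


if __name__ == "__main__":
    print(query_url(build_query(PROPERTIES)))
```

Keeping the properties in one map means adding a new external ID is a one-line change, and a duplicated property number is easier to spot than in the encoded link.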

laurus-lx commented 2 years ago

For collaborating on the performers authority file, I think the easiest way to proceed is to share scraped performer data via torrents or file-hosting sites. Metadata can be packed into JSON. An Extract/Transform/Load script would then pull these files and transform them into a usable dataset (performing cross-referencing, normalization, and validation), so anybody can replicate the process without relying on any central host. The list of sources would be periodically expanded with new sites and with updates from existing sites.
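As a starting point, the extract and transform steps could look something like the sketch below. The dump layout (`{"source": ..., "performers": [{"name": ..., "birthdate": ...}]}`), the accepted date formats, and linking by normalized name are all assumptions made for illustration; real cross-referencing would need more than a name match.

```python
import json
import re
from datetime import datetime
from pathlib import Path

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")  # assumed input formats


def normalize_birthdate(raw):
    """Return an ISO date string, or None if no assumed format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_name(raw):
    """Lowercase and collapse whitespace so names compare consistently."""
    return re.sub(r"\s+", " ", raw).strip().lower()


def extract(dump_dir):
    """Yield (source, record) pairs from every *.json dump in a directory."""
    for path in sorted(Path(dump_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            dump = json.load(fh)
        for record in dump["performers"]:
            yield dump["source"], record


def transform(pairs):
    """Normalize records and group them by normalized name (naive linking)."""
    dataset = {}
    for source, record in pairs:
        key = normalize_name(record["name"])
        dataset.setdefault(key, []).append({
            "source": source,
            "birthdate": normalize_birthdate(record.get("birthdate", "")),
        })
    return dataset
```

Running `transform(extract("dumps/"))` over a folder of downloaded dumps (the folder name is just an example) gives one entry per normalized name with the per-source values side by side, which is the input a validation or confidence step would need.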