laxamentumtech / audnexus

An audiobook data aggregation API that harmonizes data from multiple sources into a unified stream. It offers a consistent and user-friendly source of audiobook data for various applications.
https://audnex.us/
GNU General Public License v3.0

Any reason not to use the Audible API for your "extended" genres? #432

Closed csandman closed 1 year ago

csandman commented 1 year ago

In another issue I made I remember you mentioning that you wanted this to be an API first app with heavy safeguards for anything scraped from HTML. However, when looking through your code I noticed that you're using cheerio to scrape the extended genres from each book. I was curious if you were aware of the category_ladders response group when using the Audible API, and if you are, is there any reason you're not using that instead?

For example, the response for this endpoint:

or more simply:

gives you the following object in your response:

{
  "product": {
    "asin": "B002UZKI96",
    "category_ladders": [
      {
        "ladder": [
          {
            "id": "18572091011",
            "name": "Children's Audiobooks"
          },
          {
            "id": "18572092011",
            "name": "Action & Adventure"
          }
        ],
        "root": "Genres"
      },
      {
        "ladder": [
          {
            "id": "18572091011",
            "name": "Children's Audiobooks"
          }
        ],
        "root": "Genres"
      },
      {
        "ladder": [
          {
            "id": "18572091011",
            "name": "Children's Audiobooks"
          },
          {
            "id": "18572491011",
            "name": "Literature & Fiction"
          },
          {
            "id": "18572505011",
            "name": "Family Life"
          }
        ],
        "root": "Genres"
      },
      {
        "ladder": [
          {
            "id": "18573267011",
            "name": "Education & Learning"
          }
        ],
        "root": "Genres"
      },
      {
        "ladder": [
          {
            "id": "18574784011",
            "name": "Relationships, Parenting & Personal Development"
          },
          {
            "id": "18574814011",
            "name": "Relationships"
          }
        ],
        "root": "Genres"
      }
    ]
  },
  "response_groups": [
    "always-returned",
    "category_ladders"
  ]
}

This response_group is available for both the search and the individual product details endpoints and as far as I can tell it returns all of the genres and tags you're including in your extended genre field. All you have to do is filter the genres to unique ASINs and you're good to go.

Just figured I'd let you know in case you weren't aware this field was available or ask about your reasoning for not using them if you were, as it could influence whether or not I use them in my own app.

djdembeck commented 1 year ago

I wasn't aware that response group existed. It seems it was only added to the docs last month: https://github.com/mkb79/Audible/commit/05bb9994d45ae5a5d50cf2b9b7669ef959111ce4

That being said, this is a very disorganized genre system. It might be possible to store all the categories and then check which genres exist as genres and which exist as tags. I'll think about it; thanks for bringing it up!

csandman commented 1 year ago

Another question then, is there any practical difference between genres and tags? I know they are listed under different parts of the page when viewing an individual audiobook, but they all link to different parts of Audible's category view (https://www.audible.com/categories). As in, if you go to https://www.audible.com/cat/{genre/tag asin} it works all the same regardless of which one you're using.

It seems to only really be relevant in terms of which categories Audible chooses to use for the breadcrumbs in their page navigation, for which I'm sure they have some sort of hierarchy, but otherwise I don't think there's a difference. In fact, you can see from the category_ladders field that they're all listed as having the root Genres, which would imply that there's no real difference. And as far as I can tell, the genres Audible uses for their breadcrumbs are just the first two items in the first ladder in the returned list from the API, but I'd have to look more to verify whether that is the case.
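As a rough sketch of that breadcrumb guess (still unverified), the first two entries of the first ladder could be pulled out like this. The helper name `guessBreadcrumbs` is hypothetical, and the input is assumed to have the `product` shape from the response shown earlier, trimmed here to its first ladder.

```javascript
// Hypothetical helper: returns the names of the first two categories in the
// first ladder. This is only a guess at how Audible picks its breadcrumbs.
function guessBreadcrumbs(product) {
  const firstLadder = product.category_ladders?.[0]?.ladder ?? [];
  return firstLadder.slice(0, 2).map((category) => category.name);
}

// First ladder of the B002UZKI96 response shown earlier in this thread.
const sampleProduct = {
  asin: "B002UZKI96",
  category_ladders: [
    {
      ladder: [
        { id: "18572091011", name: "Children's Audiobooks" },
        { id: "18572092011", name: "Action & Adventure" },
      ],
      root: "Genres",
    },
  ],
};

console.log(guessBreadcrumbs(sampleProduct));
// Books without any ladders simply yield an empty array.
console.log(guessBreadcrumbs({ asin: "none", category_ladders: [] }));
```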

djdembeck commented 1 year ago

I use genre when a category is a parent and tag when a category is a child. You can see this list here: https://api.audible.com/1.0/catalog/categories

Using the ladders is probably the cleaner way to go and I believe would completely remove scraping on the book side. I will look into implementation once we achieve better unit test coverage.

csandman commented 1 year ago

Ah ok, then I believe it should just be the first category in each ladder as your genre and everything below it as tags, right?

djdembeck commented 1 year ago

Ok, gonna start testing this out this/next week now that unit tests should safely catch any discrepancies between this method and scraping.

djdembeck commented 1 year ago

So I tested this on 2 books so far. It seems the first 2 are the genres, the rest are tags. I will need to test this more since I imagine not all books have 2 categories assigned. Test cases:

    B08C6YJ1LS
    [
      { id: '18574597011', name: 'Mystery, Thriller & Suspense' },
      { id: '18574621011', name: 'Thriller & Suspense' },
      { id: '18574623011', name: 'Crime Thrillers' }
    ]

    B017V4IM1G
    [
      { id: '18572091011', name: "Children's Audiobooks" },
      { id: '18572491011', name: 'Literature & Fiction' },
      { id: '18572505011', name: 'Family Life' },
      { id: '18580606011', name: 'Science Fiction & Fantasy' },
      { id: '18572587011', name: 'Fantasy & Magic' },
      { id: '18580607011', name: 'Fantasy' }
    ]

I would imagine the safest way to ensure this, is to use a static list to check against, though I hate looping through lists. Their genre list, last I checked, was:

[
    "Arts & Entertainment",
    "Biographies & Memoirs",
    "Business & Careers",
    "Children's Audiobooks",
    "Computers & Technology",
    "Education & Learning",
    "Erotica",
    "Health & Wellness",
    "History",
    "Home & Garden",
    "LGBTQ+",
    "Literature & Fiction",
    "Money & Finance",
    "Mystery, Thriller & Suspense",
    "Politics & Social Sciences",
    "Relationships, Parenting & Personal Development",
    "Religion & Spirituality",
    "Romance",
    "Science & Engineering",
    "Science Fiction & Fantasy",
    "Sports & Outdoors",
    "Teen & Young Adult",
    "Travel & Tourism"
]

Doing so might make the sort/filter process pretty streamlined since we can automatically assign genre/tag if it matches or doesn't match.
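A minimal sketch of that static-list check, assuming the categories arrive as a flat array of `{ id, name }` objects like the test cases above. The names `TOP_LEVEL_GENRES` and `splitGenresAndTags` are made up for illustration and are not taken from the Audnexus code. Using a `Set` makes each lookup a single `has()` call rather than a loop over the list. Note that with strict name matching only names in the list count as genres, so the result can differ from the "first 2" observation above.

```javascript
// Top-level genre names copied from the list above (it may be out of date).
const TOP_LEVEL_GENRES = new Set([
  "Arts & Entertainment", "Biographies & Memoirs", "Business & Careers",
  "Children's Audiobooks", "Computers & Technology", "Education & Learning",
  "Erotica", "Health & Wellness", "History", "Home & Garden", "LGBTQ+",
  "Literature & Fiction", "Money & Finance", "Mystery, Thriller & Suspense",
  "Politics & Social Sciences",
  "Relationships, Parenting & Personal Development",
  "Religion & Spirituality", "Romance", "Science & Engineering",
  "Science Fiction & Fantasy", "Sports & Outdoors", "Teen & Young Adult",
  "Travel & Tourism",
]);

// Hypothetical helper: a category is a genre if its name is in the static
// list and a tag otherwise. Only the first occurrence of each name is kept.
function splitGenresAndTags(categories) {
  const genres = new Map();
  const tags = new Map();
  for (const category of categories) {
    const bucket = TOP_LEVEL_GENRES.has(category.name) ? genres : tags;
    if (!bucket.has(category.name)) bucket.set(category.name, category);
  }
  return { genres: [...genres.values()], tags: [...tags.values()] };
}

// Categories from the B08C6YJ1LS test case above.
const result = splitGenresAndTags([
  { id: "18574597011", name: "Mystery, Thriller & Suspense" },
  { id: "18574621011", name: "Thriller & Suspense" },
  { id: "18574623011", name: "Crime Thrillers" },
]);

console.log(result.genres.map((c) => c.name)); // Mystery, Thriller & Suspense
console.log(result.tags.map((c) => c.name)); // Thriller & Suspense, Crime Thrillers
```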

csandman commented 1 year ago

I know that some books have no genres or tags assigned whatsoever, which could impact your testing. Here's an example of one that has no genres/tags on the book's page or in the response from category ladders:

csandman commented 1 year ago

I'm still a little confused by your definition of genre vs. tag though. When looking at a book's webpage, you have the breadcrumbs at the top. I believe these are always the first two categories listed in the first ladder returned when pulling category_ladders. However, based on your description about the difference, only the first of these two would be a genre. Are you counting these both as genres?

Otherwise, the first category in each of the ladders would be your parent genre tag. Does that sound right?

djdembeck commented 1 year ago

Essentially, genres are parent categories, and tags are categories that can be assigned regardless of a parent. I've never seen more than 2 parent categories assigned to a book. For example, let's look at the Children's Audiobooks category:

        {
            "children": [
                {
                    "id": "18572092011",
                    "name": "Action & Adventure"
                },
                {
                    "id": "18572099011",
                    "name": "Animals & Nature"
                },
                {
                    "id": "18572180011",
                    "name": "Art"
                },
                {
                    "id": "18572189011",
                    "name": "Biographies"
                },
                {
                    "id": "18572204011",
                    "name": "Education & Learning"
                },
                {
                    "id": "18572258011",
                    "name": "Fairy Tales, Folk Tales & Myths"
                },
                {
                    "id": "18572433011",
                    "name": "History"
                },
                {
                    "id": "18572452011",
                    "name": "Holidays & Celebrations"
                },
                {
                    "id": "18572491011",
                    "name": "Literature & Fiction"
                },
                {
                    "id": "18572532011",
                    "name": "Music & Performing Arts"
                },
                {
                    "id": "18572548011",
                    "name": "Mystery & Suspense"
                },
                {
                    "id": "18572553011",
                    "name": "Religions"
                },
                {
                    "id": "18572561011",
                    "name": "Science & Technology"
                },
                {
                    "id": "18572586011",
                    "name": "Science Fiction & Fantasy"
                },
                {
                    "id": "18572601011",
                    "name": "Sports & Outdoors"
                },
                {
                    "id": "18572622011",
                    "name": "Vehicles & Transportation"
                }
            ],
            "id": "18572091011",
            "name": "Children's Audiobooks"
        }

There are many children/tags (names) that other parents have as well, such as Art and Action & Adventure. I say names because each child has a unique ID related to a specific parent. For the purposes of Audnexus, the ID isn't as important as the name.

Further, any software implementing Audnexus responses can simply join the genres and tags arrays and discard the type field relatively easily.
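For consumers that don't care about the distinction, that merge could look roughly like the sketch below. The response shape is an assumption made only for this example (a `genres` array and a `tags` array whose items carry a `type` field), and `mergeCategories` is a hypothetical name; check a real Audnexus response before relying on it.

```javascript
// Assumed, illustrative shape: two arrays whose items carry a `type` field.
const book = {
  genres: [{ name: "Children's Audiobooks", type: "genre" }],
  tags: [
    { name: "Action & Adventure", type: "tag" },
    { name: "Family Life", type: "tag" },
  ],
};

// Hypothetical helper: joins both arrays and drops `type` from every item,
// keeping all other fields untouched. Missing arrays are treated as empty.
function mergeCategories({ genres = [], tags = [] }) {
  return [...genres, ...tags].map(({ type, ...rest }) => rest);
}

console.log(mergeCategories(book));
```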

There are also different views of Audible HTML, if you weren't aware. There's a signed-in version and a signed-out version. Test it out incognito on a product details page.

csandman commented 1 year ago

Ok then this part is what I'm confused about:

So I tested this on 2 books so far. It seems the first 2 are the genres, the rest are tags. I will need to test this more since I imagine not all books have 2 categories assigned. Test cases:

When you say first 2, do you mean the first item in each of the first 2 ladders, or the first two items in the first ladder? Because as far as I can tell based on your description, the "genres" are a combination of all of the top level categories in all of the ladders. There are also definitely books with more than 2 top level categories (see my first example) and books with only 1. They are not always from the first two ladders however. For "Where the Crawdads Sing" for example:

// https://api.audible.com/1.0/catalog/products/B07FSNSLZ1?response_groups=category_ladders

{
  "product": {
    "asin": "B07FSNSLZ1",
    "category_ladders": [
      {
        "ladder": [
          {
            "id": "18574426011",
            "name": "Literature & Fiction"
          },
          {
            "id": "18574456011",
            "name": "Genre Fiction"
          },
          {
            "id": "18574461011",
            "name": "Coming of Age"
          }
        ],
        "root": "Genres"
      },
      {
        "ladder": [
          {
            "id": "18574426011",
            "name": "Literature & Fiction"
          },
          {
            "id": "18574456011",
            "name": "Genre Fiction"
          },
          {
            "id": "18574468011",
            "name": "Literary Fiction"
          }
        ],
        "root": "Genres"
      },
      {
        "ladder": [
          {
            "id": "18574426011",
            "name": "Literature & Fiction"
          },
          {
            "id": "18574482011",
            "name": "Historical Fiction"
          }
        ],
        "root": "Genres"
      },
      {
        "ladder": [
          {
            "id": "18574426011",
            "name": "Literature & Fiction"
          },
          {
            "id": "18574520011",
            "name": "Women's Fiction"
          }
        ],
        "root": "Genres"
      },
      {
        "ladder": [
          {
            "id": "18580518011",
            "name": "Romance"
          },
          {
            "id": "18580524011",
            "name": "Historical"
          }
        ],
        "root": "Genres"
      }
    ]
  },
  "response_groups": [
    "always-returned",
    "category_ladders"
  ]
}

The two top-level categories here are "Literature & Fiction" and "Romance", and the "Romance" genre doesn't appear until the last ladder. The rest of each ladder describes the nesting order of the "tags" within each of those top-level categories.
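To make the ordering point concrete, here is a small sketch that records the index of the first ladder in which each top-level category shows up. The helper names are hypothetical. The URL pattern is the one quoted above the response; whether the endpoint needs authentication or extra headers in practice is not something this sketch assumes to know, and `fetchCategoryLadders` expects a runtime with a global `fetch` (it is defined but not called here).

```javascript
const API_BASE = "https://api.audible.com/1.0/catalog/products";

// Builds the request URL in the same form as the one quoted above.
function categoryLaddersUrl(asin) {
  return `${API_BASE}/${encodeURIComponent(asin)}?response_groups=category_ladders`;
}

// Hypothetical helper: maps each top-level category name to the index of
// the first ladder it appears in. Empty ladders are skipped.
function topLevelPositions(product) {
  const positions = new Map();
  (product.category_ladders ?? []).forEach(({ ladder }, index) => {
    const top = ladder[0];
    if (top && !positions.has(top.name)) positions.set(top.name, index);
  });
  return positions;
}

// Untested network sketch: fetches a product and returns its `product` field.
async function fetchCategoryLadders(asin) {
  const response = await fetch(categoryLaddersUrl(asin));
  if (!response.ok) throw new Error(`Request failed: ${response.status}`);
  const { product } = await response.json();
  return product;
}

// Ladder names from the B07FSNSLZ1 response above (ids omitted for brevity).
const names = [
  ["Literature & Fiction", "Genre Fiction", "Coming of Age"],
  ["Literature & Fiction", "Genre Fiction", "Literary Fiction"],
  ["Literature & Fiction", "Historical Fiction"],
  ["Literature & Fiction", "Women's Fiction"],
  ["Romance", "Historical"],
];
const crawdads = {
  asin: "B07FSNSLZ1",
  category_ladders: names.map((ladder) => ({
    ladder: ladder.map((name) => ({ name })),
    root: "Genres",
  })),
};

const positions = topLevelPositions(crawdads);
console.log([...positions]); // "Romance" first appears at index 4
```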

Overall there can be up to 3 levels of categories in a ladder. The categories API endpoint you link above displays only the top level categories and their direct children. Those children can then each have one more level of categories below them:

I would imagine the safest way to ensure this, is to use a static list to check against, though I hate looping through lists.

So this is the part I think we might have a miscommunication on, as the responses themselves seem to be consistently organized. Does what I'm saying make sense haha? And are you worried about how this breakdown compares to your previous breakdown, or about whether it's internally consistent?

djdembeck commented 1 year ago

I believe we are talking past each other here. Quite simply, genres are the header categories listed here: https://www.audible.com/categories. I'm sorry, I don't feel like explaining much more than that logic wise :sweat_smile:

You can see the new feature I added in https://github.com/laxamentumtech/audnexus/pull/437

djdembeck commented 1 year ago

Interestingly/cool enough, I'm also getting genres on API-only books (no longer available in the store), so that's an awesome side effect here!

csandman commented 1 year ago

Interestingly/cool enough, I'm also getting genres on API-only books (no longer available in the store), so that's an awesome side effect here!

Yeah that's the main reason I thought it would be beneficial to use it originally haha, definitely more consistent.

And I just checked out the code! What I was saying before was that you shouldn't have to check the categories against a static list at all. If you just take the first category from each ladder and filter to the unique ones, you'll have your genres, and the rest are tags.

const product = {
  asin: "B002UZKI96",
  category_ladders: [
    {
      ladder: [
        {
          id: "18572091011",
          name: "Children's Audiobooks",
        },
        {
          id: "18572092011",
          name: "Action & Adventure",
        },
      ],
      root: "Genres",
    },
    {
      ladder: [
        {
          id: "18572091011",
          name: "Children's Audiobooks",
        },
      ],
      root: "Genres",
    },
    {
      ladder: [
        {
          id: "18572091011",
          name: "Children's Audiobooks",
        },
        {
          id: "18572491011",
          name: "Literature & Fiction",
        },
        {
          id: "18572505011",
          name: "Family Life",
        },
      ],
      root: "Genres",
    },
    {
      ladder: [
        {
          id: "18573267011",
          name: "Education & Learning",
        },
      ],
      root: "Genres",
    },
    {
      ladder: [
        {
          id: "18574784011",
          name: "Relationships, Parenting & Personal Development",
        },
        {
          id: "18574814011",
          name: "Relationships",
        },
      ],
      root: "Genres",
    },
  ],
};

// Copy each ladder so that shift() below does not mutate the product data
const allCategories = product.category_ladders.map(({ ladder }) => [...ladder]);

// First item from each ladder is the parent category (genre); empty ladders are skipped
const allGenres = allCategories.map((ladder) => ladder.shift()).filter(Boolean);
// Deduplicate by name, keeping one entry per category name
const genres = [...new Map(allGenres.map((item) => [item.name, item])).values()];

// The rest in each ladder are the child categories (tags)
const allTags = allCategories.flat();
const tags = [...new Map(allTags.map((item) => [item.name, item])).values()];

console.log(genres);
/* [{
  id: "18572091011",
  name: "Children's Audiobooks"
}, {
  id: "18573267011",
  name: "Education & Learning"
}, {
  id: "18574784011",
  name: "Relationships, Parenting & Personal Development"
}] */

console.log(tags);
/* [{
  id: "18572092011",
  name: "Action & Adventure"
}, {
  id: "18572491011",
  name: "Literature & Fiction"
}, {
  id: "18572505011",
  name: "Family Life"
}, {
  id: "18574814011",
  name: "Relationships"
}] */

https://jsfiddle.net/b0s6pram/8/


EDIT: That is, unless you're saying that a category like "Literature & Fiction", which is used as a child category in this case, should still count as a genre because it is a top-level category elsewhere. If that's the case, then yes, I suppose you'd have to check it against the list.