tboothman / imdbphp

PHP library for retrieving film and tv information from IMDb
imdb new format "Release info" page #291

Closed: mam4dali closed this issue 1 year ago

mam4dali commented 1 year ago

Hello. Today I noticed some problems that stop this library from working properly. After checking, I found that the release information page is being served in a new format. Its structure differs from the previous one, which breaks the parsing. pic: The-Quiet-Girl-2022-Release-info-IMDb

html source: html source.txt

I think we should adjust to the new structure soon

jreklund commented 1 year ago

Yeah, they have changed multiple pages to their new format. You can still get the old one sometimes, so they are still rolling the change out across their servers. That means if you change the code for the new design and re-run the tests, you may get the old page and the tests fail.

We are either stuck with limited data (what is shown in the picture), or someone needs to "crack" how their GraphQL works, as it's the only way to get the rest of the data. You need to send the correct variables and extensions or the request will fail, so they apparently check for bots.

https://caching.graphql.imdb.com/?operationName=TitleReleaseDatesPaginated&variables={data}&extensions={data}
Content-Type: application/json

GeorgeFive commented 1 year ago

Just to keep everything in one place, this also affects Companies page. Same thing.

Thomasdouscha commented 1 year ago

It cannot get release info.

mam4dali commented 1 year ago

I have spent the last few days testing and reviewing. For now I can get the information with the following command. Changing the IMDb ID in the data also works correctly. However, I think this approach should be investigated further; I am confused about the sha256Hash value.

curl --location -g --request GET "https://caching.graphql.imdb.com/?operationName=TitleAkasPaginated&variables={\"after\":\"\",\"const\":\"tt15576994\",\"first\":200,\"locale\":\"en-US\",\"originalTitleText\":false}&extensions={\"persistedQuery\":{\"sha256Hash\":\"180f0f5df1b03c9ee78b1f410d65928ec22e7aca590e5321fbb6a6c39b802695\",\"version\":1}}" --header "authority: caching.graphql.imdb.com" --header "accept: application/graphql+json, application/json" --header "accept-language: en-US;q=0.9,en;q=0.8" --header "cache-control: no-cache" --header "content-type: application/json" --header "origin: https://www.imdb.com" --header "pragma: no-cache" --header "referer: https://www.imdb.com/" --header "user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36" --header "x-imdb-client-name: imdb-web-next-localized" --header "x-imdb-user-country: US" --header "x-imdb-user-language: en-US"
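The URL in the command above can also be assembled programmatically. This is a minimal sketch, assuming the endpoint accepts the same `operationName`, `variables` and `extensions` query parameters shown in the curl call; the function name `build_persisted_query_url` is made up for illustration, and the headers from the curl command would presumably still need to be sent with the request.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint taken from the curl command above.
ENDPOINT = "https://caching.graphql.imdb.com/"


def build_persisted_query_url(operation_name, variables, sha256_hash):
    """Return a GET URL for a persisted GraphQL query (hypothetical helper)."""
    extensions = {"persistedQuery": {"sha256Hash": sha256_hash, "version": 1}}
    params = {
        "operationName": operation_name,
        # Compact separators keep the JSON free of extra whitespace.
        "variables": json.dumps(variables, separators=(",", ":")),
        "extensions": json.dumps(extensions, separators=(",", ":")),
    }
    # urlencode percent-encodes the JSON so it is safe inside a query string.
    return ENDPOINT + "?" + urlencode(params)


url = build_persisted_query_url(
    "TitleAkasPaginated",
    {
        "after": "",
        "const": "tt15576994",
        "first": 200,
        "locale": "en-US",
        "originalTitleText": False,
    },
    "180f0f5df1b03c9ee78b1f410d65928ec22e7aca590e5321fbb6a6c39b802695",
)
print(url)
```

Changing the `const` value is the only edit needed to request another title, which matches the observation that swapping the IMDb ID in the data works.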

tboothman commented 1 year ago

https://www.apollographql.com/docs/apollo-server/performance/apq/ Presumably that query will work using the hash until one day it doesn't any more.

I had a look at the GraphQL endpoint a while ago, and it was promising but didn't have all the information. Still, getting most of the info from an endpoint that returns JSON is nicer than HTML parsing: https://github.com/tboothman/imdbphp/issues/221#issuecomment-873419219

GeorgeFive commented 1 year ago

Knew it was coming, but I'm no longer receiving any company data, at all. Assuming the new page has rolled out completely, and though company info is the only thing I am affected by, I would assume it affects releases and akas and whatever else. Used to work after a refresh or two, but now, nothing.

Thomasdouscha commented 1 year ago

Please someone should fix it.

jreklund commented 1 year ago

Here is what I know, somebody can figure out the rest.

function y() {
    var n = (0, s.Z)(["\n    query TitleAkasPaginated($const: ID!, $first: Int!, $after: ID) {\n        title(id: $const) {\n            akas(\n                first: $first\n                after: $after\n                sort: { by: COUNTRY, order: ASC }\n            ) {\n                ...AkaItem\n            }\n        }\n    }\n    ", "\n"]);
    return y = function() {
        return n
    }, n
}
function B(e, t) {
    var n = k(e),
        r = e.context.url;
    if (!n || !t) return r;
    var i = [];
    t.operationName && i.push("operationName=" + encodeURIComponent(t.operationName)), t.query && i.push("query=" + encodeURIComponent(t.query.replace(/#[^\n\r]+/g, " ").trim())), t.variables && i.push("variables=" + encodeURIComponent(N(t.variables))), t.extensions && i.push("extensions=" + encodeURIComponent(N(t.extensions)));
    var o = r + "?" + i.join("&");
    return o.length > 2047 ? (e.context.preferGetMethod = !1, r) : o
}

I guess it's easier to store the sha256 hash than to download this file, parse it and generate a hash, just to send a cached request. If they change the query, the hash will not match anymore anyway.

With all the changes done to IMDb, maybe it's time to just close up shop and migrate to TMDb.

GeorgeFive commented 1 year ago

Personally, I'd love to see someone crack this problem and get a fix in so we can continue with IMDb. I like TMDb, but they don't hold a candle to IMDb in terms of the data that they offer.

jreklund commented 1 year ago

php-tmdb/api is the most up-to-date PHP library I know of; I started looking at it before I did the last round of fixes to imdbphp. There are some things not present on TMDb and vice versa. I use akas and parental guidance (exclusive) in my application.

The data needed to find the hash is all there, but I haven't seen the obvious bit where they actually do the computing. GenerateHash is there, but I haven't seen the actual conversion. I have tried passing the whole query into sha256 without luck, so I'm missing some data (or need to remove some).

But if we rely on calculating the sha256, we need to grab those queries every time to have a valid hash. Or store them statically.

First we do a normal query to the website (or Graph?) and later on multiple queries to get all the extra data, so a lot of rewrites need to happen.

We could go with the limited-data approach like #292, but I would rather remove that method from my application. All or nothing.
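For reference, Apollo's automatic persisted queries define the hash as the SHA-256 hex digest of the query string. The sketch below shows that computation next to the "store them statically" alternative. Whether IMDb's client hashes exactly the text seen in the JavaScript bundle is not confirmed here (the attempt described above did not match), and `KNOWN_HASHES` and both function names are hypothetical.

```python
import hashlib


def apq_hash(query: str) -> str:
    """SHA-256 hex digest of a query string, as used by Apollo persisted queries."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()


# Hypothetical static table: hashes captured from the browser, keyed by
# operation name. The single entry is the value quoted earlier in the thread.
KNOWN_HASHES = {
    "TitleAkasPaginated": "180f0f5df1b03c9ee78b1f410d65928ec22e7aca590e5321fbb6a6c39b802695",
}


def persisted_query_extension(operation_name, query=None):
    """Build the `extensions` object, hashing the query if one is supplied."""
    sha = apq_hash(query) if query is not None else KNOWN_HASHES[operation_name]
    return {"persistedQuery": {"version": 1, "sha256Hash": sha}}


print(persisted_query_extension("TitleAkasPaginated"))
```

Any difference in whitespace or in appended fragments changes the digest, which is one possible reason a hand-computed hash would not match the one the site sends.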

tboothman commented 1 year ago

It's a bit tedious to explore the schema, but it's very possible to get this data from the GraphQL API. Here's the schema for the Title object: (image). Results of fetching the release dates: (image). Here's the query to fetch information about a GraphQL type:

query Type($type: String!) {
  __type(name: $type) {
    ...FullType
  }
}

fragment FullType on __Type {
  kind
  name
  description
  fields(includeDeprecated: true) {
    name
    description
    args {
      ...InputValue
    }
    type {
      ...TypeRef
    }
    isDeprecated
    deprecationReason
  }
  inputFields {
    ...InputValue
  }
  interfaces {
    ...TypeRef
  }
  enumValues(includeDeprecated: true) {
    name
    description
    isDeprecated
    deprecationReason
  }
  possibleTypes {
    ...TypeRef
  }
}

fragment InputValue on __InputValue {
  name
  description
  type {
    ...TypeRef
  }
  defaultValue
}

fragment TypeRef on __Type {
  kind
  name
  ofType {
    kind
    name
    ofType {
      kind
      name
      ofType {
        kind
        name
        ofType {
          kind
          name
          ofType {
            kind
            name
            ofType {
              kind
              name
              ofType {
                kind
                name
              }
            }
          }
        }
      }
    }
  }
}
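
The TypeRef fragment nests `ofType` several levels deep because introspection wraps types in `NON_NULL` and `LIST` kinds. A small sketch of how such a nested reference could be collapsed back into the usual GraphQL notation follows; the function name is made up, and the sample type name `ReleaseDate` is only a placeholder, not a confirmed name from the IMDb schema.

```python
def type_ref_to_string(ref):
    """Collapse a nested introspection TypeRef into notation like [Foo!]!."""
    kind = ref["kind"]
    if kind == "NON_NULL":
        # NON_NULL wraps another type and is written with a trailing "!".
        return type_ref_to_string(ref["ofType"]) + "!"
    if kind == "LIST":
        # LIST wraps another type and is written in square brackets.
        return "[" + type_ref_to_string(ref["ofType"]) + "]"
    # Named kinds (OBJECT, SCALAR, ENUM, ...) end the chain.
    return ref["name"]


# Placeholder data in the shape the TypeRef fragment returns.
sample = {
    "kind": "NON_NULL",
    "name": None,
    "ofType": {
        "kind": "LIST",
        "name": None,
        "ofType": {
            "kind": "NON_NULL",
            "name": None,
            "ofType": {"kind": "OBJECT", "name": "ReleaseDate", "ofType": None},
        },
    },
}

print(type_ref_to_string(sample))
```

This makes it quicker to read field types when walking the schema one type at a time.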

Query to fetch the releaseDates:

{
  title(id:"tt0120737") {
    releaseDates(first: 9999) {
      edges {
        node {
          country {
            id
            text
          }
          day
          month
          year
          attributes {
            id
            text
          }
        }
      }
    }
  }
}
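
A response to the releaseDates query above would typically mirror the query's shape under a top-level `data` key. The sketch below flattens such a response into plain rows. It assumes that shape, the function name is hypothetical, and the sample values are invented placeholders rather than real IMDb data.

```python
def flatten_release_dates(response):
    """Turn the nested edges/node structure into a flat list of dicts."""
    edges = response["data"]["title"]["releaseDates"]["edges"]
    rows = []
    for edge in edges:
        node = edge["node"]
        country = node.get("country")
        rows.append({
            "country": country["text"] if country else None,
            "day": node.get("day"),
            "month": node.get("month"),
            "year": node.get("year"),
            # attributes may be missing or null, so fall back to an empty list.
            "attributes": [a["text"] for a in node.get("attributes") or []],
        })
    return rows


# Invented placeholder response, shaped like the query.
sample_response = {
    "data": {"title": {"releaseDates": {"edges": [
        {"node": {"country": {"id": "XX", "text": "Exampleland"},
                  "day": 1, "month": 2, "year": 2003,
                  "attributes": [{"id": "premiere", "text": "premiere"}]}},
        {"node": {"country": {"id": "YY", "text": "Samplestan"},
                  "day": None, "month": None, "year": 2004,
                  "attributes": None}},
    ]}}}
}

for row in flatten_release_dates(sample_response):
    print(row)
```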

tboothman commented 1 year ago

Imdb have blocked the __schema {} request that lets you fetch everything about the endpoint and makes the documentation and query helpers work for graphiql. I'm going to see if I can make an endpoint/proxy that'll let you get the full schema so graphiql can help you type out queries. It's pretty slow work fetching the schema for each subobject and typing out the query without any editor support.

duck7000 commented 1 year ago

@tboothman Pardon my ignorance, but that API has no public entry point, so how can you get any data from it? Are you using your personal account to figure this out? This is important for any user of this library to know, so I'm not asking just for myself.

tboothman commented 1 year ago

It is a public api. It's what the frontend of the website is using to fetch data to render the page. For example, if you go to https://www.imdb.com/title/tt0120737/releaseinfo and press '50 more' under release date it does a graphql query to fetch some more data (which is where I saw the response shape and guessed the request shape for the query I put above)

tboothman commented 1 year ago

I've made a proxy that makes graphiql work locally so you can browse around the imdb API. It's pretty incredible how many types they've got (~1000). I imagine you can fetch a good portion of all the data from there now.

https://github.com/tboothman/imdbphp/commit/4b906af775232f91a1882393a8af6ae3a282c1e1

  1. Get GraphiQL (the chrome extension is nice) https://chrome.google.com/webstore/detail/graphiql-extension/jhbedfdjpmemmbghfecnaeeiokonjclb?hl=en
  2. Check out the graphql branch and go into the graphql folder
  3. composer install (get composer if you haven't https://getcomposer.org/download/)
  4. php -S localhost:5000
  5. Go to http://localhost:5000 in the GraphiQL extension
  6. Have a look in the docs and see what's possible. Try the query I've put in here and see what else you can find

tboothman commented 1 year ago

Wrote a bit of code to get releaseInfo using GraphQL. It just has a big string for the query in the code, which works fine but basically relies on you writing the query out using GraphiQL. https://github.com/tboothman/imdbphp/blob/graphql/src/Imdb/Title.php#L2591

Also tried out a library that builds a load of classes for you so you can 'safely' query the schema, because it knows about all the objects available. It works well enough, but it's a bit verbose and harder to read than the raw GraphQL query. It doesn't tell you what each field is either, even though that's available in the schema (GraphiQL shows it). https://github.com/tboothman/imdbphp/blob/graphql/graphql/q.php#L15

tboothman commented 1 year ago

Bit of a grind but nearly there. Still two methods to fix, but I'm not particularly inclined to look at soundtracks and I'm not entirely sure what the results of the methods that use parse_extcontent were. https://github.com/tboothman/imdbphp/pull/293

duck7000 commented 1 year ago

@tboothman Well, I think the users here will appreciate all your work, it looks impressive! But in my view it will make this library more complicated and harder to understand. And is it still necessary to install a GraphiQL browser extension? If so, then this will be one more dependency for this library. Just my thoughts, though.

tboothman commented 1 year ago

Can't say that I agree. There are no new dependencies other than a new class to do GraphQL requests and cache them to disk.
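
To illustrate the idea of a class that does GraphQL requests and caches them to disk, here is a minimal Python sketch. It is not the library's PHP implementation: the class name, the cache key scheme and the injected `send` callable are all assumptions made for the example, and a fake sender stands in for the real HTTP call.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class CachedGraphQL:
    """Hypothetical client that caches each query result as a JSON file."""

    def __init__(self, cache_dir, send):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.send = send  # callable(query, variables) -> decoded response

    def _path(self, query, variables):
        # Key the cache on the query text plus its variables.
        key = json.dumps({"query": query, "variables": variables}, sort_keys=True)
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.cache_dir / (digest + ".json")

    def query(self, query, variables=None):
        variables = variables or {}
        path = self._path(query, variables)
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))
        result = self.send(query, variables)
        path.write_text(json.dumps(result), encoding="utf-8")
        return result


calls = []


def fake_send(query, variables):
    # Stand-in for a real HTTP request; records how often it is called.
    calls.append(query)
    return {"data": {"echo": variables}}


client = CachedGraphQL(tempfile.mkdtemp(), fake_send)
q = "query Title($id: ID!) { title(id: $id) { id } }"
first = client.query(q, {"id": "tt0120737"})
second = client.query(q, {"id": "tt0120737"})
print(first == second, len(calls))
```

The second call is served from the cache file, so the sender runs only once per distinct query and variable set.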

Calling a GraphQL API is so much simpler than parsing HTML. Take https://github.com/tboothman/imdbphp/pull/293/commits/e36653feb84c9a2b5dedcafde4a4ffc52d62f558#diff-2a4bb245c110c1fe94bfeb5c0ae91d4435d947b30473ad5acabd1366a77b1776L578 for example. movie_recommendations was an indecipherable series of XPath queries and regexes; now it's some nested field names and a loop to turn them into the right results shape.

Even if using graphql was a bad idea it's impossible to get some of the data off the page any more. All of the pages that broke changed to loading the first 5 or so items and having a load more button. e.g. https://www.imdb.com/title/tt0133093/movieconnections/ https://www.imdb.com/title/tt0133093/externalsites/

jreklund commented 1 year ago

Wow - that's quite a change in the amount of code that is needed. Need to read up on GraphQL some day, as I have never used it before.

tboothman commented 1 year ago

https://github.com/tboothman/imdbphp/releases/tag/v8.0.0