Closed mam4dali closed 1 year ago
Yeah, they have changed multiple pages to their new format. You can still get the old one sometimes, so they are still rolling it out across all servers. So if you change the code for the new design and re-run the tests, you may get the old page and the test fails.
We are either stuck with limited data (what is shown in the picture) or someone needs to "crack" how their graphql works, as it's the only way to get the rest of the data. You need to send the correct variables and extensions or the request fails, so they do check for bots.
https://caching.graphql.imdb.com/?operationName=TitleReleaseDatesPaginated&variables={data}&extensions={data}
Content-Type: application/json
Just to keep everything in one place, this also affects the Companies page. Same thing.
It cannot get release info.
I have spent the last few days testing and reviewing. For now I can get the information with the following command. Changing the IMDb ID in the data also works correctly. However, I think this approach should be investigated further. I am confused about the sha256Hash value.
curl --location -g --request GET "https://caching.graphql.imdb.com/?operationName=TitleAkasPaginated&variables={\"after\":\"\",\"const\":\"tt15576994\",\"first\":200,\"locale\":\"en-US\",\"originalTitleText\":false}&extensions={\"persistedQuery\":{\"sha256Hash\":\"180f0f5df1b03c9ee78b1f410d65928ec22e7aca590e5321fbb6a6c39b802695\",\"version\":1}}" --header "authority: caching.graphql.imdb.com" --header "accept: application/graphql+json, application/json" --header "accept-language: en-US;q=0.9,en;q=0.8" --header "cache-control: no-cache" --header "content-type: application/json" --header "origin: https://www.imdb.com" --header "pragma: no-cache" --header "referer: https://www.imdb.com/" --header "user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36" --header "x-imdb-client-name: imdb-web-next-localized" --header "x-imdb-user-country: US" --header "x-imdb-user-language: en-US"
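For anyone who wants to build that URL in code rather than by hand, here is a minimal Python sketch of the same GET request. The function names and the `ENDPOINT` / `AKAS_HASH` constants are my own for illustration; the hash and variables are copied from the curl command above, and the request headers from that command are not included here.

```python
import json
from urllib.parse import urlencode

ENDPOINT = "https://caching.graphql.imdb.com/"
# Hash copied from the curl command above; it may stop working if IMDb changes the query.
AKAS_HASH = "180f0f5df1b03c9ee78b1f410d65928ec22e7aca590e5321fbb6a6c39b802695"


def build_persisted_query_url(operation_name, variables, sha256_hash):
    """Return a GET URL carrying operationName, variables and extensions."""
    extensions = {"persistedQuery": {"sha256Hash": sha256_hash, "version": 1}}
    params = {
        "operationName": operation_name,
        # Compact JSON, URL-encoded by urlencode below.
        "variables": json.dumps(variables, separators=(",", ":")),
        "extensions": json.dumps(extensions, separators=(",", ":")),
    }
    return ENDPOINT + "?" + urlencode(params)


def akas_url(imdb_id, first=200):
    """Build the TitleAkasPaginated URL for one title (hypothetical helper)."""
    variables = {
        "after": "",
        "const": imdb_id,
        "first": first,
        "locale": "en-US",
        "originalTitleText": False,
    }
    return build_persisted_query_url("TitleAkasPaginated", variables, AKAS_HASH)
```

Calling `akas_url("tt15576994")` gives a URL with the same three query parameters as the curl command; only the `const` value changes per title.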
https://www.apollographql.com/docs/apollo-server/performance/apq/ Presumably that query will work using the hash until one day it doesn't any more.
I had a look at the graphql endpoint a while ago, and it was promising but didn't have all the information. Still, getting most of the info from an endpoint that returns json is nicer than html parsing https://github.com/tboothman/imdbphp/issues/221#issuecomment-873419219
Knew it was coming, but I'm no longer receiving any company data, at all. Assuming the new page has rolled out completely, and though company info is the only thing I am affected by, I would assume it affects releases and akas and whatever else. Used to work after a refresh or two, but now, nothing.
Could someone please fix it?
Here is what I know, somebody can figure out the rest.
On the releaseinfo page there is a .js file named releaseinfo-{hash}.js. Relevant pieces for TitleAkasPaginated (sha256Hash 180f0f5df1b03c9ee78b1f410d65928ec22e7aca590e5321fbb6a6c39b802695):
function y() {
    var n = (0, s.Z)(["\n query TitleAkasPaginated($const: ID!, $first: Int!, $after: ID) {\n title(id: $const) {\n akas(\n first: $first\n after: $after\n sort: { by: COUNTRY, order: ASC }\n ) {\n ...AkaItem\n }\n }\n }\n ", "\n"]);
    return y = function() {
        return n
    }, n
}
function B(e, t) {
    var n = k(e),
        r = e.context.url;
    if (!n || !t) return r;
    var i = [];
    t.operationName && i.push("operationName=" + encodeURIComponent(t.operationName)),
        t.query && i.push("query=" + encodeURIComponent(t.query.replace(/#[^\n\r]+/g, " ").trim())),
        t.variables && i.push("variables=" + encodeURIComponent(N(t.variables))),
        t.extensions && i.push("extensions=" + encodeURIComponent(N(t.extensions)));
    var o = r + "?" + i.join("&");
    return o.length > 2047 ? (e.context.preferGetMethod = !1, r) : o
}
I guess it's just easier to store the sha256 hash than to download this file, parse it, and generate a hash, just to send a cached request. If they change the query, the hash will no longer match anyway.
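A rough Python sketch of the "store the hash statically" idea, with a check for the moment it stops matching. The error message `PersistedQueryNotFound` and code `PERSISTED_QUERY_NOT_FOUND` are what Apollo's persisted-query convention uses; that IMDb's endpoint answers the same way is an assumption I have not verified. `PERSISTED_HASHES` and `persisted_query_missing` are hypothetical names.

```python
# Statically stored hashes, keyed by operation name.
# Only this one appears in the thread; others would have to be collected by hand.
PERSISTED_HASHES = {
    "TitleAkasPaginated": "180f0f5df1b03c9ee78b1f410d65928ec22e7aca590e5321fbb6a6c39b802695",
}


def persisted_query_missing(response_body):
    """Return True if a decoded JSON response reports an unknown persisted-query hash.

    Assumes Apollo-style errors; IMDb's actual error shape may differ.
    """
    for error in response_body.get("errors") or []:
        code = (error.get("extensions") or {}).get("code")
        if code == "PERSISTED_QUERY_NOT_FOUND":
            return True
        if error.get("message") == "PersistedQueryNotFound":
            return True
    return False
```

If the check ever returns True for a stored hash, that would be the signal that the query changed and the hash needs to be looked up again.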
With all the changes made to IMDb, maybe it's time to just close down shop and migrate to TMDb.
Personally, I'd love to see someone crack this problem and get a fix in so we can continue with IMDb. I like TMDb, but they don't hold a candle to IMDb in terms of the data that they offer.
php-tmdb/api is the most up-to-date PHP library I know of, and I started looking at it before I did the last round of fixes to imdbphp. There are some things not present on TMDb and vice versa. I use akas and parental guidance (exclusive) in my application.
The data needed to find the hash is all there, but I haven't found the obvious bit where they actually do the computing. GenerateHash is there, but I haven't seen the actual conversion. I have tried passing the whole query into sha256 without luck, so I'm missing some data (or need to remove some).
But if we rely on calculating the sha256 ourselves, we need to grab those queries every time to have a valid hash. Or store them statically.
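For reference, in Apollo's automatic persisted queries the hash is the SHA-256 hex digest of the exact query string, so a minimal Python sketch looks like the following. Whether IMDb hashes its queries exactly this way is an assumption. Note also that the template in y() has two string parts, so something is interpolated between them (my guess: the AkaItem fragment definition), and that text would have to be part of the hashed string. `apq_hash` is a hypothetical name.

```python
import hashlib


def apq_hash(query_text):
    """SHA-256 hex digest of the exact query string, as in Apollo's APQ scheme."""
    return hashlib.sha256(query_text.encode("utf-8")).hexdigest()


# Any byte that differs (whitespace, a missing fragment) gives a different hash.
query = 'query A { title(id: "tt0120737") { id } }'
same_hash = apq_hash(query) == apq_hash(query)
changed_by_newline = apq_hash(query) != apq_hash(query + "\n")
```

That sensitivity to every byte could explain why hashing only the visible part of the query doesn't reproduce the stored value.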
First we would do a normal query to the website (or Graph?) and later multiple queries to get all the extra data, so a lot of rewrites need to happen.
We could go with the limited-data approach like #292, but I would rather remove that method from my application instead. All or nothing.
It's a bit tedious to explore the schema, but it's very possible to get this data from the graphql api.
Here's the schema for the Title object:
Results of fetching the releasedates:
Here's the query to fetch information about a GraphQL type:
query Type($type: String!) {
  __type(name: $type) {
    ...FullType
  }
}
fragment FullType on __Type {
  kind
  name
  description
  fields(includeDeprecated: true) {
    name
    description
    args {
      ...InputValue
    }
    type {
      ...TypeRef
    }
    isDeprecated
    deprecationReason
  }
  inputFields {
    ...InputValue
  }
  interfaces {
    ...TypeRef
  }
  enumValues(includeDeprecated: true) {
    name
    description
    isDeprecated
    deprecationReason
  }
  possibleTypes {
    ...TypeRef
  }
}
fragment InputValue on __InputValue {
  name
  description
  type { ...TypeRef }
  defaultValue
}
fragment TypeRef on __Type {
  kind
  name
  ofType {
    kind
    name
    ofType {
      kind
      name
      ofType {
        kind
        name
        ofType {
          kind
          name
          ofType {
            kind
            name
            ofType {
              kind
              name
              ofType {
                kind
                name
              }
            }
          }
        }
      }
    }
  }
}
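The nested ofType chain from the TypeRef fragment is awkward to read in raw JSON. Here is a small Python sketch that turns one such object into ordinary GraphQL type notation. It relies only on the standard introspection kinds NON_NULL and LIST; `render_type_ref` is a hypothetical helper and the sample value is made up, not taken from IMDb's schema.

```python
def render_type_ref(ref):
    """Turn a nested TypeRef dict (kind / name / ofType) into GraphQL type notation."""
    kind = ref["kind"]
    inner = ref.get("ofType")
    if kind == "NON_NULL":
        # Non-null wrapper: render the wrapped type and append "!".
        return render_type_ref(inner) + "!"
    if kind == "LIST":
        # List wrapper: render the wrapped type inside brackets.
        return "[" + render_type_ref(inner) + "]"
    # Named type (SCALAR, OBJECT, ENUM, ...): the chain ends here.
    return ref["name"]


# Made-up example: a non-null list of non-null strings.
sample_ref = {
    "kind": "NON_NULL",
    "name": None,
    "ofType": {
        "kind": "LIST",
        "name": None,
        "ofType": {
            "kind": "NON_NULL",
            "name": None,
            "ofType": {"kind": "SCALAR", "name": "String", "ofType": None},
        },
    },
}
rendered = render_type_ref(sample_ref)
```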
Query to fetch the releaseDates:
{
  title(id:"tt0120737") {
    releaseDates(first: 9999) {
      edges {
        node {
          country {
            id
            text
          }
          day
          month
          year
          attributes {
            id
            text
          }
        }
      }
    }
  }
}
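Since a GraphQL response mirrors the shape of the query, the result of the query above can be flattened with a short loop. A Python sketch, assuming the decoded JSON body has the usual top-level "data" key; `flatten_release_dates` is a hypothetical name and the sample values are made up for illustration, not real API output.

```python
def flatten_release_dates(response_body):
    """Flatten data.title.releaseDates.edges[].node into a list of plain dicts."""
    title = (response_body.get("data") or {}).get("title") or {}
    edges = (title.get("releaseDates") or {}).get("edges") or []
    rows = []
    for edge in edges:
        node = edge["node"]
        country = node.get("country") or {}
        rows.append({
            "country": country.get("text"),
            "country_id": country.get("id"),
            "day": node.get("day"),
            "month": node.get("month"),
            "year": node.get("year"),
            "attributes": [a.get("text") for a in node.get("attributes") or []],
        })
    return rows


# Made-up response in the shape the query asks for.
sample_response = {
    "data": {"title": {"releaseDates": {"edges": [
        {"node": {
            "country": {"id": "US", "text": "United States"},
            "day": 19, "month": 12, "year": 2001,
            "attributes": [{"id": "premiere", "text": "premiere"}],
        }},
        {"node": {"country": None, "day": None, "month": None,
                  "year": 2002, "attributes": None}},
    ]}}}
}
rows = flatten_release_dates(sample_response)
```

Missing or null fields come back as None (or an empty list for attributes), so the caller does not have to guard every lookup.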
Imdb have blocked the __schema {} request that lets you fetch everything about the endpoint and makes the documentation and query helpers work for graphiql. I'm going to see if I can make an endpoint/proxy that'll let you get the full schema so graphiql can help you type out queries. It's pretty slow work fetching the schema for each subobject and typing out the query without any editor support.
@tboothman Pardon my ignorance, but that API has no public entrance, so how can you get any data from it? Are you using your personal account to figure this out? This is important to know for any user of this library, so I'm not asking just for myself.
It is a public api. It's what the frontend of the website is using to fetch data to render the page. For example, if you go to https://www.imdb.com/title/tt0120737/releaseinfo and press '50 more' under release date it does a graphql query to fetch some more data (which is where I saw the response shape and guessed the request shape for the query I put above)
I've made a proxy that makes graphiql work locally so you can browse around the imdb API. It's pretty incredible how many types they've got (~1000). I imagine you can fetch a good portion of all the data from there now.
https://github.com/tboothman/imdbphp/commit/4b906af775232f91a1882393a8af6ae3a282c1e1
Wrote a bit of code to get releaseInfo using graphql. It just has a big string for the query in the code, which works fine but basically relies on you writing the query out using graphiql. https://github.com/tboothman/imdbphp/blob/graphql/src/Imdb/Title.php#L2591 Also tried out a library that will build a load of classes for you so you can 'safely' query the schema because it knows about all the objects available. It works well enough but it's a bit verbose and harder to read than the raw graphql query. It doesn't tell you what each field is either, even though that's available in the schema (graphiql shows it) https://github.com/tboothman/imdbphp/blob/graphql/graphql/q.php#L15
Bit of a grind but nearly there. Still two methods to fix, but I'm not particularly inclined to look at soundtracks and I'm not entirely sure what the results of the methods that use parse_extcontent were. https://github.com/tboothman/imdbphp/pull/293
@tboothman Well, I think the users here will appreciate all your work, it looks impressive! But in my view it will make this library more complicated and harder to understand. And is it still necessary to install a GraphiQL browser extension? If so, this will be one more dependency for this library. Just my thoughts though.
Can't say that I agree. There's no new dependencies other than a new class to do graphql requests and cache them to disk.
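To give an idea of how small such a class can be, here is a Python sketch of the idea (run a graphql request, cache the result on disk). This is not the PHP class from the pull request, only an illustration. `CachedGraphQL` and the injected `fetch` callable are hypothetical, and the cache key scheme (SHA-256 of the query plus variables) is my own choice.

```python
import hashlib
import json
import os


class CachedGraphQL:
    """Run GraphQL queries through `fetch` and cache each result as a JSON file."""

    def __init__(self, cache_dir, fetch):
        self.cache_dir = cache_dir
        self.fetch = fetch  # callable(query, variables) -> decoded JSON response
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, query, variables):
        # Same query + variables always maps to the same file name.
        key = json.dumps({"query": query, "variables": variables}, sort_keys=True)
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, digest + ".json")

    def query(self, query, variables=None):
        variables = variables or {}
        path = self._path(query, variables)
        if os.path.exists(path):
            with open(path, encoding="utf-8") as fh:
                return json.load(fh)
        result = self.fetch(query, variables)
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(result, fh)
        return result
```

The network call is passed in as `fetch` so the caching part stays independent of whichever HTTP client ends up being used; this sketch has no cache expiry.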
Calling a graphql api is so much simpler than parsing html. Take https://github.com/tboothman/imdbphp/pull/293/commits/e36653feb84c9a2b5dedcafde4a4ffc52d62f558#diff-2a4bb245c110c1fe94bfeb5c0ae91d4435d947b30473ad5acabd1366a77b1776L578 for example. movie_recommendations was an indecipherable series of xpath queries and regexes; now it's some nested field names and a loop to turn it into the right results shape.
Even if using graphql was a bad idea it's impossible to get some of the data off the page any more. All of the pages that broke changed to loading the first 5 or so items and having a load more button. e.g. https://www.imdb.com/title/tt0133093/movieconnections/ https://www.imdb.com/title/tt0133093/externalsites/
Wow - that's quite a change in the amount of code that is needed. Need to read up on GraphQL some day, as I have never used it before.
Hello. Today I noticed some problems that stop this library from working properly. After checking, I found that the release information page now uses a new format. It differs from the previous structure, which causes the library to fail.
pic:
html source: html source.txt
I think we should adjust to the new structure soon