Metaculus fetcher improvements

berekuk commented 2 years ago

- validate API responses with https://ajv.js.org/
- CLI option --id NNN for testing
- parse description from window.metacData.question (which includes the correct markdown, AFAICT)
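
A minimal sketch of the response-validation idea from the first item. The thread proposes Ajv; to keep the example dependency-free, it swaps in a hand-written TypeScript type guard instead. The `ApiQuestion` shape and the helper names are assumptions for illustration, covering only fields mentioned in this thread, not the full Metaculus schema.

```typescript
// Hypothetical, partial shape of a question returned by the Metaculus API.
// Only fields mentioned in this thread are listed; the real schema is larger.
interface ApiQuestion {
  id: number;
  title: string;
  page_url: string;
  description: string;
}

// Hand-written stand-in for an Ajv-compiled validator: returns true only if
// `data` is an object carrying every expected field with the expected type.
function isApiQuestion(data: unknown): data is ApiQuestion {
  if (typeof data !== "object" || data === null) return false;
  const q = data as Record<string, unknown>;
  return (
    typeof q["id"] === "number" &&
    typeof q["title"] === "string" &&
    typeof q["page_url"] === "string" &&
    typeof q["description"] === "string"
  );
}

// Throws on an unexpected response so that schema drift is noticed early
// instead of silently producing broken records.
function parseApiQuestion(data: unknown): ApiQuestion {
  if (!isApiQuestion(data)) {
    throw new Error("Unexpected Metaculus API response shape");
  }
  return data;
}
```

With Ajv, the type guard would presumably be replaced by a compiled JSON schema, while `parseApiQuestion` could stay the same.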

On the last point: I'd prefer to take the description from API, but Metaculus API doesn't return the description even on specific question endpoints, e.g. in https://www.metaculus.com/api2/questions/10069/ both description and description_html are empty (it's not the case for all questions, I'm not sure what causes it); @NunoSempere, do you know anything about that and have you contacted Metaculus re: fixing it?

Obtaining the description from https://www.metaculus.com/api2/questions/ would be ideal, that would speed up metaculus fetcher significantly (20x fewer calls). Maybe we can convince Metaculus to expose the description as a field there, maybe behind a flag?

NunoSempere commented 2 years ago

I reached out to someone at Metaculus about this; I'll see what he answers.

hnykda commented 2 years ago

Hey :wave: . I am happy to help.

First of all, sorry for the API being non-user-friendly. We are aware of it and have plans to improve it this year, but it hasn't gotten to the priority list yet. That's also the reason why we don't call it "public" API.

Obtaining the description from https://www.metaculus.com/api2/questions/ would be ideal, that would speed up metaculus fetcher significantly (20x fewer calls). Maybe we can convince Metaculus to expose the description as a field there, maybe behind a flag?

Yeah, that endpoint is already pretty heavy (that's why the default limit is 20), so adding additional data is not a good idea on our side atm. Is the speed an issue for you, or it "just" takes longer? Definitely possible to do something about it.

On the last point: I'd prefer to take the description from API, but Metaculus API doesn't return the description even on specific question endpoints, e.g. in https://www.metaculus.com/api2/questions/10069/ both description and description_html are empty (it's not the case for all questions, I'm not sure what causes it); @NunoSempere, do you know anything about that and have you contacted Metaculus re: fixing it?

These are valid from Metaculus POV - you hit a so-called subquestion of recently introduced question groups. The example you found has page_url https://www.metaculus.com/questions/10069/may-14/ which will redirect you to https://www.metaculus.com/questions/9866/flu-hospitalizations-for-wy/?sub-question=10069

Do you need any more documentation, i.e. would it be helpful to add some pieces/attributes/fields for metaforecast? We can provide that for sure. Otherwise the schema looks valid :+1: .

cc other Metaculus devs @rakyi @Matoo125

berekuk commented 2 years ago

Is the speed an issue for you, or it "just" takes longer?

Not an issue, I don't mind doing 1+20 requests. Overall compute load for metaculus will be the same or higher, but if it matters for you that any single request is not too heavy then I'll just keep doing it the old way.

These are valid from Metaculus POV - you hit a so-called subquestion of recently introduced question groups.

Oh, I see! What's the proper way to get the description for such subquestion then? Right now we're scraping the description from the html page, but that's too fragile.

I see that https://www.metaculus.com/api2/questions/10069/ belongs to the https://www.metaculus.com/api2/questions/9866/ group, and the description is exposed there, but there's no 9866 id on 10069 json fields.

Alternatively, a way to detect whether a question is a subquestion would work too, then we'd just skip subquestions and take all necessary data from the group's sub_questions field. (I guess I could rely on the description field being empty, but that's too indirect, and I'd prefer a different type value, or belongs_to_group field, or something like that.)

hnykda commented 2 years ago

Hey @berekuk .

Right now we're scraping the description from the html page, but that's too fragile.

This is not a good idea :smile:

Based on our analytics, you got throttled about 50k times since yesterday... You should really check for HTTP 429 and wait the necessary time before doing next requests (+ having something like an exponential backoff wouldn't be a bad idea too, we see that you are hitting the same URL e.g. 10 times in 5 seconds while you get an info about getting throttled for 2k seconds). But maybe you are not being ratelimited on something you care about? Anyway, I really don't think you should scrape HTML and we are willing to do any changes so you can use the API (but I think you should be able to get everything from it).
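
A rough sketch of the 429 handling suggested here, not the code metaforecast actually runs. It assumes the throttling information arrives as a `Retry-After` header value in seconds, which the thread does not confirm. `MinimalResponse`, `retryDelayMs`, and `requestWithBackoff` are hypothetical names, and the request and sleep functions are injected so the logic stays independent of any HTTP client.

```typescript
// Minimal view of an HTTP response. `retryAfter` is the raw Retry-After
// header value if the server sent one (assumed here to be in seconds).
interface MinimalResponse {
  status: number;
  retryAfter: string | null;
}

// Delay before the next attempt, in milliseconds. Prefers the server-provided
// wait when it parses as a non-negative number of seconds; otherwise falls
// back to exponential backoff: baseMs * 2^attempt, capped at maxMs.
// (HTTP-date values of Retry-After are not parsed and use the fallback.)
function retryDelayMs(
  attempt: number,
  retryAfter: string | null,
  baseMs: number = 1000,
  maxMs: number = 60 * 60 * 1000
): number {
  if (retryAfter !== null && retryAfter.trim() !== "") {
    const seconds = Number(retryAfter);
    if (isFinite(seconds) && seconds >= 0) return seconds * 1000;
  }
  return Math.min(baseMs * Math.pow(2, attempt), maxMs);
}

// Repeats `doRequest` while the server answers 429, sleeping between
// attempts, and gives up after `maxAttempts` throttled responses.
async function requestWithBackoff(
  doRequest: () => Promise<MinimalResponse>,
  sleep: (ms: number) => Promise<void>,
  maxAttempts: number = 5
): Promise<MinimalResponse> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const response = await doRequest();
    if (response.status !== 429) return response;
    if (attempt < maxAttempts - 1) {
      await sleep(retryDelayMs(attempt, response.retryAfter));
    }
  }
  throw new Error("Still rate-limited after " + maxAttempts + " attempts");
}
```

Honoring the server's wait time first matters in the case described above: a pure exponential schedule starting at one second would retry many times inside a 2k-second throttling window.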

Why are you scraping HTML in the first place? You can simply get the description from the API on the question's detail, no?

What's the proper way to get the description for such subquestion then?

Eh, there is "none" (and that's expected). The subquestion is fully defined (in the domain) by its group's description and the sub_question's title. So e.g. the group https://www.metaculus.com/api2/questions/9866/ has a description, while https://www.metaculus.com/api2/questions/10069/ only has a title "May 13". This is all you need.

I see that https://www.metaculus.com/api2/questions/10069/ belongs to the https://www.metaculus.com/api2/questions/9866/ group, and the description is exposed there, but there's no 9866 id on 10069 json fields. [...] Alternatively, a way to detect whether a question is a subquestion would work too, then we'd just skip subquestions and take all necessary data from the group's sub_questions field. (I guess I could rely on the description field being empty, but that's too indirect, and I'd prefer a different type value, or belongs_to_group field, or something like that.)

We added a new group: Optional[int] field, which is null if the question is not in a group ("not a subquestion") and otherwise contains the ID of the parent group. This way you can distinguish sub_questions in the listing.
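
As an illustration of how a client might use this field, here is a small sketch that drops subquestions from a listing. Only the `group` field and its null-or-parent-ID semantics come from the comment above; the `ListedQuestion` shape and the function names are assumptions made for the example.

```typescript
// Hypothetical, partial listing entry. `group` follows the description above:
// null when the question is not a subquestion, otherwise the parent group's ID.
interface ListedQuestion {
  id: number;
  title: string;
  group: number | null;
}

// A question is a subquestion exactly when it points at a parent group.
function isSubquestion(question: ListedQuestion): boolean {
  return question.group !== null;
}

// Keeps only entries that are not subquestions; subquestion data can then be
// read from the parent group's sub_questions field instead.
function withoutSubquestions(questions: ListedQuestion[]): ListedQuestion[] {
  return questions.filter((question) => !isSubquestion(question));
}
```

For the example from this thread, an entry with id 10069 and `group: 9866` would be filtered out, while the entry for 9866 itself (with `group: null`) would be kept.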

NunoSempere commented 2 years ago

You should really check for HTTP 429 and wait the necessary time before doing next requests (+ having something like an exponential backoff wouldn't be a bad idea too,

But we are doing this: https://github.com/quantified-uncertainty/metaforecast/blob/master/src/backend/platforms/metaculus.ts#L32=

we see that you are hitting the same URL e.g. 10 times in 5 seconds while you get an info about getting throttled for 2k seconds)

I'm not completely sure about this, but this could be someone else?

NunoSempere commented 2 years ago

Why are you scraping HTML in the first place?

Arg, because we were fetching questions from endpoints like https://www.metaculus.com/api2/questions/?limit=20&offset=20, which don't have question descriptions, and I wasn't aware that endpoints such as https://www.metaculus.com/api2/questions/46/, which do have question descriptions, existed. Thanks for pointing this out.
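
To make the two endpoint kinds concrete, here is a small sketch that builds the listing and detail URLs quoted above and shows where the "1+20 requests" per page comes from. The URL patterns are taken from this thread; the helper names are hypothetical.

```typescript
const API_BASE = "https://www.metaculus.com/api2/questions/";

// Listing endpoint: paginated, and (per this thread) without descriptions.
function listUrl(offset: number, limit: number = 20): string {
  return `${API_BASE}?limit=${limit}&offset=${offset}`;
}

// Detail endpoint for a single question, which does include the description.
function detailUrl(id: number): string {
  return `${API_BASE}${id}/`;
}

// All URLs needed for one page: one listing request plus one detail request
// per question ID found on that page (1 + 20 with the default limit).
function urlsForPage(offset: number, ids: number[], limit: number = 20): string[] {
  return [listUrl(offset, limit)].concat(ids.map((id) => detailUrl(id)));
}
```

For instance, `listUrl(20)` produces the listing URL quoted above, and `detailUrl(46)` produces the detail URL for question 46.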

hnykda commented 2 years ago

I'm not completely sure about this, but this could be someone else?

After a deeper analysis, we found out why it's happening (rendering the webpage via chromedriver triggers some additional JavaScript that fires some additional API calls that are needed to render the site). So it's you, but nothing you should worry about.

I wasn't aware that endpoints such as https://www.metaculus.com/api2/questions/46/, which do have question descriptions, existed. Thanks for pointing this out.

Yeah, no surprise, given there is no documentation :see_no_evil: . Couldn't you use that instead, please? It will be probably much safer (stable), simpler and faster. (and would solve the above as well)

hnykda commented 2 years ago

(I thought you are using API for detail as well, as I saw https://github.com/quantified-uncertainty/metaforecast/pull/84/files#diff-26fd2c1bb11d40151d392ec7720000dc3f1fbc6b66a314f8a7818c10eb77a328R224)

berekuk commented 2 years ago

I thought you are using API for detail as well

Yes, but that's new in this PR which is not merged yet, and in this PR I still had to fetch html due to the subquestion issue; now that you added a group field we can avoid html scraping altogether. Thanks!

berekuk commented 2 years ago

Updates in the last commit:

- take description from API
- ignore subquestions (questions with non-empty group)
- split question groups into multiple metaforecast questions
  - description is identical
  - urls are generated with ?sub-question=ID param
  - titles are set to group title (subquestion title)
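
The splitting rule in this list can be sketched as follows. This is an illustrative reconstruction, not the code from the commit: the `QuestionGroup` and `SplitQuestion` shapes and the `splitGroup` name are assumptions, and the group is assumed to carry its full page URL.

```typescript
// Hypothetical, partial shapes based on fields mentioned in this thread.
interface SubQuestion {
  id: number;
  title: string;
}

interface QuestionGroup {
  title: string;
  url: string; // full page URL of the group (an assumption)
  description: string;
  sub_questions: SubQuestion[];
}

// Hypothetical output record; the real metaforecast question type has more fields.
interface SplitQuestion {
  title: string;
  url: string;
  description: string;
}

// Produces one output question per subquestion: the description is copied
// from the group, the URL gets a ?sub-question=ID parameter, and the title
// is formatted as "group title (subquestion title)".
function splitGroup(group: QuestionGroup): SplitQuestion[] {
  return group.sub_questions.map((sub) => ({
    title: `${group.title} (${sub.title})`,
    url: `${group.url}?sub-question=${sub.id}`,
    description: group.description,
  }));
}
```

For the group at https://www.metaculus.com/questions/9866/flu-hospitalizations-for-wy/ and subquestion 10069, this yields the same `?sub-question=10069` URL that the redirect mentioned earlier in the thread points to.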

Note that we could also join binary subquestions in a single categorical question on metaforecast, but my code currently doesn't do this.

The problem is that question groups are not always mutually exclusive and they are too flexible by design, so this could cause us problems in the future, e.g. resolution times wouldn't be a coherent concept since metaculus subquestions can have different resolve times.

Though I don't like that some search queries will return a lot of similar results. @NunoSempere, what do you think?

For reference, here's an example of a large question group with mutually exclusive outcomes: https://www.metaculus.com/questions/11005/winner-of-2022-fifa-world-cup/

And here's an example of a question group with non-exclusive outcomes: https://www.metaculus.com/questions/9861/question-groups-all-question-types-supported/, image under "Choices, choices" heading (the question for it doesn't exist or isn't published).

hnykda commented 2 years ago

Yep, that sounds like a good solution to me, thanks! :+1: Has that been deployed?

I agree on the mapping of question groups to multiple-choice - question groups are indeed more generic and you don't have a way of saying which one is which at the moment (and as you can see, there are logical inconsistencies already e.g. on that FIFA question).

berekuk commented 2 years ago

Has that been deployed?

Now it has been. Next fetch cycle should happen in the next 24 hours.

NunoSempere commented 2 years ago

Though I don't like that some search queries will return a lot of similar results

Yeah, I don't like this either. Ideally we would group together the questions that should be together, have a type of question that doesn't sum up to 100%, etc. But it's ok for now.

quantified-uncertainty / metaforecast

Metaculus fetcher improvements #84