quantified-uncertainty / metaforecast

Fetch forecasts from prediction markets/forecasting platforms to make them searchable. Integrate these forecasts into other services.
https://metaforecast.org/
MIT License

Incremental fetchers #91

Open berekuk opened 2 years ago

berekuk commented 2 years ago

This is a draft for #35 and #36, and it's not ready yet, but the changes are significant and I want to braindump my thoughts on it.

So, currently all platform modules fetch all questions and then store a huge array in the DB (and then on Algolia).

As I mentioned in #36, I'd like to change that.


Sidenote: I spent several hours today fighting the new metaculus fetcher, which kept failing for one reason or another (mostly because of excessive validation, but also once because one question was on the frontpage and ON DELETE was set to restrict instead of cascade). Every time I had to wait until it got past the last point of failure, only to have it fail again further down the road.

I really don't like having such a long feedback loop before getting initial results; the current architecture also gets in the way when I want to get some questions into my dev DB. Though I've recently implemented the npm run cli metaculus -- --id=12345 command, what I really want is to say "fetch some stuff for this platform" without waiting several hours for the script to finish.

Of course, there are also other reasons why I'm doing this: getting us closer to real-time capabilities, etc.


The basic idea is: we crawl the graph of urls; there are some leaf nodes (question page urls, graphql endpoints with question data, or whatever) and some intermediate nodes which allow us to discover leaf nodes, e.g. /api2/questions/ on metaculus, which doesn't give us full data but does give us urls for other api pages with full data.

To store the progress we can use a table (Robot) with jobs as rows; each job includes an url, a json context, and some metadata about when the job was created and whether it was completed. Then we can encapsulate the common pattern of "keep fetching jobs while there's still stuff to process" behind a common API.
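For concreteness, a row in such a jobs table might look like the sketch below. All names here (`JobRow`, `JobContext`, `shouldSchedule`) are hypothetical, not part of the current schema; `shouldSchedule` is one possible reading of the maxAge option used in the draft further down.

```typescript
// Sketch of a row in the proposed jobs table ("Robot").
// All names here are hypothetical, not the actual schema.
type JobContext =
  | { type: "apiIndex" }   // intermediate node: a paginated API index
  | { type: "apiSingle" }; // leaf node: a single question's data

interface JobRow {
  url: string;
  context: JobContext;
  createdAt: Date;          // when the job was scheduled
  completedAt: Date | null; // null while the job is still pending
}

// One possible reading of the maxAge semantics: only schedule a url
// if it hasn't been fetched recently and isn't already pending.
function shouldSchedule(
  previous: JobRow | undefined,
  maxAgeSeconds: number,
  now: Date = new Date()
): boolean {
  if (!previous) return true;              // never seen this url before
  if (!previous.completedAt) return false; // already pending; avoid duplicates
  const ageSeconds = (now.getTime() - previous.completedAt.getTime()) / 1000;
  return ageSeconds >= maxAgeSeconds;
}
```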

Here's a draft which uses this approach:

export const myPlatform = {
    ...,
    async fetcher({ robot, storage }) {
        await robot.schedule({
            url: 'https://www.metaculus.com/api2/questions/',
            context: {
                type: 'apiIndex',
            },
            maxAge: 3600 * 24, // don't schedule if the previous fetch happened recently
        });

        for (let job; (job = await robot.nextJob()); ) {
            const result = await job.fetch();
            if (job.context.type === 'apiIndex') {
                const data = validate(result);
                for (const tmp of data) {
                    await robot.schedule({
                        url: tmp.url,
                        context: {
                            type: 'apiSingle',
                        },
                    });
                }
                if (data.next) {
                    await robot.schedule({
                        url: data.next,
                        context: {
                            type: 'apiIndex',
                        },
                    });
                }
            } else if (job.context.type === 'apiSingle') {
                const validated = validate(result);
                const question = resultToQuestion(validated);
                await storage.save(question);

                // complete the job and create a new one; this is excessive in this
                // case since we crawl the index api pages anyway, but can be helpful in other cases
                await job.done({ repeatAfter: 86400 });
            } else {
                throw new Error("Unknown job type");
            }
        }
    }
};
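To make the semantics the draft assumes concrete, here's a minimal in-memory sketch of the robot API it relies on (`schedule` / `nextJob` / `job.fetch` / `job.done`). All of this is illustrative: the real version would persist jobs in the DB table described above, and `fetch` is stubbed out here.

```typescript
// Minimal in-memory sketch of the robot API used by the draft fetcher.
// Illustrative only: the real version would back this with a DB table.
type Context = Record<string, unknown>;

interface ScheduleOptions {
  url: string;
  context: Context;
  maxAge?: number; // seconds; skip if the same url completed recently
}

interface Job {
  url: string;
  context: Context;
  fetch(): Promise<unknown>; // would perform an HTTP request in reality
  done(opts?: { repeatAfter?: number }): Promise<void>;
}

class MemoryRobot {
  private queue: { url: string; context: Context }[] = [];
  private completedAt = new Map<string, number>(); // url -> epoch ms

  async schedule({ url, context, maxAge }: ScheduleOptions): Promise<void> {
    const last = this.completedAt.get(url);
    if (maxAge !== undefined && last !== undefined &&
        Date.now() - last < maxAge * 1000) {
      return; // fetched recently enough; skip
    }
    this.queue.push({ url, context });
  }

  async nextJob(): Promise<Job | null> {
    const item = this.queue.shift();
    if (!item) return null;
    return {
      url: item.url,
      context: item.context,
      fetch: async () => ({}), // stub; real version performs the request
      done: async () => {
        this.completedAt.set(item.url, Date.now());
      },
    };
  }
}
```

With this interface, the `for (let job; (job = await robot.nextJob()); )` loop drains the queue, and jobs scheduled while processing (e.g. leaf urls discovered on index pages) are picked up in the same run.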

Notes on this example:

In the future, we could also:

Stuff I'm still figuring out:

berekuk commented 2 years ago

Note on deletions.

Several possible solutions to consider:

  1. check every question once in a while to confirm that it's still alive; maybe have a separate function in the Platform API for that, checkQuestion
  2. have a separate function, listAllQuestions, which returns the list of all questions that shouldn't be deleted, and delete everything else; this is problematic because for some platforms this function could be as expensive as crawling everything
  3. delete all questions which weren't updated recently; this is easy but too dangerous: we might accidentally delete too much stuff

I lean towards (1), though I don't like that it'll require a significant amount of new code.
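Option (1) could start out quite small. The sketch below is an assumption about what `checkQuestion` might look like; the injected `head` function is hypothetical, since each platform would implement its own liveness check (HTTP status, API response, etc.).

```typescript
// Sketch of option (1): periodically re-check that a stored question
// still exists on its platform. The injected `head` function is an
// assumption; real platforms would implement their own check.
type CheckResult = "alive" | "gone";

async function checkQuestion(
  questionUrl: string,
  head: (url: string) => Promise<number> // returns an HTTP status code
): Promise<CheckResult> {
  const status = await head(questionUrl);
  // 404/410 means the question was deleted upstream; anything else
  // (including transient 5xx) is treated as alive to stay on the safe side.
  return status === 404 || status === 410 ? "gone" : "alive";
}
```

A periodic job could then delete (or soft-delete) questions that come back "gone" several checks in a row, rather than on the first failure.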

berekuk commented 2 years ago

@NunoSempere I'd appreciate any feedback you have on this. I might be missing some corner cases, since I still haven't read the code for all the platforms carefully.

NunoSempere commented 2 years ago

Ok, looking at this, I don't understand what type of pattern the following is:

async fetcher({ robot, storage }) {
  ...
}

Should this be something like: async function fetcher?

No comments for now while I understand what the code is doing.

berekuk commented 2 years ago

It's a shorthand:

const obj = {
  foo: async () => {
  },
};

is the same as

const obj = {
  async foo() {
  },
};

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Functions/Method_definitions
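A runnable comparison of the two forms (the only behavioral difference is `this` binding, which the fetcher doesn't use, so they're interchangeable here):

```typescript
// Both forms define an async function-valued property. Method shorthand
// binds `this` to the object; an arrow function doesn't. Since no `this`
// is involved here, the two behave identically.
const withArrow = {
  foo: async () => "hello",
};

const withShorthand = {
  async foo() {
    return "hello";
  },
};
```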

NunoSempere commented 2 years ago

Ok, so looking through this I think I would have tended to do something much hackier, like saving the page for apis that implement pagination. Overall not sure how to judge this though; the approach is a bit more complicated and, as you mention, it will take some tweaks to make the robot conform to the different APIs of all the platforms.

NunoSempere commented 2 years ago

On question deletion, note that we do want to keep questions after they resolve, even if we don't show them in the frontpage.

berekuk commented 2 years ago

I would have tended to do something much hackier, like saving the page for apis that implement pagination

That would help with the interruptible metaculus fetcher, but the main reason for this PR is the future near-real-time capabilities, which are impossible to get with the current "once in 24 hours" approach.

On question deletion, note that we do want to keep questions after they resolve, even if we don't show them in the frontpage.

Right. The scenarios I can think of where deletion is necessary are:

NunoSempere commented 2 years ago

Makes sense