quantified-uncertainty / metaforecast

Fetch forecasts from prediction markets/forecasting platforms to make them searchable. Integrate these forecasts into other services.
https://metaforecast.org/
MIT License

Incremental fetchers #91

Open berekuk opened 2 years ago

berekuk commented 2 years ago

This is a draft for #35 and #36, and it's not ready yet, but the changes are significant and I want to braindump my thoughts on it.

So, currently all platform modules fetch all questions and then store a huge array in the DB (and then on Algolia).

As I mentioned in #36, I'd like to change that.


Sidenote: I spent several hours today fighting the new metaculus fetcher, which kept failing for one reason or another (mostly because of excessive validation, but also once because one question was on the frontpage and ON DELETE was set to restrict instead of cascade). Every time I had to wait until it got past the last point of failure, only to have it fail again further down the road.

I really don't like having such a long feedback loop before getting initial results; the current architecture also gets in the way when I want to get some questions into my dev DB. Though I've recently implemented the npm run cli metaculus -- --id=12345 command, what I really want is to say "fetch some stuff for this platform" without waiting several hours for the script to finish.

Of course, there are also other reasons why I'm doing this: getting us closer to real-time capabilities, etc.


The basic idea is: we crawl the graph of urls; there are some leaf nodes (question page urls, graphql endpoints with question data, or whatever) and some intermediate nodes which allow us to discover leaf nodes, e.g. /api2/questions/ on metaculus, which doesn't give us full data but does give us urls for other api pages with full data.

To store the progress we can use a table (Robot) with jobs as rows; each job includes an url, a json context, and some metadata about when the job was created and whether it was completed. Then we can encapsulate the common pattern of "keep fetching jobs while there's still stuff to process" behind a common API.
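For concreteness, a row in such a jobs table might look like the sketch below. All names here (`JobRow`, `JobContext`, `shouldSchedule`) are hypothetical, not part of the current schema; `shouldSchedule` is one possible reading of the maxAge option used in the draft further down.

```typescript
// Sketch of a row in the proposed jobs table ("Robot").
// All names here are hypothetical, not the actual schema.
type JobContext =
  | { type: "apiIndex" }   // intermediate node: a paginated API index
  | { type: "apiSingle" }; // leaf node: a single question's data

interface JobRow {
  url: string;
  context: JobContext;
  createdAt: Date;          // when the job was scheduled
  completedAt: Date | null; // null while the job is still pending
}

// One possible reading of the maxAge semantics: only schedule a url
// if it hasn't been fetched recently and isn't already pending.
function shouldSchedule(
  previous: JobRow | undefined,
  maxAgeSeconds: number,
  now: Date = new Date()
): boolean {
  if (!previous) return true;              // never seen this url before
  if (!previous.completedAt) return false; // already pending; avoid duplicates
  const ageSeconds = (now.getTime() - previous.completedAt.getTime()) / 1000;
  return ageSeconds >= maxAgeSeconds;
}
```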

Here's a draft which uses this approach:

export const myPlatform = {
    ...,
    async fetcher({ robot, storage }) {
        await robot.schedule({
            url: 'https://www.metaculus.com/api2/questions/',
            context: {
                type: 'apiIndex',
            },
            maxAge: 3600 * 24, // don't schedule if the previous fetch happened recently
        });

        for (let job; (job = await robot.nextJob()); ) {
            const result = await job.fetch();
            if (job.context.type === 'apiIndex') {
                const data = validate(result);
                for (const tmp of data) {
                    await robot.schedule({
                        url: tmp.url,
                        context: {
                            type: 'apiSingle',
                        },
                    });
                }
                if (data.next) {
                    await robot.schedule({
                        url: data.next,
                        context: {
                            type: 'apiIndex',
                        },
                    });
                }
            } else if (job.context.type === 'apiSingle') {
                const validated = validate(result);
                const question = resultToQuestion(validated);
                await storage.save(question);

                // complete the job and create a new one; this is excessive in this
                // case since we crawl the index api pages anyway, but can be helpful in other cases
                await job.done({ repeatAfter: 86400 });
            } else {
                throw new Error("Unknown job type");
            }
        }
    }
};
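To make the semantics the draft assumes concrete, here's a minimal in-memory sketch of the robot API it relies on (`schedule` / `nextJob` / `job.fetch` / `job.done`). All of this is illustrative: the real version would persist jobs in the DB table described above, and `fetch` is stubbed out here.

```typescript
// Minimal in-memory sketch of the robot API used by the draft fetcher.
// Illustrative only: the real version would back this with a DB table.
type Context = Record<string, unknown>;

interface ScheduleOptions {
  url: string;
  context: Context;
  maxAge?: number; // seconds; skip if the same url completed recently
}

interface Job {
  url: string;
  context: Context;
  fetch(): Promise<unknown>; // would perform an HTTP request in reality
  done(opts?: { repeatAfter?: number }): Promise<void>;
}

class MemoryRobot {
  private queue: { url: string; context: Context }[] = [];
  private completedAt = new Map<string, number>(); // url -> epoch ms

  async schedule({ url, context, maxAge }: ScheduleOptions): Promise<void> {
    const last = this.completedAt.get(url);
    if (maxAge !== undefined && last !== undefined &&
        Date.now() - last < maxAge * 1000) {
      return; // fetched recently enough; skip
    }
    this.queue.push({ url, context });
  }

  async nextJob(): Promise<Job | null> {
    const item = this.queue.shift();
    if (!item) return null;
    return {
      url: item.url,
      context: item.context,
      fetch: async () => ({}), // stub; real version performs the request
      done: async () => {
        this.completedAt.set(item.url, Date.now());
      },
    };
  }
}
```

With this interface, the `for (let job; (job = await robot.nextJob()); )` loop drains the queue, and jobs scheduled while processing (e.g. leaf urls discovered on index pages) are picked up in the same run.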

Notes on this example:

In the future, we could also:

Stuff I'm still figuring out:

berekuk commented 2 years ago

Note on deletions.

Several possible solutions to consider:

  1. check every question once in a while to confirm that it's still alive; maybe have a separate function in the Platform API for that, checkQuestion
  2. have a separate function, listAllQuestions, which returns the list of all questions that shouldn't be deleted, and delete everything else; this is problematic because for some platforms this function could be as expensive as crawling everything
  3. delete all questions which weren't updated recently; this is easy but too dangerous: we might accidentally delete too much stuff

I lean towards (1), though I don't like that it'll require a significant amount of new code.
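Option (1) could start out quite small. The sketch below is an assumption about what `checkQuestion` might look like; the injected `head` function is hypothetical, since each platform would implement its own liveness check (HTTP status, API response, etc.).

```typescript
// Sketch of option (1): periodically re-check that a stored question
// still exists on its platform. The injected `head` function is an
// assumption; real platforms would implement their own check.
type CheckResult = "alive" | "gone";

async function checkQuestion(
  questionUrl: string,
  head: (url: string) => Promise<number> // returns an HTTP status code
): Promise<CheckResult> {
  const status = await head(questionUrl);
  // 404/410 means the question was deleted upstream; anything else
  // (including transient 5xx) is treated as alive to stay on the safe side.
  return status === 404 || status === 410 ? "gone" : "alive";
}
```

A periodic job could then delete (or soft-delete) questions that come back "gone" several checks in a row, rather than on the first failure.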

berekuk commented 2 years ago

@NunoSempere I'd appreciate any feedback you have on this. I might be missing some corner cases, since I still haven't read the code for all the platforms carefully.

NunoSempere commented 2 years ago

Ok, looking at this, I don't understand what type of pattern the following is:

async fetcher({ robot, storage }) {
  ...
}

Should this be something like: async function fetcher?

No comments for now while I understand what the code is doing.

berekuk commented 2 years ago

It's a shorthand:

const obj = {
  foo: async () => {
  },
};

is the same as

const obj = {
  async foo() {
  },
};

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Functions/Method_definitions
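A runnable comparison of the two forms (the only behavioral difference is `this` binding, which the fetcher doesn't use, so they're interchangeable here):

```typescript
// Both forms define an async function-valued property. Method shorthand
// binds `this` to the object; an arrow function doesn't. Since no `this`
// is involved here, the two behave identically.
const withArrow = {
  foo: async () => "hello",
};

const withShorthand = {
  async foo() {
    return "hello";
  },
};
```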

NunoSempere commented 2 years ago

Ok, so looking through this I think I would have tended to do something much hackier, like saving the page for apis that implement pagination. Overall not sure how to judge this though; the approach is a bit more complicated and, as you mention, it will take some tweaks to make the robot conform to the different APIs of all the platforms.

NunoSempere commented 2 years ago

On question deletion, note that we do want to keep questions after they resolve, even if we don't show them in the frontpage.

berekuk commented 2 years ago

I would have tended to do something much hackier, like saving the page for apis that implement pagination

That would help with the interruptible metaculus fetcher, but the main reason for this PR is the future near-real-time capabilities, which are impossible to get with the current "once in 24 hours" approach.

On question deletion, note that we do want to keep questions after they resolve, even if we don't show them in the frontpage.

Right. The scenarios I can think of where deletion is necessary are:

NunoSempere commented 2 years ago

Makes sense