lucidworks / connectors-sdk-resources

Fusion Connector SDK documentation, examples and related resources

Error management on random-connector-incremental #60

Open matteogrolla opened 4 years ago

matteogrolla commented 4 years ago

Hi, I'm Matteo Grolla from Sourcesense, a Lucidworks partner in Italy. I'm developing a custom connector for a customer, but I have questions about error management; Robert Lucarini suggested I post my questions here. Let's use random-content-incremental for our discussion and focus on the fetch method. What I've noticed is:

roblucar commented 4 years ago

Hi @matteogrolla, thank you for posting your questions.

  1. The crawlDB will manage the state of the Job runs. In particular, the BlockId identifies a series of one or more Jobs, and the lifetime of a BlockId spans from the start of a crawl to the crawl's completion. When a Job starts and the previous Job did not complete (failed or stopped), the previous Job's BlockId is reused. The same BlockId will be reused until the crawl successfully completes. BlockIds are used to quickly identify items in the crawlDB which may not have been fully processed (complete), which addresses the restart question. Unfortunately, there is currently no way to programmatically stop a crawl job the way a user or external process can through the Fusion UI or API.

  2. Yes, the document will be marked as failed and retried on the successive crawl job.
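
For reference, the per-document failure path looks roughly like the sketch below. This paraphrases the fetcher examples in this repository; the exact builder methods (newError, fields) may differ slightly between SDK versions, and fetchBodyFromSource is a hypothetical source call, so treat it as a sketch rather than exact code:

```java
// Sketch only: this would live inside a ContentFetcher implementation like the
// examples in this repo; fetchBodyFromSource is a hypothetical source call.
@Override
public FetchResult fetch(FetchContext fetchContext) {
  String id = fetchContext.getFetchInput().getId();
  try {
    String body = fetchBodyFromSource(id);
    fetchContext.newDocument(id)
        .fields(f -> f.setString("body_t", body))   // field-setter name paraphrased
        .emit();
  } catch (Exception e) {
    // Reporting the failure marks this item as failed in the crawlDB,
    // so it is retried on the next crawl job (answer 2 above).
    fetchContext.newError(id)
        .withError(e.getMessage())                  // exact error-builder API may differ
        .emit();
  }
  return fetchContext.newResult();
}
```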
mwmitchell commented 4 years ago

Hi @matteogrolla,

For this one:

> if an exception is thrown inside generateRandom, the framework restarts the crawl from the previous checkpoint (or from the beginning, if it was the first one). How can I terminate the crawl, marking it as failed? I'd like the crawl, when restarted, to proceed from the last saved checkpoint.

Are you saying that you'd like the job to stop immediately, due to the exception that was thrown?

> Will this document be recrawled? When? Can we control this?

Do you have another way you'd like errors to behave?

matteogrolla commented 4 years ago

Hi @mwmitchell, this is a closed-source framework of an established product, so I expected to find a paragraph in the documentation describing how to deal with the different kinds of exceptions, but the only example is a NullPointerException. Anyway, since you ask, I'll try to approach the subject in general and then describe some practical scenarios that I have to deal with.

In the context of a batch job, errors can be partitioned into retriable errors (temporary failures that are worth retrying) and unretriable errors (failures that should stop the crawl).

Most errors should be thrown during communication with the document source (a web service, a mail server, ...), but if I'm not wrong the connector framework is a distributed system, so even fetchContext's emits are not error-free, and I'd like to understand what happens when these errors arise.
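
To make the distinction concrete, here is a rough sketch of where the two kinds of errors can surface in a fetch method (sourceClient and its exception are hypothetical; the SDK calls are paraphrased from the random-content example):

```java
// Sketch: lives inside a ContentFetcher implementation; sourceClient is a
// hypothetical client for the document source (web service, mail server, ...).
@Override
public FetchResult fetch(FetchContext fetchContext) {
  String day = fetchContext.getFetchInput().getId();     // e.g. "2020-01-01"

  List<String> docIds;
  try {
    docIds = sourceClient.listIdsPublishedOn(day);        // errors from the source show up here
  } catch (IOException e) {
    throw new RuntimeException("source unreachable", e);  // retriable or unretriable? that's my question
  }

  for (String docId : docIds) {
    // The framework itself is distributed, so even this emit may fail:
    // what happens to the crawl when it does?
    fetchContext.newCandidate(docId).emit();
  }
  return fetchContext.newResult();
}
```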

Here are some practical scenarios that I have to deal with:

Scenario A: source system goes offline (retriable exception needing many retries)

Scenario A1 (I've understood how to implement it):

- connector: asks for the ids of docs published on 2020-01-01
- source: returns the doc ids
- connector: emits those ids into the fetchContext as transient candidates and checkpoints 2020-01-01
- source: GOES OFFLINE
- connector: keeps trying to fetch the ids for 2020-01-02, and tries to fetch the doc bodies for the ids in the fetchContext
- both requests fail; the requests are retried endlessly

Next morning:

- source: GOES ONLINE
- connector: the bodies for the 2020-01-01 doc ids are fetched, the doc ids for 2020-01-02 are fetched, and the crawl proceeds

QUESTION: what happens if the crawl is stopped while the source is offline, and maybe Fusion is restarted? In randomContentIncremental the docIds are emitted as TRANSIENT candidates, and I don't know what that transient means.
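
For clarity, the emit/checkpoint pattern I have in mind for A1 looks roughly like this (paraphrased from random-content-incremental; the builder names, in particular withTransient and metadata, are how I read the example and may not be exact, and idsPublishedOn is a hypothetical helper wrapping the source call):

```java
// Inside fetch(), after the source has returned the doc ids for one day.
for (String docId : idsPublishedOn("2020-01-01")) {
  fetchContext.newCandidate(docId)
      .withTransient(true)     // <-- the "TRANSIENT candidate" I don't fully understand:
      .emit();                 //     what happens to these if the crawl stops while the source is offline?
}
// Save the last fully listed day so a restart can resume from it.
fetchContext.newCheckpoint("listing-checkpoint")
    .metadata(Collections.singletonMap("lastDay", "2020-01-01"))
    .emit();
```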

Scenario A2 (a proposal):

- connector: asks for the ids of docs published on 2020-01-01
- source: returns the doc ids
- connector: emits those ids into the fetchContext as transient candidates and checkpoints 2020-01-01
- source: GOES OFFLINE
- connector: keeps trying to fetch the ids for 2020-01-02, and tries to fetch the doc bodies for the ids in the fetchContext
- both requests fail; the crawl is STOPPED with (for example) fetchContext.stopCrawl()

Next morning someone (or maybe a scheduler) restarts the crawl:

- source: GOES ONLINE
- connector: the bodies for the 2020-01-01 doc ids are fetched, the doc ids for 2020-01-02 are fetched, and the crawl continues
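
A rough sketch of the A2 proposal (fetchContext.stopCrawl() does not exist today as far as I know, it is the kind of call I am asking for; sourceClient and SourceOfflineException are hypothetical):

```java
// Proposal sketch: stopCrawl() is NOT an existing SDK call, it is the behaviour I'd like.
@Override
public FetchResult fetch(FetchContext fetchContext) {
  try {
    for (String docId : sourceClient.listIdsPublishedOn("2020-01-02")) {
      fetchContext.newCandidate(docId).withTransient(true).emit();
    }
  } catch (SourceOfflineException e) {
    // Instead of retrying endlessly (A1), stop the whole crawl cleanly;
    // a later restart should resume from the last emitted checkpoint.
    fetchContext.stopCrawl();
  }
  return fetchContext.newResult();
}
```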

Scenario B: wrong request to the source system (unretriable exception that should stop the crawl)

- user: specifies a batch size that is too large
- connector: asks for a large batch of doc ids
- source: fails
- connector: stops the crawl with fetchContext.stopCrawl()
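
As far as I can tell, today the only option for B is to rethrow, which restarts the crawl instead of stopping it. A sketch of the situation (sourceClient and BadRequestException are hypothetical):

```java
// Scenario B as it stands today, without a stopCrawl()-like call.
int batchSize = 1_000_000;                        // in reality this comes from the datasource config
try {
  sourceClient.listIds(batchSize);                // fails because batchSize is too large
} catch (BadRequestException e) {
  // Rethrowing fails the job, but the framework then restarts from the previous
  // checkpoint and hits the same error again, instead of stopping the crawl and
  // surfacing the configuration problem to the user.
  throw new RuntimeException("unretriable source error: " + e.getMessage(), e);
}
```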

QUESTION: I don't understand the responsibility of fetchContext.newResult(). I believed it meant "we are done with this input, let's continue with the next one", but in randomContentIncremental this doesn't hold when the input triggers emitDocument (the else part, line 61):

- the input triggers emitDocument
- emitDocument may throw an exception
- fetchContext.newResult() is never reached, but we continue anyway with the next input
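
The shape of the code I'm referring to (a paraphrase of the else branch around line 61 of the random-content-incremental fetcher, not the exact example code):

```java
// Paraphrase of the pattern in question; input is the FetchInput being processed,
// and isCheckpoint is a stand-in for the real checkpoint test in the example.
if (isCheckpoint(input)) {
  // ... checkpoint handling ...
  return fetchContext.newResult();
} else {
  emitDocument(fetchContext, input);      // may throw, so newResult() below is never reached...
  return fetchContext.newResult();        // ...yet the crawl still continues with the next input
}
```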