Error management on random-connector-incremental

matteogrolla commented 4 years ago

Hi, I'm Matteo Grolla from Sourcesense, Lucidwork's partner in Italy. I'm developing a custom connector for a customer but I have questions about error management, Robert Lucarini suggested to post my questions here. Let's use random-content-incremental for our discussion and let's focus on the fetch method What I've noticed is:

if an exception is thrown inside generateRandom the framework restarts the crawl from previous checkpoint (or the beginning if it was the first) How can I terminate the crawl marking it as failed? I'd like that next time I restart the crawl it proceeds from last saved checkpoint
if an exception is thrown inside emitDocument the framework logs the error and proceeds with the crawl. Will this document be recrawled? When? Can we control this? Thanks a lot

roblucar commented 4 years ago

Hi @matteogrolla , Thank you for posting your questions.

The crawlDB will manage the state of the Job runs. In particular, the BlockID

identifies a series of 1 or more Jobs, and the lifetime of a BlockId spans from the start of a crawl to the crawls completion.When a Job starts and the previous Job did not complete (failed or stopped), the previous Job’s BlockId is reused. The same BlockId will be reused until the crawl successfully completes.BlockIds are used to quickly identify items in the CrawlDb which may not have been fully processed (complete). Which addresses the restart. Unfortunately there is no way at the time to programmatically stop a crawl job like when a user or external process initiates a stop through the Fusion UI or API.
Yes, the document will be marked as failed and retried on the successive crawl job.

mwmitchell commented 4 years ago

Hi @matteogrolla,

For this one:

if an exception is thrown inside generateRandom the framework restarts the crawl from previous checkpoint (or the beginning if it was the first) How can I terminate the crawl marking it as failed? I'd like that next time I restart the crawl it proceeds from last saved checkpoint

Are you saying that you'd like the job to stop immediately, due to the exception that was thrown?

Will this document be recrawled? When? Can we control this?

Do you have another way you'd like errors to behave?

matteogrolla commented 4 years ago

Hi @mwmitchell, this is a closed source framework of an established product, so I believed I'd find a paragraph in the documentation describing how to deal with the different kind of exceptions, but the only example is a nullpointer exception. Anyway, since you ask, I try to approach the subject in general and then describe some practical scenarios that I have to deal with.

In the context of a batch job errors can be partitioned in

Unretriable Errors: (will never work, don't bother retry) when they arise the failed operation should be logged and if possible the crawl should continue (otherwise it should stop)
Retriable Errors: (may work on next attempt) when they arise the failed operation should be retried a certain number of times (maybe undefinitely) and if it still fails, should be logged. Then if possible the crawl should continue (otherwise it should stop) The attempts necessary to succeed can be many and it may be usefull to stop the crawl and restart it when it can proceed successfully. (Maybe fusion needs maintenance and must be restarted)

Most errors should be thrown during the communication with the documents source (a web service, a mail server...), but if I'm not wrong the connector framework is a distributed system, so even fetchContext's emits are not error free and I'd like to understand what happens when these errors arise.

Here are some practical scenarios that I have to deal with

-scenarios A source system goes offline (retriable exception needing many retries) -- scenario A1: (I've understood how to implement it) connector: asks ids of docs published on 2020-01-01 source: returns doc ids connector: emits those ids in the fetchcontext as transient and checkpoints 2020-01-01 source: GOES OFFLINE connector: keeps trying fetching ids for 2020-01-02 tries fetching docs body for ids in the fetchcontext both requests fail requests are retried endlessly

next morning source: GOES ONLINE connector bodies of docids for 2020-01-01 are fetched doc ids for 2020-01-02 are fetched the crawl proceeds

QUESTION: what happens if the crawl is stopped when source is offline? and maybe fusion is restarted? In randomContentIncremental docIds are emitted as TRANSIENT candidate, and I don't know what that transient means

-- scenario A2: a proposal connector: asks ids of docs published on 2020-01-01 source: returns doc ids connector: emits those ids in the fetchcontext as transient and checkpoints 2020-01-01 source: GOES OFFLINE connector: keeps trying fetching ids for 2020-01-02 tries fetching docs body for ids in the fetchcontext both requests fail crawl is STOPPED with (for example) fetchContext.stopCrawl()

next morning someone (or maybe a scheduler) restarts the crawl source: GOES ONLINE connector: bodies of docids for 2020-01-01 are fetched doc ids for 2020-01-02 are fetched the crawl continues

-scenario B wrong request to source system (unretriable exception that shuld stop the crawl) user: specifies a batch size too large connector: asks a large batch of doc ids source: fails connector: stops the crawl with fetchContext.stopCrawl()

scenario C doc is deleted between fetch id and fetch body (unretriable exception that lets crawl proceed) connector: fetches id of doc1 from source system user: deletes doc1 from source system connector: tries fetching body of doc1 logs error and proceeds (I'd like at least the number of errors to be visible in the UI at the end of the crawl, so the exception should reach the framework and not just logged by custom code)

QUESTION: I don't understand the responsibility of fetchContext.newResult() I believed it was "we are done with this input, let's continue with next" In randomContentIncremental this doesn't work if the input triggers emitDocument (the else part, line 61) input triggers emitDocument emitDocument may throw exception fetchContext.newResult() is never reached but we continue anyway with next input

lucidworks / connectors-sdk-resources

Error management on random-connector-incremental #60