matchish / laravel-scout-elasticsearch

Search among multiple models with ElasticSearch and Laravel Scout

[BUG] scout:import starting two processes with the same id, one always failing #190

Closed: LMCom closed this issue 2 years ago

LMCom commented 2 years ago

Versions: Laravel 8.62.0, Scout 9.2.10, Package 5.0.1

Problem: I get entries in the failed_jobs table like Illuminate\Queue\MaxAttemptsExceededException: Matchish\ScoutElasticSearch\Jobs\QueueableJob has been attempted too many times or run too long. I don't think it's actually a timeout, because the payload contains timeout: null and failOnTimeout: false.

I spent hours searching for the cause and even logged the responses of Elasticsearch. Looking into worker.log, which is maintained by pm2, I see:

[2022-01-28 09:47:20][izYkReOaDX7UOzCmr4t21lHLOjvUIPj3] Processing: Matchish\ScoutElasticSearch\Jobs\QueueableJob
[2022-01-28 09:48:50][izYkReOaDX7UOzCmr4t21lHLOjvUIPj3] Processing: Matchish\ScoutElasticSearch\Jobs\QueueableJob
[2022-01-28 10:05:17][izYkReOaDX7UOzCmr4t21lHLOjvUIPj3] Processed:  Matchish\ScoutElasticSearch\Jobs\QueueableJob

It strikes me as odd that a job is started twice with the same id. The job started at 09:48:50 fails immediately (written to failed_jobs). I can see in pm2 monit that one queue worker keeps working until the log entry at 10:05:17 confirms that the job was processed. This also shows that there is no timeout problem and that the job actually works.

I guess it's tough, even for you, to diagnose this behaviour. Could you at least point out how I could debug this problem better and narrow down the cause?

LMCom commented 2 years ago

I don't want to bother you, but I would at least appreciate a hint on how to debug this. As a workaround, I have to use single-threaded indexing, and that slows things down a lot. A real shame, since the application is running on a 64-core Threadripper.
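(For reference: the single-threaded workaround presumably means disabling Scout's queueing so that indexing runs synchronously in the importing process. A minimal sketch, assuming the package honours Scout's standard queue option:)

    // config/scout.php
    // With queueing disabled, scout:import runs in a single process
    // instead of fanning work out to queue workers.
    'queue' => false,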

I also notice you seem to be from Kyiv? I hope you are not there right now and that your loved ones are safe! Day after day, it's heart-wrenching to see what is happening in Ukraine. :( Our governments are NOT doing enough to stop this war, and every citizen I know is willing to suffer inconveniences. It would be delusional to think we could have a war in Europe today without consequences for every single one of us.

matchish commented 2 years ago

Hey! First of all, thank you for your support💙💛 I really appreciate all the people who try to help in any way they can and do not stand aside🤗

I would try to exclude some moving parts to narrow down the cause.

Do you have the same issues for other commands? I mean the same process ids?

Could you use something else instead of pm2?

If the issue appears only for this package's command, then you can try disabling some code in the command (you need to fork the package for that).

matchish commented 2 years ago

Maybe the issue appears because of chaining jobs https://github.com/matchish/laravel-scout-elasticsearch/blob/8f0fffc318b527838de9a3712067cfcc21866ed3/src/Console/Commands/ImportCommand.php#L53

matchish commented 2 years ago

Changes in the pm2 configuration could also help.

matchish commented 2 years ago

https://github.com/matchish/laravel-scout-elasticsearch/issues/190#issuecomment-1100136780

You can try to send the import job to the queue directly, without chaining (fork needed).
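A rough sketch of that change against the linked ImportCommand code (illustrative only; the surrounding code in a real fork may differ):

    // The command currently wraps the import job in a chained QueueableJob:
    //     $job = (new QueueableJob())->chain([$job]);
    //     dispatch($job);
    // The suggested fork skips the wrapper and dispatches the import job
    // on its own:
    dispatch($job);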

LMCom commented 2 years ago

> Do you have the same issues for other commands? I mean the same process ids?

No.

> Could you use something else instead of pm2?

No, too many dependencies. It is a very mature and stable process manager; I don't see how it could cause the problem.

> Changes in the pm2 configuration could also help.

I will look into that; maybe there are rules for running/starting/stopping processes that affect this case.

> Hey! First of all, thank you for your support💙💛 I really appreciate all the people who try to help in any way they can and do not stand aside🤗

You're welcome! It's a matter of course.

LMCom commented 2 years ago

OK, I guess it's solved. The cause was simple after all.

It always worked with 1 worker. With 2 workers, I got the following:

Worker 0
    [2022-04-20 08:52:37][dU53PW37lcxbwCXMXxIbkPhPzoGOev7c] Processing: Matchish\ScoutElasticSearch\Jobs\QueueableJob
    [2022-04-20 08:55:05][dU53PW37lcxbwCXMXxIbkPhPzoGOev7c] Processed:  Matchish\ScoutElasticSearch\Jobs\QueueableJob
Worker 1
    [2022-04-20 08:54:07][dU53PW37lcxbwCXMXxIbkPhPzoGOev7c] Processing: Matchish\ScoutElasticSearch\Jobs\QueueableJob
    [2022-04-20 08:54:07][dU53PW37lcxbwCXMXxIbkPhPzoGOev7c] Failed:     Matchish\ScoutElasticSearch\Jobs\QueueableJob

The job started by worker 1 fails immediately. Notably, it always starts 90 seconds after worker 0 started the first job. I didn't know about the retry_after setting in the queue config, which was set to 90 seconds. And even after I found it, I interpreted it as "retry a failed job after that number of seconds." Actually, as per the documentation, retry_after will retry a job even while it is still running. And if a running job is retried, the retry fails immediately.

> In your config/queue.php configuration file, each queue connection defines a retry_after option. This option specifies how many seconds the queue connection should wait before retrying a job that is being processed. For example, if the value of retry_after is set to 90, the job will be released back onto the queue if it has been processing for 90 seconds without being released or deleted. Typically, you should set the retry_after value to the maximum number of seconds your jobs should reasonably take to complete processing.

(https://laravel.com/docs/9.x/queues)

Why does it always work with exactly one worker? I guess a second worker has to be available to pick up the retried job. The Laravel docs do not mention (afaik) that retry_after is effectively irrelevant if only one worker is running.
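The fix, then, is to raise retry_after above the longest run an import job can reasonably take. A sketch of config/queue.php, assuming a redis connection (driver and value are illustrative):

    'connections' => [
        'redis' => [
            'driver' => 'redis',
            'connection' => 'default',
            'queue' => env('REDIS_QUEUE', 'default'),
            // Default was 90 seconds; the import above ran for ~17 minutes,
            // so any value above the longest expected run prevents a second
            // worker from re-attempting a job that is still being processed.
            'retry_after' => 3600,
        ],
    ],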

Two questions: While investigating the code, I noticed that the import does not use parallelization. The job \Matchish\ScoutElasticSearch\Jobs\Import uses import stages from \Matchish\ScoutElasticSearch\Jobs\ImportStages::fromSource. There, the data is chunked and all chunks are processed within those stages, i.e. the chunks are not distributed over several jobs. Why?

And am I assuming correctly that $job = (new QueueableJob())->chain([$job]); is used to prevent an exception that would occur when the job tried to write to a non-existent output stream?

matchish commented 2 years ago

I don't remember the details, but you're right: Import was designed to run in parallel, but at some point there were issues with consistency because of overlapping pages, chunking performance, stage execution order, etc.

I have a proof of concept here and it worked the last time I checked (no progress reporting yet). So if you have time and are willing, be my guest to try it) A PR is highly welcome as well. https://github.com/matchish/laravel-scout-elasticsearch/issues/78

Chaining jobs is just a way to avoid an if and to handle any job the same way. If it's not a queueable job, it will be executed immediately; otherwise it will go to a queue.
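In other words, something like this general Laravel pattern (a simplified sketch of the idea; the package's actual QueueableJob logic may differ):

    use Illuminate\Contracts\Queue\ShouldQueue;

    // One code path: Laravel's bus decides whether to queue the job or
    // run it inline, based on whether it implements ShouldQueue.
    $job = (new QueueableJob())->chain([$job]);
    dispatch($job);

    // Conceptually equivalent to the conditional it avoids:
    if ($job instanceof ShouldQueue) {
        dispatch($job);       // pushed onto the queue
    } else {
        dispatch_sync($job);  // executed immediately, in-process
    }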

LMCom commented 2 years ago

> chunking performance

I noticed, too, that Laravel's native chunk() gets incrementally slower with each chunk. I had some use cases where it became too slow to be usable. Most likely that's because chunk() paginates with LIMIT/OFFSET, so MySQL has to scan and discard all the skipped rows for every successive page.

I am overwhelmed with work, so I don't think I will contribute code anytime soon. But you may be interested in a quick solution I wrote to get performant chunking:

    /**
     * Keyset-style alternative to Laravel's chunk(): walks the table by
     * primary-key ranges instead of LIMIT/OFFSET pagination.
     */
    public static function speedChunk($query, $chunkSize, $callback, $keyName = 'id')
    {
        $idQuery = clone $query;

        // Take every $chunkSize-th key; these become the chunk boundaries.
        $ids = $idQuery->orderBy($keyName)
            ->pluck($keyName)
            ->nth($chunkSize);

        for ($i = 0; $i < count($ids); ++$i) {
            // Lower bound: inclusive for the first chunk, exclusive afterwards.
            $chunkQuery = (clone $query)->where($keyName, ($i ? '>' : '>='), $ids[$i]);

            // Upper bound, except for the final, open-ended chunk.
            if (isset($ids[$i + 1])) {
                $chunkQuery->where($keyName, '<=', $ids[$i + 1]);
            }

            $callback($chunkQuery->get());
        }
    }

(old code, without types ;) )

This does not get slower with each chunk.
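A hypothetical usage example (Product and the QueryHelper host class are made-up names; searchable() is Scout's bulk-indexing method on Eloquent collections):

    // Index all products in chunks of 1000 without OFFSET pagination.
    QueryHelper::speedChunk(Product::query(), 1000, function ($products) {
        $products->searchable();
    });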

matchish commented 2 years ago

I will compare my solution with this one when I come back to the parallel import. Thanks)

LMCom commented 2 years ago

IMHO, this issue could be closed.