tarrade / proj_NLP_text_classification_with_GCP

Process lull #24

Closed. tarrade closed this issue 4 years ago.

tarrade commented 4 years ago

I see a lot of these warnings, but only with the full dataset. Why? Is it because some texts are too long?

Can you rerun a test job on 10'000 events with the default VM disk size (not 60 GB; I think it will be 250 GB)? I added a new metric to count the number of errors and warnings in the logs.

gprinz commented 4 years ago

@tarrade It is in progress... I had no time to do this yesterday because I was sick.

tarrade commented 4 years ago

Sure, no problem. I hope you have fully recovered.

I will be out soon. I added a few more monitoring plots, and I got an answer from the Dataflow project manager.

Are you doing any groupBy operation in our pipeline? I don't think so.

gprinz commented 4 years ago

No, the pipeline makes no use of any grouping operation. Let's see what happens and then re-run the pipeline with the default disk space of 250 GB, as recommended by Google.

tarrade commented 4 years ago

OK, and yes, let's see with the 250 GB per VM and the 25 TB disk.

gprinz commented 4 years ago

This time the job looks much better. So far we have no processing lull warnings. Furthermore, last time the job metrics changed a lot, whereas this time they are stable.

tarrade commented 4 years ago

Yes, this looks way better. Just before 12:00 something happened, but let's see.

gprinz commented 4 years ago

It looks bad: the metrics started jumping around. See:

[screenshot: new_dataflow_error]

[screenshot: new_dataflow_error_2]

gprinz commented 4 years ago

The second screenshot was taken shortly after the first. I will stop the process. Do you agree?

tarrade commented 4 years ago

Let's wait one or two hours and then see what we do. It could be that we always read the data in the same way and we hit some very long texts here.

We also need more info to debug and see if the job failed or not.

gprinz commented 4 years ago

@tarrade Sorry, I saw your post too late and had already stopped the job. I'm sure the job would have failed in the end, because more and more lull errors were thrown in the last few minutes and the metrics started jumping around. I also don't think the difference in the metrics can be explained by different post lengths alone; it rather looks to me like data loss, or something similar, occurring at some point...

gprinz commented 4 years ago

I had a short discussion with Arthur and the problem is most probably the following:

Our logs show that we have a processing lull in the following lines of code:

```python
def __lemmatization(self, input_str: str) -> str:
    doc = self.__spacy(input_str)
    tokens = [token.lemma_.lower() for token in doc if self.__spacy_cleaning(token)]
    return ' '.join(tokens)
```

As soon as Dataflow cannot process some of the data any further, it throws away that part of the data and restarts the work on it. This explains why the metrics (elements processed so far, total amount of vCPU time, etc.) start jumping around and why the job fails at some point.

Most likely there are input strings somewhere in the data that spaCy cannot process, and these cause our problems. This also explains why the job runs successfully when applied to a subset of the data.

Accordingly, I will try to find out which data/input strings cause the problems and try to fix them.
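
Something along these lines could help narrow it down (a rough sketch; the model name and the `(post_id, text)` input format are assumptions, not taken from our actual pipeline): replay suspect posts through spaCy one by one and flag the ones that take unusually long.

```python
import logging
import time

import spacy

# Assumed model; the real pipeline may load a different one.
nlp = spacy.load("en_core_web_sm")


def find_slow_posts(posts, threshold_s=60.0):
    """posts: iterable of (post_id, text) pairs; returns ids of slow posts."""
    slow_ids = []
    for post_id, text in posts:
        start = time.time()
        nlp(text)  # same heavy call the pipeline makes
        elapsed = time.time() - start
        if elapsed > threshold_s:
            logging.warning("post %s took %.1f s (%d chars)", post_id, elapsed, len(text))
            slow_ids.append(post_id)
    return slow_ids
```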

tarrade commented 4 years ago

I think the issue could be with spaCy, but if you look at the logs you will see that the lull happens for many spaCy functions, not only the join. I looked at the previous job and I didn't see a single function standing out. Here I didn't look at them yet.

The second thing that puzzles me is why we see nothing for 2 hours and only then the problem starts.

Can you try to select a small dataset with the 1'000 longest posts? Maybe there we have a better chance of finding the issue.
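
For reference, a minimal sketch of how such a selection could look with the BigQuery Python client; the table name `project.dataset.posts` and the columns `id`/`text` are placeholders for the real schema:

```python
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT id, text
    FROM `project.dataset.posts`   -- placeholder table
    ORDER BY LENGTH(text) DESC     -- longest posts first
    LIMIT 1000
"""
for row in client.query(query).result():
    print(row.id, len(row.text))
```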

tarrade commented 4 years ago

Another option is to add a print at the beginning of the NLP function and print the BigQuery index/id, so that offline we can reprocess this example and see the problem. In this case we will need to rerun the job until we see the issues appearing (1-2 h).
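
A minimal sketch of that debug print, assuming the elements are dicts with `id` and `text` keys (the placeholder `_lemmatize` stands in for the real spaCy preprocessing):

```python
import logging

import apache_beam as beam


class DebugNLPDoFn(beam.DoFn):
    def _lemmatize(self, text):
        # Placeholder for the pipeline's spaCy-based lemmatization.
        return text.lower()

    def process(self, element):
        # Log the BigQuery index/id before the heavy NLP work: the last id
        # seen in the worker logs points at the element causing the lull,
        # and that element can then be reprocessed offline.
        logging.info("processing id=%s (%d chars)", element["id"], len(element["text"]))
        element["text_clean"] = self._lemmatize(element["text"])
        yield element
```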

gprinz commented 4 years ago

The warning is always thrown when calling `doc = self.__spacy(input_str)`. The other lines in the call stack just show us where the problem inside spaCy occurs.

tarrade commented 4 years ago

OK, let's review the full NLP code next week to be sure it works in all cases and outputs what we expect. With this, we will be sure that spaCy will not be responsible for some future issue.
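
As a starting point for that review, a sketch of the kind of edge-case checks we could add; `preprocess` here is only a stand-in for the pipeline's real cleaning/lemmatization entry point:

```python
import pytest


def preprocess(text: str) -> str:
    # Stand-in for the real spaCy-based cleaning/lemmatization.
    return " ".join(tok.lower() for tok in text.split())


@pytest.mark.parametrize("text", [
    "",                          # empty post
    " \n\t ",                    # whitespace only
    "a" * 100_000,               # one extremely long token
    "word " * 50_000,            # very long post
    "emoji 🤖, <html> tags &amp; entities",
])
def test_preprocess_always_returns_a_string(text):
    assert isinstance(preprocess(text), str)
```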

gprinz commented 4 years ago

I have applied the pipeline to the 10'000 longest posts and it finished successfully. Accordingly, the problem must be somewhere else. The conclusions we can draw from our tests:

Let's do that next Monday and we will quickly find the data that makes our pipeline fail.

tarrade commented 4 years ago

Yes, let's review everything on Monday. Happy weekend!

tarrade commented 4 years ago

"Processing lull" is not an error - it's just debugging information supplied by Dataflow to help you debug your slow DoFn's. It's shown if a DoFn is processing an element for more than 5 minutes (I think). If the DoFn is expected to be slow, you can ignore this message. If it's not expected - the processing lull message tells you exactly what the DoFn is currently doing so you can debug it.

tarrade commented 4 years ago

The issue was related to a memory leak. With spaCy 2.1.8 we start to see processing lulls after 8M events. With the latest version, 2.2.3, no issues so far with 14M events processed. Let's see if we can reach our 35M elements. Closing for now.
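
For completeness, a small guard that could be added at pipeline start-up so that no worker falls back to the leaking 2.1.x release (the minimum version comes from this thread; the check itself is just a sketch):

```python
import spacy
from packaging import version

MIN_SPACY = "2.2.3"  # first version that no longer showed the lulls in our runs

if version.parse(spacy.__version__) < version.parse(MIN_SPACY):
    raise RuntimeError(
        f"spaCy {spacy.__version__} showed memory-leak-like lulls on this workload; "
        f"install >= {MIN_SPACY}"
    )
```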