lapps-xdd / xdd-processing

Odd processing times on COVID data #8

Open marcverhagen opened 3 months ago

marcverhagen commented 3 months ago

Running 10-15K documents through the preprocessing chain takes a couple of hours (most of that time is spent on spaCy processing). For the COVID data this goes up to almost 30 hours. The culprit seems to be that a sizable number of documents take about 5 minutes each to get processed.

This has to be fixed before we can do large-scale processing.

marcverhagen commented 3 months ago

Among the 15,000 COVID documents, 244 (less than 2% of the total) took up 91% of all processing time. Of those 244, 240 each took 323 or 324 seconds to process, and the remaining 4 each took 3188-3190 seconds (almost one hour each). The results for all 244 documents look fine. Also, there is nothing unusual about those documents: when I process them separately they each take about a second.
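As a sanity check, the figures above can be added up to see whether they are consistent with the "almost 30 hours" total. This is only a back-of-the-envelope sketch: it uses the midpoints of the quoted ranges (323.5 s and 3189 s), which is an assumption, and the variable names are made up for illustration.

```python
# Back-of-the-envelope check using only the numbers quoted in this issue.
# Midpoints of the reported ranges are an assumption, not measured values.
TOTAL_DOCS = 15000
SLOW_DOCS = 244

slow_a = 240 * 323.5           # 240 documents at 323-324 s each
slow_b = 4 * 3189              # 4 documents at 3188-3190 s each
slow_total = slow_a + slow_b   # seconds spent on the 244 slow documents

# If the slow documents account for 91% of the run, the whole run took:
implied_total = slow_total / 0.91

# Average time for each of the remaining documents
fast_avg = (implied_total - slow_total) / (TOTAL_DOCS - SLOW_DOCS)


def hours(seconds):
    """Convert seconds to hours."""
    return seconds / 3600


print(f"slow documents: {hours(slow_total):.1f} h")
print(f"implied total:  {hours(implied_total):.1f} h")
print(f"average for the other documents: {fast_avg:.2f} s")
```

Under these assumptions the slow documents alone add up to roughly 25 hours and the implied total is a bit under 28 hours, which matches the overall runtime reported above. The other documents average well under a second each.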

The issue started popping up around the 500th document and then recurred more regularly, every 50-100 documents. So I tried to replicate this on the machine where I had the original issue by taking a thousand documents and monitoring processing time per document as well as CPU and memory usage. It ran fine, and memory use barely increased.
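For anyone who wants to repeat this kind of monitoring, here is a minimal sketch of per-document instrumentation using only the Python standard library. It is not the script used for the run above. `timed_run`, `flag_slow` and the `process` callable are hypothetical names, and `process` stands in for whatever the preprocessing chain does with one document. Note that `tracemalloc` only sees memory allocated through Python, so native allocations inside compiled extensions would need an OS-level tool instead.

```python
import statistics
import time
import tracemalloc


def timed_run(docs, process):
    """Call process(doc) for each doc, recording wall time, CPU time and memory.

    `process` is a placeholder for the real per-document processing step.
    """
    tracemalloc.start()
    records = []
    for index, doc in enumerate(docs):
        wall_start = time.perf_counter()   # elapsed real time
        cpu_start = time.process_time()    # CPU time of this process only
        process(doc)
        wall = time.perf_counter() - wall_start
        cpu = time.process_time() - cpu_start
        current, _peak = tracemalloc.get_traced_memory()
        records.append(
            {"index": index, "wall": wall, "cpu": cpu, "mem_bytes": current}
        )
    tracemalloc.stop()
    return records


def flag_slow(records, factor=10.0):
    """Return records whose wall time exceeds `factor` times the median."""
    median = statistics.median(r["wall"] for r in records)
    return [r for r in records if r["wall"] > factor * median]
```

Recording both clocks is the useful part. If a flagged document shows a wall time of minutes but a CPU time of about a second, the process was mostly waiting (on I/O, on the scheduler, or because the machine was suspended) rather than computing, which would point at the environment and not at the document. Keep in mind that `tracemalloc` itself adds some overhead, so absolute timings will be a little higher than in an uninstrumented run.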

For now I am assuming there was something going on on this particular computer when I left it alone overnight to run the batch, but I will keep this issue open for a little longer.