kitodo / kitodo-presentation

Kitodo.Presentation is a feature-rich framework for building a METS- or IIIF-based digital library. It is part of the Kitodo Digital Library Suite.
https://kitodo.org
GNU General Public License v3.0
36 stars 45 forks source link

[BUG] Memory leak in kitodo:reindex CLI job #1165

Open stweil opened 5 months ago

stweil commented 5 months ago

Description

Running the reindex jobs works with a small number of documents or very much memory, but fails with out of memory for huge collections. While the job is running, the used memory continuously grows. 10 GB of memory is not sufficient for less than 3000 documents.

Reproduction

Just run the CLI job and watch how the memory usage grows.

Expected Behavior

There is no reason why the required memory should grow with the number of indexed documents. As soon as a document was indexed, there is no longer a need to store its metadata in RAM.

Environment

Additional Context

See also issue #822. The fix added there obviously no longer works.

sebastian-meyer commented 5 months ago

Previously all document instances were kept in AbstractDocument::$registry and the AbstractDocument::clearRegistry() method reset that property, in turn freeing up the memory.

As it seems, this property isn't used any more (at least it isn't populated in AbstractDocument::getInstance()) and so AbstractDocument::clearRegistry() has no effect. Instead the document instances are saved using the TYPO3 caching framework (see AbstractDocument::setDocumentCache()).

So my guess would be that AbstractDocument::clearRegistry() needs to clear the document cache instead of resetting the $registry property in order to make it work again.

(And maybe we should rename it to "clearCache()" as well and remove the unused property $registry while we are at it...)

stweil commented 5 months ago

Thanks. If you have a code snippet, we can test that tomorrow.

sebastian-meyer commented 5 months ago

I provided a pull request in #1168. Thanks for reviewing/testing!

stweil commented 5 months ago

I cherry-picked the 2nd commit (1a2ea0d4c068dd1ca630ab4967f504a2bf484ff1) from that pull request. Indexing still leaks:

Reindex a collection into database and Solr.
============================================

2024-02-04 22:17:20 Indexing 1/2098 with UID "33" ("https://digi.bib.uni-mannheim.de/vl/ubmadesb/oai:digi.bib.uni-mannheim.de:51284.xml") on PID 3 and Solr core 11.
Mem Usage 1: 72380368
Mem Usage 2: 72380368
Mem Usage 3: 76825352
Mem Usage 4: 78123680
Mem Usage 5: 78123552
2024-02-04 22:17:21 Indexing 2/2098 with UID "72" ("https://digi.bib.uni-mannheim.de/fileadmin/digi/416971342/416971342.xml") on PID 3 and Solr core 11.
Mem Usage 1: 78373504
Mem Usage 2: 78373504
Mem Usage 3: 79069448
Mem Usage 4: 80282000
Mem Usage 5: 80282000
[...]
2024-02-04 22:20:58 Indexing 17/2098 with UID "90" ("https://digi.bib.uni-mannheim.de/fileadmin/digi/418980381/418980381_anchor.xml") on PID 3 and Solr core 11.
Mem Usage 1: 124659560
Mem Usage 2: 124659560
Mem Usage 3: 124690384
Mem Usage 4: 124707624
Mem Usage 5: 124707624

The memory traces were produced using this patch:

diff --git a/Classes/Command/ReindexCommand.php b/Classes/Command/ReindexCommand.php
index bcb2da4c..6bb101d6 100644
--- a/Classes/Command/ReindexCommand.php
+++ b/Classes/Command/ReindexCommand.php
@@ -175,14 +175,19 @@ class ReindexCommand extends BaseCommand
                 if ($io->isVerbose()) {
                     $io->writeln(date('Y-m-d H:i:s') . ' Indexing ' . ($id + 1) . '/' . count($documents) . ' with UID "' . $document->getUid() . '" ("' . $document->getLocation() . '") on PID ' . $this->storagePid . ' and Solr core ' . $solrCoreUid . '.');
                 }
+                $io->writeln("Mem Usage 1: ". memory_get_usage());
                 $document->setCurrentDocument($doc);
+                $io->writeln("Mem Usage 2: ". memory_get_usage());
                 // save to database
                 $this->saveToDatabase($document);
+                $io->writeln("Mem Usage 3: ". memory_get_usage());
                 // add to index
                 Indexer::add($document, $this->documentRepository);
             }
+            $io->writeln("Mem Usage 4: ". memory_get_usage());
             // Clear document cache to prevent memory exhaustion.
             AbstractDocument::clearDocumentCache();
+            $io->writeln("Mem Usage 5: ". memory_get_usage());
         }

         $io->success('All done!');
jmechnich commented 5 months ago

@sebastian-meyer I have noticed a registry array in Solr/Solr.php. Is it possible that it replaced the functionality of AbstractDocument::$registry (and should be cleared in a similar fashion)?

sebastian-meyer commented 5 months ago

@sebastian-meyer I have noticed a registry array in Solr/Solr.php. Is it possible that it replaced the functionality of AbstractDocument::$registry (and should be cleared in a similar fashion)?

This one actually holds the core-specific instances of the Solr connection objects. That's fine and should not blow up memory usage, since all indexing uses the same connection instance (i. e. there should be only one instance kept by this property during indexing).

But I found another potential memory leak in BaseCommand::saveToDatabase(), where the TYPO3 / Doctrine DBAL persistence manager is continuously populated with documents but never cleared. I've added a PersistenceManager::clearState() command - let's see if this helps! (See the newest commit.)

beatrycze-volk commented 1 month ago

Short update on this topic.

I have looked into this memory usage:

        foreach ($documents as $id => $document) {
            $doc = AbstractDocument::getInstance($document->getLocation(), ['storagePid' => $this->storagePid], true);

            if ($doc === null) {
                $io->warning('WARNING: Document "' . $document->getLocation() . '" could not be loaded. Skip to next document.');
                continue;
            }

            if ($dryRun) {
                $io->writeln('DRY RUN: Would index ' . ($id + 1) . '/' . count($documents) . ' with UID "' . $document->getUid() . '" ("' . $document->getLocation() . '") on PID ' . $this->storagePid . ' and Solr core ' . $solrCoreUid . '.');
            } else {
                if ($io->isVerbose()) {
                    $io->writeln(date('Y-m-d H:i:s') . ' Indexing ' . ($id + 1) . '/' . count($documents) . ' with UID "' . $document->getUid() . '" ("' . $document->getLocation() . '") on PID ' . $this->storagePid . ' and Solr core ' . $solrCoreUid . '.');
                }
                $io->writeln('Peak usage - begin: ' . round(memory_get_peak_usage() / 1024) . ' KB');
                $io->writeln('--------------------------------------------------');
                $io->writeln('Current usage 1: ' . round(memory_get_usage() / 1024) . ' KB');
                $document->setCurrentDocument($doc);
                $io->writeln('Current usage 2: ' . round(memory_get_usage() / 1024) . ' KB - after setCurrentDocument()');
                // save to database
                $this->saveToDatabase($document, $io);
                $io->writeln('Current usage 3: ' . round(memory_get_usage() / 1024) . ' KB - after saveToDatabase()');
                // add to index
                Indexer::add($document, $this->documentRepository);
                $io->writeln('Current usage 4: ' . round(memory_get_usage() / 1024) . ' KB - after add()');
            }
            // Clear document and persistence cache to prevent memory exhaustion.
            AbstractDocument::clearDocumentCache();
            $io->writeln('Current usage 5: ' . round(memory_get_usage() / 1024) . ' KB - after clearDocumentCache()');
            $this->persistenceManager->clearState();
            $io->writeln('Current usage 6: ' . round(memory_get_usage() / 1024) . ' KB - after clearState()');
            $io->writeln('--------------------------------------------------');
            $io->writeln('Peak usage - end: ' . round(memory_get_peak_usage() / 1024) . ' KB');
        }

Results:

2024-05-08 16:13:14 Indexing 1/4320 with UID "2" ("https://digital.slub-dresden.de/data/kitodo/10Kepi_476251419/10Kepi_476251419_mets.xml") on PID 2 and Solr core 1.
Peak usage - begin: 96421 KB
--------------------------------------------------
Current usage 1: 95976 KB
Current usage 2: 95976 KB - after setCurrentDocument()
Current usage 3: 97118 KB - after saveToDatabase()
Current usage 4: 97796 KB - after add()
Current usage 5: 97796 KB - after clearDocumentCache()
Current usage 6: 88742 KB - after clearState()
--------------------------------------------------
Peak usage - end: 97918 KB

2024-05-08 16:13:17 Indexing 2/4320 with UID "3" ("https://digital.slub-dresden.de/data/kitodo/ZachPoet_477896448_0006/ZachPoet_477896448_0006_anchor.xml") on PID 2 and Solr core 1.
Peak usage - begin: 97918 KB
--------------------------------------------------
Current usage 1: 88758 KB
Current usage 2: 88758 KB - after setCurrentDocument()
Current usage 3: 88834 KB - after saveToDatabase()
Current usage 4: 89011 KB - after add()
Current usage 5: 89011 KB - after clearDocumentCache()
Current usage 6: 88998 KB - after clearState()
--------------------------------------------------
Peak usage - end: 97918 KB

2024-05-08 16:13:18 Indexing 3/4320 with UID "4" ("https://digital.slub-dresden.de/data/kitodo/ZachPoet_477887139_0001/ZachPoet_477887139_0001_mets.xml") on PID 2 and Solr core 1.
Peak usage - begin: 97918 KB
--------------------------------------------------
Current usage 1: 89998 KB
Current usage 2: 89998 KB - after setCurrentDocument()
Current usage 3: 92457 KB - after saveToDatabase()
Current usage 4: 96028 KB - after add()
Current usage 5: 96028 KB - after clearDocumentCache()
Current usage 6: 96010 KB - after clearState()
--------------------------------------------------
Peak usage - end: 97918 KB

2024-05-08 16:15:53 Indexing 4/4320 with UID "5" ("https://digital.slub-dresden.de/data/kitodo/6Saso_476248086/6Saso_476248086_mets.xml") on PID 2 and Solr core 1.
Peak usage - begin: 97918 KB
--------------------------------------------------
Current usage 1: 96067 KB
Current usage 2: 96067 KB - after setCurrentDocument()
Current usage 3: 96240 KB - after saveToDatabase()
Current usage 4: 96346 KB - after add()
Current usage 5: 96346 KB - after clearDocumentCache()
Current usage 6: 96332 KB - after clearState()
--------------------------------------------------
Peak usage - end: 97918 KB

.....

2024-05-08 16:44:07 Indexing 700/4320 with UID "701" ("https://digital.slub-dresden.de/data/kitodo/EdelWns_476250269/EdelWns_476250269_mets.xml") on PID 2 and Solr core 1.
Peak usage - begin: 501309 KB
--------------------------------------------------
Current usage 1: 501044 KB
Current usage 2: 501044 KB - after setCurrentDocument()
Current usage 3: 501322 KB - after saveToDatabase()
Current usage 4: 501447 KB - after add()
Current usage 5: 501447 KB - after clearDocumentCache()
Current usage 6: 501433 KB - after clearState()
--------------------------------------------------
Peak usage - end: 501561 KB

2024-05-08 16:44:08 Indexing 701/4320 with UID "702" ("https://digital.slub-dresden.de/data/kitodo/Dankduald_476250048/Dankduald_476250048_mets.xml") on PID 2 and Solr core 1.
Peak usage - begin: 501619 KB
--------------------------------------------------
Current usage 1: 501485 KB
Current usage 2: 501485 KB - after setCurrentDocument()
Current usage 3: 501649 KB - after saveToDatabase()
Current usage 4: 501737 KB - after add()
Current usage 5: 501737 KB - after clearDocumentCache()
Current usage 6: 501723 KB - after clearState()
--------------------------------------------------
Peak usage - end: 501849 KB
2024-05-08 16:44:08 Indexing 702/4320 with UID "703" ("https://digital.slub-dresden.de/data/kitodo/DalaAdol_476247772/DalaAdol_476247772_mets.xml") on PID 2 and Solr core 1.
Peak usage - begin: 501957 KB
--------------------------------------------------
Current usage 1: 501788 KB
Current usage 2: 501788 KB - after setCurrentDocument()
Current usage 3: 501990 KB - after saveToDatabase()
Current usage 4: 502068 KB - after add()
Current usage 5: 502068 KB - after clearDocumentCache()
Current usage 6: 502054 KB - after clearState()
--------------------------------------------------
Peak usage - end: 502180 KB

2024-05-08 16:44:09 Indexing 703/4320 with UID "704" ("https://digital.slub-dresden.de/data/kitodo/CordColl_477887651/CordColl_477887651_mets.xml") on PID 2 and Solr core 1.
Peak usage - begin: 502224 KB
--------------------------------------------------
Current usage 1: 502102 KB
Current usage 2: 502102 KB - after setCurrentDocument()
Current usage 3: 502260 KB - after saveToDatabase()
Current usage 4: 502310 KB - after add()
Current usage 5: 502310 KB - after clearDocumentCache()
Current usage 6: 502296 KB - after clearState()
--------------------------------------------------
Peak usage - end: 502421 KB

....

2024-05-08 17:23:04 Indexing 1505/4320 with UID "1506" ("https://digital.slub-dresden.de/data/kitodo/DresKr_880547324-1922102801_01-v/DresKr_880547324-1922102801_01-v_mets.xml") on PID 2 and Solr core 1.
Peak usage - begin: 892923 KB
--------------------------------------------------
Current usage 1: 892816 KB
Current usage 2: 892816 KB - after setCurrentDocument()
Current usage 3: 893143 KB - after saveToDatabase()
Current usage 4: 893252 KB - after add()
Current usage 5: 893252 KB - after clearDocumentCache()
Current usage 6: 893224 KB - after clearState()
--------------------------------------------------
Peak usage - end: 893339 KB

So basically setCurrentDocument() and clearDocumentCache() have no effect on memory usage.

saveToDatabase() increase it 120-300 KB for single document and around 1 MB if also parent document is included. add() increase it 50-150 KB and around 3,5 MB if also parent document is included.

clearState() frees around 14 KB for single document and 18 KB for document with the parent.

By this we can calculate that every single document adds between 200 and 400 KB. The main offender is saveToDatabase() but not persistence manager itself as freeing it reduces usage about 5 to 10% of the memory added by this function.

I'm currently analyzing which parts of saveToDatabase() are responsible for this.

In the moment when indexer has passed 1500 documents memory usage is around 872 MB so not even close to the case presented in the https://github.com/kitodo/kitodo-presentation/issues/822. There it is already around 5GB. Please correct me if I have have mistaken something.

Changes can be previewed here: https://github.com/kitodo/kitodo-presentation/compare/master...beatrycze-volk:memory?expand=1

beatrycze-volk commented 1 month ago

UPDATE

Results BaseCommand->saveToDatabase():

2024-05-13 12:33:49 Indexing 1/4320 with UID "2" ("https://digital.slub-dresden.de/data/kitodo/10Kepi_476251419/10Kepi_476251419_mets.xml") on PID 2 and Solr core 1.
Peak usage - begin: 96421 KB
------------------------------------------------------
Current usage 1: 95976 KB - before saveToDatabase()
Document uid: 2
saveToDatabase() usage 1-1: 95976 KB - before setCurrentDocument()
saveToDatabase() usage 1-2: 95976 KB - before getToplevelMetadata()
saveToDatabase() usage 1-3: 96361 KB - after getToplevelMetadata()
saveToDatabase() usage 1-4: 96752 KB - before getParentDocumentUidForSaving()
saveToDatabase() usage 1-5: 96754 KB - before persistAll()
saveToDatabase() usage 1-6: 97136 KB - after persistAll()
Current usage 2: 97118 KB - between saveToDatabase() and add()
Current usage 3: 97796 KB - after add()
Current usage 4: 88742 KB - after clearState()
------------------------------------------------------
Peak usage - end: 97918 KB
2024-05-13 12:33:56 Indexing 2/4320 with UID "3" ("https://digital.slub-dresden.de/data/kitodo/ZachPoet_477896448_0006/ZachPoet_477896448_0006_anchor.xml") on PID 2 and Solr core 1.
Peak usage - begin: 97918 KB
------------------------------------------------------
Current usage 1: 88758 KB - before saveToDatabase()
Document uid: 3
saveToDatabase() usage 1-1: 88758 KB - before setCurrentDocument()
saveToDatabase() usage 1-2: 88788 KB - before getToplevelMetadata()
saveToDatabase() usage 1-3: 88806 KB - after getToplevelMetadata()
saveToDatabase() usage 1-4: 88847 KB - before getParentDocumentUidForSaving()
saveToDatabase() usage 1-5: 88849 KB - before persistAll()
saveToDatabase() usage 1-6: 88849 KB - after persistAll()
Current usage 2: 88834 KB - between saveToDatabase() and add()
Current usage 3: 89011 KB - after add()
Current usage 4: 88998 KB - after clearState()
------------------------------------------------------
Peak usage - end: 97918 KB
2024-05-13 12:33:56 Indexing 3/4320 with UID "4" ("https://digital.slub-dresden.de/data/kitodo/ZachPoet_477887139_0001/ZachPoet_477887139_0001_mets.xml") on PID 2 and Solr core 1.
Peak usage - begin: 97918 KB
------------------------------------------------------
Current usage 1: 89998 KB - before saveToDatabase()
Document uid: 4
saveToDatabase() usage 1-1: 89998 KB - before setCurrentDocument()
saveToDatabase() usage 1-2: 90028 KB - before getToplevelMetadata()
saveToDatabase() usage 1-3: 92354 KB - after getToplevelMetadata()
saveToDatabase() usage 1-4: 92396 KB - before getParentDocumentUidForSaving()
Document parent uid: 4
Document uid: 3
saveToDatabase() usage 1-1: 92439 KB - before setCurrentDocument()
saveToDatabase() usage 1-2: 92439 KB - before getToplevelMetadata()
saveToDatabase() usage 1-3: 92457 KB - after getToplevelMetadata()
saveToDatabase() usage 1-4: 92484 KB - before getParentDocumentUidForSaving()
saveToDatabase() usage 1-5: 92485 KB - before persistAll()
saveToDatabase() usage 1-6: 92487 KB - after persistAll()
saveToDatabase() usage 1-5: 92474 KB - before persistAll()
saveToDatabase() usage 1-6: 92476 KB - after persistAll()
Current usage 2: 92457 KB - between saveToDatabase() and add()
Current usage 3: 96068 KB - after add()
Current usage 4: 96050 KB - after clearState()
------------------------------------------------------
Peak usage - end: 97918 KB
2024-05-13 12:36:11 Indexing 4/4320 with UID "5" ("https://digital.slub-dresden.de/data/kitodo/6Saso_476248086/6Saso_476248086_mets.xml") on PID 2 and Solr core 1.
Peak usage - begin: 97918 KB
------------------------------------------------------
Current usage 1: 96107 KB - before saveToDatabase()
Document uid: 5
saveToDatabase() usage 1-1: 96107 KB - before setCurrentDocument()
saveToDatabase() usage 1-2: 96136 KB - before getToplevelMetadata()
saveToDatabase() usage 1-3: 96254 KB - after getToplevelMetadata()
saveToDatabase() usage 1-4: 96295 KB - before getParentDocumentUidForSaving()
saveToDatabase() usage 1-5: 96296 KB - before persistAll()
saveToDatabase() usage 1-6: 96298 KB - after persistAll()
Current usage 2: 96280 KB - between saveToDatabase() and add()
Current usage 3: 96386 KB - after add()
Current usage 4: 96372 KB - after clearState()
------------------------------------------------------
Peak usage - end: 97918 KB
2024-05-13 12:36:12 Indexing 5/4320 with UID "6" ("https://digital.slub-dresden.de/data/kitodo/5PiG_476250676/5PiG_476250676_mets.xml") on PID 2 and Solr core 1.
Peak usage - begin: 97918 KB
------------------------------------------------------
Current usage 1: 96425 KB - before saveToDatabase()
Document uid: 6
saveToDatabase() usage 1-1: 96425 KB - before setCurrentDocument()
saveToDatabase() usage 1-2: 96454 KB - before getToplevelMetadata()
saveToDatabase() usage 1-3: 96568 KB - after getToplevelMetadata()
saveToDatabase() usage 1-4: 96609 KB - before getParentDocumentUidForSaving()
saveToDatabase() usage 1-5: 96610 KB - before persistAll()
saveToDatabase() usage 1-6: 96612 KB - after persistAll()
Current usage 2: 96594 KB - between saveToDatabase() and add()
Current usage 3: 96726 KB - after add()
Current usage 4: 96712 KB - after clearState()
------------------------------------------------------
Peak usage - end: 97918 KB

....

2024-05-13 12:46:24 Indexing 354/4320 with UID "355" ("https://digital.slub-dresden.de/data/kitodo/BchIhm_476250706/BchIhm_476250706_mets.xml") on PID 2 and Solr core 1.
Peak usage - begin: 309513 KB
------------------------------------------------------
Current usage 1: 309068 KB - before saveToDatabase()
Document uid: 355
saveToDatabase() usage 1-1: 309068 KB - before setCurrentDocument()
saveToDatabase() usage 1-2: 309098 KB - before getToplevelMetadata()
saveToDatabase() usage 1-3: 309490 KB - after getToplevelMetadata()
saveToDatabase() usage 1-4: 309531 KB - before getParentDocumentUidForSaving()
saveToDatabase() usage 1-5: 309533 KB - before persistAll()
saveToDatabase() usage 1-6: 309535 KB - after persistAll()
Current usage 2: 309516 KB - between saveToDatabase() and add()
Current usage 3: 309747 KB - after add()
Current usage 4: 309733 KB - after clearState()
------------------------------------------------------
Peak usage - end: 309867 KB
2024-05-13 12:46:25 Indexing 355/4320 with UID "356" ("https://digital.slub-dresden.de/data/kitodo/BegrGo_476250447/BegrGo_476250447_mets.xml") on PID 2 and Solr core 1.
Peak usage - begin: 310159 KB
------------------------------------------------------
Current usage 1: 309846 KB - before saveToDatabase()
Document uid: 356
saveToDatabase() usage 1-1: 309846 KB - before setCurrentDocument()
saveToDatabase() usage 1-2: 309876 KB - before getToplevelMetadata()
saveToDatabase() usage 1-3: 310144 KB - after getToplevelMetadata()
saveToDatabase() usage 1-4: 310185 KB - before getParentDocumentUidForSaving()
saveToDatabase() usage 1-5: 310187 KB - before persistAll()
saveToDatabase() usage 1-6: 310188 KB - after persistAll()
Current usage 2: 310170 KB - between saveToDatabase() and add()
Current usage 3: 310366 KB - after add()
Current usage 4: 310352 KB - after clearState()
------------------------------------------------------
Peak usage - end: 310484 KB
2024-05-13 12:46:25 Indexing 356/4320 with UID "357" ("https://digital.slub-dresden.de/data/kitodo/GssIst_476247438/GssIst_476247438_mets.xml") on PID 2 and Solr core 1.
Peak usage - begin: 310506 KB
------------------------------------------------------
Current usage 1: 310397 KB - before saveToDatabase()
Document uid: 357
saveToDatabase() usage 1-1: 310397 KB - before setCurrentDocument()
saveToDatabase() usage 1-2: 310427 KB - before getToplevelMetadata()
saveToDatabase() usage 1-3: 310516 KB - after getToplevelMetadata()
saveToDatabase() usage 1-4: 310557 KB - before getParentDocumentUidForSaving()
saveToDatabase() usage 1-5: 310559 KB - before persistAll()
saveToDatabase() usage 1-6: 310561 KB - after persistAll()
Current usage 2: 310542 KB - between saveToDatabase() and add()
Current usage 3: 310651 KB - after add()
Current usage 4: 310637 KB - after clearState()
------------------------------------------------------
Peak usage - end: 310815 KB
2024-05-13 12:46:26 Indexing 357/4320 with UID "358" ("https://digital.slub-dresden.de/data/kitodo/AnhaDie_476246342/AnhaDie_476246342_mets.xml") on PID 2 and Solr core 1.
Peak usage - begin: 310815 KB
------------------------------------------------------
Current usage 1: 310682 KB - before saveToDatabase()
Document uid: 358
saveToDatabase() usage 1-1: 310682 KB - before setCurrentDocument()
saveToDatabase() usage 1-2: 310712 KB - before getToplevelMetadata()
saveToDatabase() usage 1-3: 310800 KB - after getToplevelMetadata()
saveToDatabase() usage 1-4: 310841 KB - before getParentDocumentUidForSaving()
saveToDatabase() usage 1-5: 310843 KB - before persistAll()
saveToDatabase() usage 1-6: 310844 KB - after persistAll()
Current usage 2: 310826 KB - between saveToDatabase() and add()
Current usage 3: 310894 KB - after add()
Current usage 4: 310880 KB - after clearState()
------------------------------------------------------
Peak usage - end: 311006 KB
2024-05-13 12:46:27 Indexing 358/4320 with UID "359" ("https://digital.slub-dresden.de/data/kitodo/GrelMass_476248485/GrelMass_476248485_mets.xml") on PID 2 and Solr core 1.
Peak usage - begin: 311006 KB
------------------------------------------------------
Current usage 1: 310908 KB - before saveToDatabase()
Document uid: 359
saveToDatabase() usage 1-1: 310908 KB - before setCurrentDocument()
saveToDatabase() usage 1-2: 310938 KB - before getToplevelMetadata()
saveToDatabase() usage 1-3: 310982 KB - after getToplevelMetadata()
saveToDatabase() usage 1-4: 311023 KB - before getParentDocumentUidForSaving()
saveToDatabase() usage 1-5: 311025 KB - before persistAll()
saveToDatabase() usage 1-6: 311027 KB - after persistAll()
Current usage 2: 311008 KB - between saveToDatabase() and add()
Current usage 3: 311057 KB - after add()
Current usage 4: 311043 KB - after clearState()
------------------------------------------------------
Peak usage - end: 311169 KB

The part which is increasing the memory usage is the function AbstractDocument::getToplevelMetadata(). For some reason clearing the AbstractDocument from cache (AbstractDocument::clearDocumentCache()) doesn't bring any result. Even if it should free anything related to document instance.

Results Indexer::add():

2024-05-13 13:06:40 Indexing 4/4320 with UID "5" ("https://digital.slub-dresden.de/data/kitodo/6Saso_476248086/6Saso_476248086_mets.xml") on PID 2 and Solr core 1.
Peak usage - begin: 97918 KB
------------------------------------------------------
Current usage 1: 96115 KB - before saveToDatabase()
Current usage 2: 96288 KB - between saveToDatabase() and add()
add() usage 2-1: 96288 KB - before self::$processedDocs check
add() usage 2-2: 96288 KB - after solr connect
add() usage 2-5: 96288 KB - before document save
add() usage 2-6: 96289 KB - after document save
add() usage 2-7: 96395 KB - after processLogical()
add() usage 2-8: 96395 KB - after processPhysical()
add() usage 2-9: 96395 KB - after update()
Current usage 3: 96394 KB - after add()
Current usage 4: 96380 KB - after clearState()
------------------------------------------------------
Peak usage - end: 97918 KB
2024-05-13 13:06:40 Indexing 5/4320 with UID "6" ("https://digital.slub-dresden.de/data/kitodo/5PiG_476250676/5PiG_476250676_mets.xml") on PID 2 and Solr core 1.
Peak usage - begin: 97918 KB
------------------------------------------------------
Current usage 1: 96433 KB - before saveToDatabase()
Current usage 2: 96602 KB - between saveToDatabase() and add()
add() usage 2-1: 96602 KB - before self::$processedDocs check
add() usage 2-2: 96602 KB - after solr connect
add() usage 2-5: 96602 KB - before document save
add() usage 2-6: 96603 KB - after document save
add() usage 2-7: 96735 KB - after processLogical()
add() usage 2-8: 96735 KB - after processPhysical()
add() usage 2-9: 96735 KB - after update()
Current usage 3: 96734 KB - after add()
Current usage 4: 96720 KB - after clearState()
------------------------------------------------------
Peak usage - end: 97918 KB
2024-05-13 13:06:41 Indexing 6/4320 with UID "7" ("https://digital.slub-dresden.de/data/kitodo/6SoG_476247896/6SoG_476247896_mets.xml") on PID 2 and Solr core 1.
Peak usage - begin: 97918 KB
------------------------------------------------------
Current usage 1: 96885 KB - before saveToDatabase()
Current usage 2: 97369 KB - between saveToDatabase() and add()
add() usage 2-1: 97369 KB - before self::$processedDocs check
add() usage 2-2: 97369 KB - after solr connect
add() usage 2-5: 97369 KB - before document save
add() usage 2-6: 97370 KB - after document save
add() usage 2-7: 97492 KB - after processLogical()
add() usage 2-8: 97492 KB - after processPhysical()
add() usage 2-9: 97491 KB - after update()
Current usage 3: 97491 KB - after add()
Current usage 4: 97477 KB - after clearState()
------------------------------------------------------
Peak usage - end: 97918 KB
2024-05-13 13:06:42 Indexing 7/4320 with UID "8" ("https://digital.slub-dresden.de/data/kitodo/6Kepi_47625079X/6Kepi_47625079X_mets.xml") on PID 2 and Solr core 1.
Peak usage - begin: 97918 KB
------------------------------------------------------
Current usage 1: 97526 KB - before saveToDatabase()
Current usage 2: 97676 KB - between saveToDatabase() and add()
add() usage 2-1: 97676 KB - before self::$processedDocs check
add() usage 2-2: 97676 KB - after solr connect
add() usage 2-5: 97676 KB - before document save
add() usage 2-6: 97677 KB - after document save
add() usage 2-7: 97791 KB - after processLogical()
add() usage 2-8: 97791 KB - after processPhysical()
add() usage 2-9: 97790 KB - after update()
Current usage 3: 97790 KB - after add()
Current usage 4: 97776 KB - after clearState()
------------------------------------------------------
Peak usage - end: 97918 KB
2024-05-13 13:06:42 Indexing 8/4320 with UID "9" ("https://digital.slub-dresden.de/data/kitodo/6FuG_476251729/6FuG_476251729_mets.xml") on PID 2 and Solr core 1.
Peak usage - begin: 97962 KB
------------------------------------------------------
Current usage 1: 97828 KB - before saveToDatabase()
Current usage 2: 97995 KB - between saveToDatabase() and add()
add() usage 2-1: 97995 KB - before self::$processedDocs check
add() usage 2-2: 97995 KB - after solr connect
add() usage 2-5: 97995 KB - before document save
add() usage 2-6: 97996 KB - after document save
add() usage 2-7: 98100 KB - after processLogical()
add() usage 2-8: 98100 KB - after processPhysical()
add() usage 2-9: 98099 KB - after update()
Current usage 3: 98099 KB - after add()
Current usage 4: 98085 KB - after clearState()
------------------------------------------------------
Peak usage - end: 98212 KB
2024-05-13 13:06:44 Indexing 9/4320 with UID "10" ("https://digital.slub-dresden.de/data/kitodo/4ArG_476251087/4ArG_476251087_mets.xml") on PID 2 and Solr core 1.
Peak usage - begin: 98212 KB
------------------------------------------------------
Current usage 1: 98117 KB - before saveToDatabase()
Current usage 2: 98230 KB - between saveToDatabase() and add()
add() usage 2-1: 98230 KB - before self::$processedDocs check
add() usage 2-2: 98230 KB - after solr connect
add() usage 2-5: 98230 KB - before document save
add() usage 2-6: 98231 KB - after document save
add() usage 2-7: 98317 KB - after processLogical()
add() usage 2-8: 98317 KB - after processPhysical()
add() usage 2-9: 98316 KB - after update()
Current usage 3: 98316 KB - after add()
Current usage 4: 98302 KB - after clearState()
------------------------------------------------------
Peak usage - end: 98429 KB

......

2024-05-13 15:11:52 Indexing 3946/4320 with UID "3947" ("https://digital.slub-dresden.de/data/kitodo/DresKr_880547324-1902041201_01-v/DresKr_880547324-1902041201_01-v_mets.xml") on PID 2 and Solr core 1.
Peak usage - begin: 1990710 KB
------------------------------------------------------
Current usage 1: 1990599 KB - before saveToDatabase()
Document parent uid: 3947
Document parent uid: 956
add() usage 2-1: 1990939 KB - before self::$processedDocs check
add() usage 2-1: 1990932 KB - before self::$processedDocs check
Current usage 2: 1990921 KB - between saveToDatabase() and add()
add() usage 2-1: 1990921 KB - before self::$processedDocs check
add() usage 2-2: 1990921 KB - after solr connect
add() usage 2-3: 1990921 KB - before handle parent document
add() usage 2-1: 1990954 KB - before self::$processedDocs check
add() usage 2-4: 1990954 KB - after handle parent document
add() usage 2-5: 1990954 KB - before document save
add() usage 2-6: 1990955 KB - after document save
add() usage 2-7: 1990990 KB - after processLogical()
add() usage 2-8: 1991010 KB - after processPhysical()
add() usage 2-9: 1991010 KB - after update()
Current usage 3: 1991009 KB - after add()
Current usage 4: 1990982 KB - after clearState()
------------------------------------------------------
Peak usage - end: 1991161 KB
2024-05-13 15:11:54 Indexing 3947/4320 with UID "3948" ("https://digital.slub-dresden.de/data/kitodo/DresKr_880547324-1893070101_01-v/DresKr_880547324-1893070101_01-v_mets.xml") on PID 2 and Solr core 1.
Peak usage - begin: 1991161 KB
------------------------------------------------------
Current usage 1: 1991007 KB - before saveToDatabase()
Document parent uid: 3948
Document parent uid: 966
add() usage 2-1: 1991351 KB - before self::$processedDocs check
add() usage 2-1: 1991343 KB - before self::$processedDocs check
Current usage 2: 1991333 KB - between saveToDatabase() and add()
add() usage 2-1: 1991333 KB - before self::$processedDocs check
add() usage 2-2: 1991333 KB - after solr connect
add() usage 2-3: 1991333 KB - before handle parent document
add() usage 2-1: 1991370 KB - before self::$processedDocs check
add() usage 2-4: 1991370 KB - after handle parent document
add() usage 2-5: 1991370 KB - before document save
add() usage 2-6: 1991371 KB - after document save
add() usage 2-7: 1991405 KB - after processLogical()
add() usage 2-8: 1991422 KB - after processPhysical()
add() usage 2-9: 1991421 KB - after update()
Current usage 3: 1991421 KB - after add()
Current usage 4: 1991393 KB - after clearState()
------------------------------------------------------
Peak usage - end: 1991560 KB
2024-05-13 15:11:56 Indexing 3948/4320 with UID "3949" ("https://digital.slub-dresden.de/data/kitodo/DresKr_880547324-1895052501_01-v/DresKr_880547324-1895052501_01-v_mets.xml") on PID 2 and Solr core 1.
Peak usage - begin: 1991560 KB
------------------------------------------------------
Current usage 1: 1991418 KB - before saveToDatabase()
Document parent uid: 3949
Document parent uid: 969
add() usage 2-1: 1991762 KB - before self::$processedDocs check
add() usage 2-1: 1991755 KB - before self::$processedDocs check
Current usage 2: 1991744 KB - between saveToDatabase() and add()
add() usage 2-1: 1991744 KB - before self::$processedDocs check
add() usage 2-2: 1991744 KB - after solr connect
add() usage 2-3: 1991744 KB - before handle parent document
Current usage 3: 1991744 KB - after add()
Current usage 4: 1991719 KB - after clearState()
------------------------------------------------------
Peak usage - end: 1991893 KB

....

2024-05-13 15:24:01 Indexing 4320/4320 with UID "4321" ("https://digital.slub-dresden.de/data/kitodo/DresKr_880547324-1937102602_02-a/DresKr_880547324-1937102602_02-a_mets.xml") on PID 2 and Solr core 1.
Peak usage - begin: 2160047 KB
------------------------------------------------------
Current usage 1: 2159891 KB - before saveToDatabase()
Document parent uid: 4321
Document parent uid: 2514
add() usage 2-1: 2160267 KB - before self::$processedDocs check
add() usage 2-1: 2160259 KB - before self::$processedDocs check
Current usage 2: 2160249 KB - between saveToDatabase() and add()
add() usage 2-1: 2160249 KB - before self::$processedDocs check
add() usage 2-2: 2160249 KB - after solr connect
add() usage 2-3: 2160249 KB - before handle parent document
add() usage 2-1: 2160318 KB - before self::$processedDocs check
add() usage 2-4: 2160318 KB - after handle parent document
add() usage 2-5: 2160318 KB - before document save
add() usage 2-6: 2160319 KB - after document save
add() usage 2-7: 2160353 KB - after processLogical()
add() usage 2-8: 2160362 KB - after processPhysical()
add() usage 2-9: 2160361 KB - after update()
Current usage 3: 2160361 KB - after add()
Current usage 4: 2160333 KB - after clearState()
------------------------------------------------------
Peak usage - end: 2160514 KB

 [OK] All done!

It looks that actually the biggest increase in memory usage brings Indexer::processLogical() function.

One obvious reason why size is growing: self::$processedDocs is getting bigger with every iteration.

Other possible reasons are Indexer::solr or Indexer::fields. There is a risk that one of those static variables stores some data incremental.

I'm continuing my investigation.

beatrycze-volk commented 1 month ago

Changes implemented to improve performance: https://github.com/kitodo/kitodo-presentation/pull/1195 https://github.com/kitodo/kitodo-presentation/pull/1196 https://github.com/kitodo/kitodo-presentation/pull/1197 https://github.com/kitodo/kitodo-presentation/pull/1198 https://github.com/kitodo/kitodo-presentation/pull/1199 https://github.com/kitodo/kitodo-presentation/pull/1200

Feature for enabling the indexing of documents in batches: https://github.com/kitodo/kitodo-presentation/pull/1201

jmechnich commented 1 month ago

@sebastian-meyer Did this actually fix the issue? We are trying to test but the merged code is not working out of the box (the reindex command fails on the second document).

sebastian-meyer commented 1 month ago

I didn't test it myself. Maybe @beatrycze-volk can give us a hint?

BFallert commented 1 month ago

When I start a reindex process, I always get an error message with the 2nd data record


Reindex a collection into database and Solr.
============================================

2024-05-21 14:59:38 Indexing 1/135 with UID "2" ("https://digi.bib.uni-mannheim.de/fileadmin/digi/490431607/490431607.xml") on PID 3 and Solr core 11.
2024-05-21 14:59:54 Indexing 2/135 with UID "4" ("https://digi.bib.uni-mannheim.de/fileadmin/digi/490178111/490178111_anchor.xml") on PID 3 and Solr core 11.

In PersistenceManager.php line 220:

  **[TYPO3\CMS\Extbase\Persistence\Exception\UnknownObjectException (1249479819)]                                 
  The object of type "Kitodo\Dlf\Domain\Model\Document" given to update must be persisted already, but is new.**  

Exception trace:
  at /var/www/typo3/public/typo3/sysext/extbase/Classes/Persistence/Generic/PersistenceManager.php:220
 TYPO3\CMS\Extbase\Persistence\Generic\PersistenceManager->update() at /var/www/typo3/public/typo3/sysext/extbase/Classes/Persistence/Repository.php:117
 TYPO3\CMS\Extbase\Persistence\Repository->update() at /var/www/typo3/packages/bfallert/presentation/Classes/Command/BaseCommand.php:275
 Kitodo\Dlf\Command\BaseCommand->saveToDatabase() at /var/www/typo3/packages/bfallert/presentation/Classes/Command/ReindexCommand.php:204
 Kitodo\Dlf\Command\ReindexCommand->execute() at /var/www/typo3/vendor/symfony/console/Command/Command.php:298
 Symfony\Component\Console\Command\Command->run() at /var/www/typo3/vendor/symfony/console/Application.php:1040
 Symfony\Component\Console\Application->doRunCommand() at /var/www/typo3/vendor/symfony/console/Application.php:301
 Symfony\Component\Console\Application->doRun() at /var/www/typo3/vendor/symfony/console/Application.php:171
 Symfony\Component\Console\Application->run() at /var/www/typo3/public/typo3/sysext/core/Classes/Console/CommandApplication.php:91
 TYPO3\CMS\Core\Console\CommandApplication->run() at /var/www/typo3/vendor/typo3/cms-cli/typo3:23
 {closure}() at /var/www/typo3/vendor/typo3/cms-cli/typo3:24

kitodo:reindex [--dry-run] [-c|--coll COLL] [-p|--pid PID] [-s|--solr SOLR] [-o|--owner [OWNER]] [-a|--all] [-il|--index-limit [INDEX-LIMIT]] [-is|--index-start [INDEX-START]]
beatrycze-volk commented 1 month ago

I was running this many times... Never got into this error. It can be that something has gone wrong when I was splitting changes to own branches. I will check it out.

jmechnich commented 1 month ago

I was running this many times... Never got into this error. It can be that something has gone wrong when I was splitting changes to own branches. I will check it out.

@beatrycze-volk Is it possible that a clean index from scratch is required for things to work after including your latest changes?

beatrycze-volk commented 1 month ago

I have checked out and found out what is the reason. Of course while I was moving changes over I have missed the one part.

But in fact I'm not sure if this change is even needed now as later I have switched my approach to chunk indexing. Actually state clearing makes now more sense after given chunk of the documents instead after every document. I will make today PR for it.

beatrycze-volk commented 1 month ago

I was running this many times... Never got into this error. It can be that something has gone wrong when I was splitting changes to own branches. I will check it out.

@beatrycze-volk Is it possible that a clean index from scratch is required for things to work after including your latest changes?

No, that's not necessary.

beatrycze-volk commented 1 month ago

https://github.com/kitodo/kitodo-presentation/pull/1203 here also bug fix for reindex options.

jmechnich commented 1 month ago

We have added the change from PR #1202 manually and reindexing now proceeds past the first two docs. We will report back any news about memory issues. Thanks @beatrycze-volk !

sebastian-meyer commented 1 month ago

I merged both PRs. Can you confirm that I can close this issue?

BFallert commented 1 month ago

I merged both PRs. Can you confirm that I can close this issue?

I have adopted both PRs. Currently, there is still the problem with the permanent increase in memory consumption. So far there has been no improvement, so the memory leak has not been fixed. It is now only possible to work around this problem by manually setting a limit and a starting point (see also https://github.com/kitodo/kitodo-presentation/issues/1205). For large collections, however, this is somewhat complicated, as the memory consumption depends not only on the number of documents but also on the number of pages of the documents.

sebastian-meyer commented 1 month ago

Would it be possible to let the indexer fork itself after indexing a couple of documents? I haven't worked with multi-threading in PHP, yet, but maybe someone knows how to do it? I assume having multiple threads may help with garbage collection and constantly freeing up memory.

What do you think?

beatrycze-volk commented 1 month ago

I have tried to even manually force garbage collection after each single document. It didn't give any improvement. As it was constantly showing no cycles to be collected and memory growing.

I was trying to close and recreate SOLR client after every document, the same with persistence manager. Still no improvement.

I have split functions into smaller as this should call garbage collections more often - no improvement.

I was unsetting every variable in every loop, still no difference.

Then I have started to search in the internet and found this quiet old topic - https://stackoverflow.com/a/849618/2701807:

There are several possible points of memory leaking in php:

php itself php extension php library you use your php code It is quite hard to find and fix the first 3 without deep reverse engineering or php source code knowledge. For the last one you can use binary search for memory leaking code with memory_get_usage

And there also newer comment:

Its a shame that even php 7.2 they are not able to fix core php memory leaks. You cannot run long running processes in it.

The upgrade has introduced not only changes in our code, but also usage of new stuff for e.g Extbase. As I was trying removing and resetting everything in our code I would sadly suspect that problem can be in some other layer.

beatrycze-volk commented 1 month ago

Would it be possible to let the indexer fork itself after indexing a couple of documents? I haven't worked with multi-threading in PHP, yet, but maybe someone knows how to do it? I assume having multiple threads may help with garbage collection and constantly freeing up memory.

What do you think?

I was thinking about it and the answer is no. I wanted to implement the functionality in which one command will call other commands something like administrator calls reindex all and then Kitodo calls commands for chunk indexing so the memory is freed in between. but according to the documentation it is not possible:

Note that all the commands will run in the same process and some of Symfony's built-in commands may not work well this way. For instance, the cache:clear and cache:warmup commands change some class definitions, so running something after them is likely to break. https://symfony.com/doc/current/console/calling_commands.html

all the commands will run in the same process - then no difference for our case as garbage collection will still do nothing.

There are currently two possibilities:

sebastian-meyer commented 1 month ago

OK, I'll release the current version with the workaround although it's quite annoying.

For 5.1 we already planned to support newer PHP version up to 8.3, so maybe this will help with indexing as well (see #1172).

sebastian-meyer commented 1 month ago

The master branch now supports PHP up to version 8.3.

stweil commented 1 month ago

I don't expect that PHP 8.3 will have an effect on the memory leak, but maybe we'll know more soon.