kitodo / kitodo-production

Kitodo.Production is a workflow management tool for mass digitization and is part of the Kitodo Digital Library Suite.
http://www.kitodo.org/software/kitodoproduction/
GNU General Public License v3.0

Lots of file existence checks when creating newspapers #4220

Open matthias-ronge opened 3 years ago

matthias-ronge commented 3 years ago

When creating newspapers, the existence of the same files is checked again and again. File system access is comparatively slow in Java, so these repeated checks should be avoided.

[INFO ] 2021-02-26 13:16:46.690 [Erzeuge Zeitungsvorgänge: BudiFrunA_020166176] MetsService - Saving 3146/meta.xml
[TRACE] 2021-02-26 13:16:46.997 [Erzeuge Zeitungsvorgänge: BudiFrunA_020166176] FileManagement - Found 3146//meta.xml
[TRACE] 2021-02-26 13:16:46.998 [Erzeuge Zeitungsvorgänge: BudiFrunA_020166176] FileManagement - Found 3146//meta.xml
[TRACE] 2021-02-26 13:16:46.999 [Erzeuge Zeitungsvorgänge: BudiFrunA_020166176] FileManagement - Found 3146//meta.xml
[INFO ] 2021-02-26 13:16:47.000 [Erzeuge Zeitungsvorgänge: BudiFrunA_020166176] MetsService - Reading 3146//meta.xml
[TRACE] 2021-02-26 13:16:47.846 [Erzeuge Zeitungsvorgänge: BudiFrunA_020166176] FileManagement - No such file: 3147/meta.xml
[INFO ] 2021-02-26 13:16:47.846 [Erzeuge Zeitungsvorgänge: BudiFrunA_020166176] ProcessService - No metadata file for indexing: 3147/meta.xml
[TRACE] 2021-02-26 13:16:47.846 [Erzeuge Zeitungsvorgänge: BudiFrunA_020166176] FileManagement - No such file: 3147/meta.xml
[INFO ] 2021-02-26 13:16:47.846 [Erzeuge Zeitungsvorgänge: BudiFrunA_020166176] ProcessService - Could not determine base type for process BudiFrunA_020166176-1981 [3147]: Metadata file not found : 3147/meta.xml
[TRACE] 2021-02-26 13:16:48.812 [Erzeuge Zeitungsvorgänge: BudiFrunA_020166176] FileManagement - Found 3146//meta.xml
[TRACE] 2021-02-26 13:16:48.813 [Erzeuge Zeitungsvorgänge: BudiFrunA_020166176] FileManagement - Found 3146//meta.xml
[TRACE] 2021-02-26 13:16:48.815 [Erzeuge Zeitungsvorgänge: BudiFrunA_020166176] FileManagement - Found 3146//meta.xml
[INFO ] 2021-02-26 13:16:48.816 [Erzeuge Zeitungsvorgänge: BudiFrunA_020166176] MetsService - Reading 3146//meta.xml
[TRACE] 2021-02-26 13:16:49.733 [Erzeuge Zeitungsvorgänge: BudiFrunA_020166176] FileManagement - Found 3361/meta.xml
[TRACE] 2021-02-26 13:16:49.733 [Erzeuge Zeitungsvorgänge: BudiFrunA_020166176] FileManagement - Found 3361/meta.xml
[TRACE] 2021-02-26 13:16:49.735 [Erzeuge Zeitungsvorgänge: BudiFrunA_020166176] FileManagement - Found 3361/meta.xml
[INFO ] 2021-02-26 13:16:49.735 [Erzeuge Zeitungsvorgänge: BudiFrunA_020166176] MetsService - Reading 3361/meta.xml
[TRACE] 2021-02-26 13:16:50.619 [Erzeuge Zeitungsvorgänge: BudiFrunA_020166176] FileManagement - Found 3146//meta.xml
[TRACE] 2021-02-26 13:16:50.619 [Erzeuge Zeitungsvorgänge: BudiFrunA_020166176] FileManagement - Found 3146//meta.xml
[TRACE] 2021-02-26 13:16:50.621 [Erzeuge Zeitungsvorgänge: BudiFrunA_020166176] FileManagement - Found 3146//meta.xml
[INFO ] 2021-02-26 13:16:50.622 [Erzeuge Zeitungsvorgänge: BudiFrunA_020166176] MetsService - Reading 3146//meta.xml
[TRACE] 2021-02-26 13:16:51.576 [Erzeuge Zeitungsvorgänge: BudiFrunA_020166176] FileManagement - No such file: 3147/meta.xml
[INFO ] 2021-02-26 13:16:51.576 [Erzeuge Zeitungsvorgänge: BudiFrunA_020166176] ProcessService - No metadata file for indexing: 3147/meta.xml
[TRACE] 2021-02-26 13:16:51.576 [Erzeuge Zeitungsvorgänge: BudiFrunA_020166176] FileManagement - No such file: 3147/meta.xml
[INFO ] 2021-02-26 13:16:51.576 [Erzeuge Zeitungsvorgänge: BudiFrunA_020166176] ProcessService - Could not determine base type for process BudiFrunA_020166176-1981 [3147]: Metadata file not found : 3147/meta.xml
[TRACE] 2021-02-26 13:16:52.278 [Erzeuge Zeitungsvorgänge: BudiFrunA_020166176] FileManagement - Found 3146//meta.xml
[TRACE] 2021-02-26 13:16:52.278 [Erzeuge Zeitungsvorgänge: BudiFrunA_020166176] FileManagement - Found 3146//meta.xml
[TRACE] 2021-02-26 13:16:52.280 [Erzeuge Zeitungsvorgänge: BudiFrunA_020166176] FileManagement - Found 3146//meta.xml

Goal: For known files, don’t check existence.
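One way to read this goal is to remember which files the current task has already written or found, and to consult the file system only for paths it has not seen yet. The following is a minimal sketch under that assumption. The class `KnownFiles` and its methods are hypothetical and are not part of Kitodo.Production's `FileManagement`; only `java.nio.file` calls from the JDK (Java 11 or later) are used.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of Kitodo.Production. */
final class KnownFiles {
    // Normalized absolute paths of files this task has written or already found.
    // Plain HashSet: this sketch assumes use from a single thread.
    private final Set<Path> known = new HashSet<>();
    // Counts real file system probes, only to make the effect visible.
    private int diskChecks = 0;

    /** Records a file that was just written, so later lookups skip the disk. */
    void markKnown(Path file) {
        known.add(normalize(file));
    }

    /** Returns true if the file is known or found on disk. Only positive results are remembered. */
    boolean exists(Path file) {
        Path key = normalize(file);
        if (known.contains(key)) {
            return true;
        }
        diskChecks++;
        boolean found = Files.exists(key);
        if (found) {
            known.add(key);
        }
        return found;
    }

    int diskChecks() {
        return diskChecks;
    }

    private static Path normalize(Path file) {
        return file.toAbsolutePath().normalize();
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("known-files-demo");
        Path meta = Files.writeString(dir.resolve("meta.xml"), "<mets/>");

        KnownFiles files = new KnownFiles();
        files.markKnown(meta); // the task just saved this file
        for (int i = 0; i < 3; i++) {
            files.exists(meta); // answered from memory
        }
        System.out.println("disk checks: " + files.diskChecks());
    }
}
```

Negative results are deliberately not remembered: in the log above, 3147/meta.xml does not exist yet but presumably will once that process is saved, so a cached "no such file" answer could go stale. A real implementation would also have to decide when known entries become invalid, for example when a file is deleted or renamed.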

andre-hohmann commented 2 years ago

@matthias-ronge: Could this be the reason for the following issue?

Performance problem: Creating newspaper processes - creation of processes is very slow #4760

The behaviour of #4760 occurs in our test system with more than 400,000 processes. In the preview system the creation of newspaper processes is not slow.

andre-hohmann commented 2 years ago

However, the general check for duplicate processes must be possible before newspaper processes are created!

matthias-ronge commented 2 years ago

> @matthias-ronge: Could this be the reason for the following issue?
>
> Performance problem: Creating newspaper processes - creation of processes is very slow #4760
>
> The behaviour of #4760 occurs in our test system with more than 400,000 processes. In the preview system the creation of newspaper processes is not slow.

No idea; but when process creation slows down with the total number of processes, we should pay attention to the database queries and index actions that are taking place. If, for each newly created process, the project is re-indexed, and the meta.xml is parsed for each process of the project—or something comparable—then these actions add up.
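To make "add up" concrete, here is a rough cost model. It is an assumption for illustration, not a measurement or a description of what Kitodo.Production actually does. It supposes that each newly created process triggers one meta.xml parse for every process already in the project. The class `ReparseCost` and the method `totalParses` are hypothetical names.

```java
/** Hypothetical cost model for illustration; not a measurement of Kitodo.Production. */
final class ReparseCost {

    /**
     * Total number of meta.xml parses if creating each new process re-parses
     * every process that is already in the project at that moment.
     */
    static long totalParses(long existing, long created) {
        long total = 0;
        for (long i = 0; i < created; i++) {
            total += existing + i; // processes present before the i-th creation
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(totalParses(0, 100));   // 4950
        System.out.println(totalParses(0, 1000));  // 499500
        System.out.println(totalParses(1000, 10)); // 10045
    }
}
```

Under this assumption, creating ten times as many processes costs roughly a hundred times as many parses, which would match a slowdown that only shows up on a system with many processes.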

andre-hohmann commented 2 years ago

@matthias-ronge: Is this still a problem or has it been solved by the following pull request?