korpling / pepperModules-MergingModule

This project provides a Pepper module for the merging of data on several possible levels.

Merging module speed strongly depends on file tree structure #16

Open MartinKl opened 3 years ago

MartinKl commented 3 years ago

When merging data, the merger matches corresponding files by their file path. A large number of files in the same folder (or parent node) seems to slow down processing significantly; I am not even certain the process would ever terminate.
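To make "matching by file path" concrete, here is a minimal sketch of how two import trees could be paired. It is not Pepper's actual implementation: the helper names `document_key` and `pair_by_path` are hypothetical, and it assumes that two files correspond when their paths below the format root are identical once the file extension is dropped.

```python
from pathlib import PurePosixPath


def document_key(path, root):
    """Return the path below `root` without the file extension."""
    relative = PurePosixPath(path).relative_to(root)
    return relative.with_suffix("").as_posix()


def pair_by_path(paths_a, root_a, paths_b, root_b):
    """Pair files from two trees whose keys are equal.

    Assumption: one dictionary lookup per file, so the cost grows
    linearly with the number of files, regardless of folder size.
    """
    index_b = {document_key(p, root_b): p for p in paths_b}
    pairs, unmatched = [], []
    for path in paths_a:
        key = document_key(path, root_a)
        if key in index_b:
            pairs.append((path, index_b[key]))
        else:
            unmatched.append(path)
    return pairs, unmatched
```

A check like this can also be run on the input folders before starting Pepper, to confirm that every file in one format has a partner in the other.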

My personal example, simplified: I have data from speakers of two age groups (adolescents, adults) and two speaker types (monolingual vs. bilingual). For each of them I have a file in two formats. Consider the following arrangement (I):

FORMAT_A/CORPUS/MONOLINGUAL/*.format_a (48 files)
FORMAT_A/CORPUS/BILINGUAL/*.format_a (128 files)
FORMAT_B/CORPUS/MONOLINGUAL/*.format_b (48 files)
FORMAT_B/CORPUS/BILINGUAL/*.format_b (128 files)

Trying to merge the imports of FORMAT_A_Importer and FORMAT_B_Importer either does not terminate or is at least extremely slow.

Another view on the data could be (II):

FORMAT_A/CORPUS/ADULTS/MONOLINGUAL/*.format_a (24 files)
FORMAT_A/CORPUS/ADULTS/BILINGUAL/*.format_a (64 files)
FORMAT_A/CORPUS/ADOLESCENTS/MONOLINGUAL/*.format_a (24 files)
FORMAT_A/CORPUS/ADOLESCENTS/BILINGUAL/*.format_a (64 files)
FORMAT_B/CORPUS/ADULTS/MONOLINGUAL/*.format_b (24 files)
FORMAT_B/CORPUS/ADULTS/BILINGUAL/*.format_b (64 files)
FORMAT_B/CORPUS/ADOLESCENTS/MONOLINGUAL/*.format_b (24 files)
FORMAT_B/CORPUS/ADOLESCENTS/BILINGUAL/*.format_b (64 files)

Arranging the data like this leads to successful merging. I am not sure what causes the difference, but I assume pairing documents works more efficiently or does not lock up. Just a guess.
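As a workaround, arrangement (I) can be converted into arrangement (II) with a small script. The following is only a sketch under assumptions: the function `regroup` and the callback `group_of` are hypothetical names, and how the age group is derived from a file name depends on your own naming scheme.

```python
import shutil
from pathlib import Path


def regroup(corpus_root, group_of):
    """Move CORPUS/<TYPE>/<file> to CORPUS/<group>/<TYPE>/<file>.

    `group_of` is a hypothetical callback that maps a file name to
    the name of the new intermediate folder (e.g. "ADULTS").
    """
    corpus_root = Path(corpus_root)
    # Snapshot the file list first, because moving files changes the tree.
    files = [p for p in corpus_root.glob("*/*") if p.is_file()]
    moved = []
    for src in files:
        speaker_type = src.parent.name
        target_dir = corpus_root / group_of(src.name) / speaker_type
        target_dir.mkdir(parents=True, exist_ok=True)
        dst = target_dir / src.name
        shutil.move(str(src), str(dst))
        moved.append(dst)
    # Remove the old speaker-type folders once they are empty.
    for old_dir in {p.parent for p in files}:
        if not any(old_dir.iterdir()):
            old_dir.rmdir()
    return moved
```

The same call has to be applied to both FORMAT_A/CORPUS and FORMAT_B/CORPUS with the same `group_of`, so that the relative paths of corresponding files stay identical.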

During the non-terminating scenario (I), all processor cores run under full load until Pepper is stopped by keyboard interrupt. Progress updates are printed, but from what I can tell no progress is made (I am not entirely sure about that).