justinsalamon / scaper

A library for soundscape synthesis and augmentation
BSD 3-Clause "New" or "Revised" License
377 stars 55 forks source link

redundant unnecessary repetitive calling of _validate_source_file() in nested loops #139

Open Moadab-AI opened 3 years ago

Moadab-AI commented 3 years ago

I am using the latest version of the repository (1.6.4) and when generating soundscapes I noticed a prohibitive slow generation of soundscapes in a loop. here's what I found out:

So following the example in the documentation : https://scaper.readthedocs.io/en/latest/examples.html in the add_event() method when I specify a list of source files for event (instead of []), every time it wasnt to add an event it will go through validating the entire list of source files because of calling "_validate_source_file()" in "_validateevent()" ! this is extremely redundant and unnecessary because Instead of doing so only "one" necessary time it will do it : {number of soundscapes times number of_added_event_per_soundscape} times. particularly if the list of event source files is long (as in my case).

The same argument of course applies when _validate event is called in add_background. So I believe the _validate_source_file must be moved out of all loops and be just called once for the list of background files and source files through a separate method outside of all the loops. I am just commenting them out in my case.

Also similary if you dont specify the list of source files in the add_event or add_background and go with [] instead, this redundant file checking happens through another function in utils.py called : "_get_sorted_files()" where in the last line of the method is written: files = [f for f in files if os.path.isfile(f)]

this line is again is a huge unnecessary overhead and should be performed outside all the loops and only once for all background and source files.

justinsalamon commented 3 years ago

The reason every distribution tuple is validated when an event is added is that it can be different each time.

In your case you always add the same list of source files, but in general each time you add an event you could specify a different list of source files (or a different distribution tuple altogether) meaning we must validate the distribution tuple every time we call add_event.

I we don't validate the source files when an event is added it could cause Scaper to crash downstream. So from a design standpoint I believe what we are doing is the correct thing to do.

We could, potentially, add a flag to add_event to disable source file validation (e.g. disable_source_file_validation=False by default), since this is the only validation that involves file I/O - I believe all the other validation steps are insignificant computationally.

@moabaom To help us better understand how serious an issue this is, can you share a Minimal Working Example (MWE) and report the execution time with and without source file validation? Thanks!

justinsalamon commented 3 years ago

p.s - actually another option would be to create a "cache" for validated files, such that if a scaper object has already validated that a file exists, it doesn't try to validate it again. This is probably safer than disabling validation and achieves the same speedup. If you create a new scaper object in a loop (like in the example), we could support providing a list of validated files as input, and the scaper object will not try to validate these source files.

Still, let's start with the MWE and timings to better scope the problem. thanks.