class MultiProcessRecording drops various recording methods

mrichar1 commented 1 year ago

When using DirectoryMultiProcessingAnalyzer, it's not clear why it internally uses the MultiProcessRecording class. This seems to just return a subset of methods/attrs of a passed-in Recording instance (config, detections, path etc), meaning that access to other Recording methods is lost, e.g. extract_detections_as_audio.

Rather than iterating over the list of Recording instances, generating a list of MultiProcessRecording instances, would it not be better to just return the initial list, as happens in DirectoryAnalyzer ?

I'm happy to work out a PR for this, if you think the above suggestion is a good idea.

joeweiss commented 1 year ago

Hi @mrichar1, apologies for the delay in responding.

The MultiProcessRecording class exists because the Pool can't pass objects, like theRecording instances, to the threads and back. The data has to be pickled or serialized. Currently, the code is passing dicts around for pooled threads.

Question: at what point do you need the extraction functionality during your analysis?

If you're trying to add the extract_ functionality as part of the multi-process, then that functionality would need to be added to the process_from_queue method. I think that was my original plan, but I never had a use case and didn't settle on an interface. Personally, I use a database as a queue and run multiple processors in single threads (usually 8 on a machine at a time).

If you're ok with those extraction functions being run in a single thread post-analysis, then this is probably an easier issue to solve. MultiProcessRecording could be changed to be a subclass of RecordingBase. Again, those extract_ methods wouldn't be multi-threaded, but it would provide the interface you're looking for ... depends on your needs.

I'm happy to further discuss or review a PR if you're interested.

Joe

mrichar1 commented 1 year ago

Hi,

Thanks for getting back to me and explaining things.

I think for my use case (and probably more generally) it would be sufficient for the class to return 'more complete' Recording instances with the extract_ methods, for these to be run separately after the multi-process.

I'm wanting to look at the recording metadata and extract audio/waveforms only for those with a high confidence interval, or just once for each species from each set of recordings. As you say, designing an interface incorporating this kind of logic into the multi-process might be awkward - while you could accept a function to call against each recording, capturing the 'results' from that function and returning them in a standard way could prove tricky!

For my use case I suspect single-thread post-analysis will be sufficient, as I'm likely to be discarding the majority of recordings anyway. Also there's nothing stopping me from using a separate Pool to do the extractions in parallel if needed.

So yes, making MultiProcessRecording a subclass of RecordingBase seems like a sensible way to do it - let me know if this is something you can do, or if it would be easier for me to try a PR for review?

joeweiss commented 1 year ago

So yes, making MultiProcessRecording a subclass of RecordingBase seems like a sensible way to do it - let me know if this is something you can do, or if it would be easier for me to try a PR for review?

I've got a PR in progress for an unrelated multiprocessing improvement, so I'll take care of this issue as part of that work. I'll merge that up in the next hour or so.

joeweiss commented 1 year ago

This has been released as 0.12.0. Thanks for the posting the issue and sharing your use case!

joeweiss / birdnetlib

class MultiProcessRecording drops various recording methods #88