Closed mrichar1 closed 1 year ago
Hi @mrichar1, apologies for the delay in responding.
The MultiProcessRecording
class exists because the Pool
can't pass objects, like theRecording
instances, to the threads and back. The data has to be pickled or serialized. Currently, the code is passing dicts around for pooled threads.
Question: at what point do you need the extraction functionality during your analysis?
If you're trying to add the extract_
functionality as part of the multi-process, then that functionality would need to be added to the process_from_queue
method. I think that was my original plan, but I never had a use case and didn't settle on an interface. Personally, I use a database as a queue and run multiple processors in single threads (usually 8 on a machine at a time).
If you're ok with those extraction functions being run in a single thread post-analysis, then this is probably an easier issue to solve. MultiProcessRecording
could be changed to be a subclass of RecordingBase
. Again, those extract_
methods wouldn't be multi-threaded, but it would provide the interface you're looking for ... depends on your needs.
I'm happy to further discuss or review a PR if you're interested.
Joe
Hi,
Thanks for getting back to me and explaining things.
I think for my use case (and probably more generally) it would be sufficient for the class to return 'more complete' Recording
instances with the extract_
methods, for these to be run separately after the multi-process.
I'm wanting to look at the recording metadata and extract audio/waveforms only for those with a high confidence interval, or just once for each species from each set of recordings. As you say, designing an interface incorporating this kind of logic into the multi-process might be awkward - while you could accept a function to call against each recording, capturing the 'results' from that function and returning them in a standard way could prove tricky!
For my use case I suspect single-thread post-analysis will be sufficient, as I'm likely to be discarding the majority of recordings anyway. Also there's nothing stopping me from using a separate Pool
to do the extractions in parallel if needed.
So yes, making MultiProcessRecording
a subclass of RecordingBase
seems like a sensible way to do it - let me know if this is something you can do, or if it would be easier for me to try a PR for review?
So yes, making
MultiProcessRecording
a subclass ofRecordingBase
seems like a sensible way to do it - let me know if this is something you can do, or if it would be easier for me to try a PR for review?
I've got a PR in progress for an unrelated multiprocessing improvement, so I'll take care of this issue as part of that work. I'll merge that up in the next hour or so.
This has been released as 0.12.0. Thanks for the posting the issue and sharing your use case!
When using
DirectoryMultiProcessingAnalyzer
, it's not clear why it internally uses theMultiProcessRecording
class. This seems to just return a subset of methods/attrs of a passed-inRecording
instance (config
,detections
,path
etc), meaning that access to otherRecording
methods is lost, e.g.extract_detections_as_audio
.Rather than iterating over the list of
Recording
instances, generating a list ofMultiProcessRecording
instances, would it not be better to just return the initial list, as happens inDirectoryAnalyzer
?I'm happy to work out a PR for this, if you think the above suggestion is a good idea.