kitzeslab / opensoundscape

Open source, scalable software for the analysis of bioacoustic recordings
http://opensoundscape.org
MIT License

automatic self-profiling in train and predict #902

Open sammlapp opened 8 months ago

sammlapp commented 8 months ago

CNN or Preprocessor could offer a way to profile the speed of preprocessing and inference during .train() and .predict().

sammlapp commented 3 months ago

Maybe instead of happening automatically during .train(), we could implement results_dictionary = CNN.profile(samples), so the user decides when to profile and can easily add it to a notebook or script.
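A rough sketch of what such a method could look like (the name CNN.profile, its arguments, and the result keys are all hypothetical, not an existing API; it just wraps the same calls used in the exploratory code below):

from time import time as timer

import numpy as np
from opensoundscape.sample import AudioSample


def profile(model, samples_df, n=100, batch_size=32, num_workers=0, n_batches=10):
    """Hypothetical CNN.profile(): time preprocessing and batch loading.

    Reuses the same calls as the exploratory code below and returns a
    results dictionary the user can inspect or log.
    """
    samples_df = samples_df.astype(bool)  # multi-hot labels as booleans

    # time preprocessing of individual samples (no dataloader)
    prep_times = []
    for _, row in samples_df.sample(n).iterrows():
        s = AudioSample.from_series(row)
        t0 = timer()
        model.preprocessor.forward(s)
        prep_times.append(timer() - t0)

    # time batch loading through the training dataloader
    dl = model._init_train_dataloader(
        samples_df, batch_size=batch_size, num_workers=num_workers, raise_errors=True
    )
    batch_times = []
    t0 = timer()
    for i, _batch in enumerate(dl):
        if i >= n_batches:
            break
        batch_times.append(timer() - t0)
        t0 = timer()

    return {
        "sample_preprocess_time_mean": float(np.mean(prep_times)),
        "sample_preprocess_time_max": float(np.max(prep_times)),
        "batch_load_time_mean": float(np.mean(batch_times)),
        "batch_load_time_max": float(np.max(batch_times)),
    }


# usage: results_dictionary = profile(m, train_df)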

Here's an example of some manual profiling:

import numpy as np

from opensoundscape.sample import AudioSample

# `m` is an opensoundscape CNN instance; `train_df` is its multi-hot label dataframe

# profile preprocessing of one sample, check the amount of time taken by each preprocessing step
m.preprocessor.pipeline.overlay.set(overlay_prob=1)
s = AudioSample.from_series(train_df.iloc[0])
s.labels = s.labels.astype(bool)
m.preprocessor.pipeline.overlay.overlay_df = (
    train_df.sample(3000).astype(bool).reset_index()
)
s = m.preprocessor.forward(s, profile=True)
s.runtime

# preprocess a bunch of samples in batches with a dataloader, n_workers>1
dl = m._init_train_dataloader(
    train_df.sample(3000).astype(bool),
    batch_size=32,
    num_workers=16,
    raise_errors=True,
)
ds = dl.dataset.dataset

from time import time as timer
from tqdm.autonotebook import tqdm

t0 = timer()
batch_times = []
for i, batch in enumerate(tqdm(dl)):
    if i >= 40:
        break
    batch_times.append(timer() - t0)
    t0 = timer()
print(
    f"batch loading time: mean {np.mean(batch_times):.02f} max {np.max(batch_times):.02f}"
)

# iterate the dataset directly: no batching or dataloader parallelization
t0 = timer()
prep_times = []
d = ds.sample(n=100)
d.label_df = d.label_df.astype(bool)
for s in tqdm(d):
    prep_times.append(timer() - t0)
    t0 = timer()
print(
    f"sample loading time: mean {np.mean(prep_times):.02f} max {np.max(batch_times):.02f}"
)

We could also profile the forward and backward pass speed of the network, e.g. along the lines of the sketch below.
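A minimal sketch of that, assuming m.network is the underlying torch nn.Module with one output per class in m.classes, and that a 3x224x224 dummy input roughly matches the preprocessor's output shape (CUDA is synchronized so GPU timings are meaningful):

import numpy as np
import torch
from time import time as timer

net = m.network  # assumes the underlying torch nn.Module lives at m.network
device = next(net.parameters()).device
net.train()

# dummy batch; adjust shape and channels to match the preprocessor's output
x = torch.rand(32, 3, 224, 224, device=device)
y = torch.zeros(32, len(m.classes), device=device)  # dummy multi-hot targets

optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
loss_fn = torch.nn.BCEWithLogitsLoss()

fwd_times, bwd_times = [], []
for _ in range(20):
    # forward pass
    t0 = timer()
    logits = net(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    fwd_times.append(timer() - t0)

    # backward pass + optimizer step
    loss = loss_fn(logits, y)
    t0 = timer()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if device.type == "cuda":
        torch.cuda.synchronize()
    bwd_times.append(timer() - t0)

print(
    f"forward pass: mean {np.mean(fwd_times):.03f}s "
    f"backward pass + step: mean {np.mean(bwd_times):.03f}s"
)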