kitzeslab / opensoundscape

Open source, scalable software for the analysis of bioacoustic recordings
http://opensoundscape.org
MIT License

one_hot_clip_labels is slow #770

Open sammlapp opened 1 year ago

sammlapp commented 1 year ago

Pandas gives PerformanceWarning about highly fragmented dataframe

louisfh commented 11 months ago

one_hot_clip_labels also raises

SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

Let's fix both at the same time.

sammlapp commented 7 months ago

This may be one of those places where we just need to avoid adding rows to a df inside a loop. Instead, we should use a single command after the for loop, either joining dfs or creating a df from an array.
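A minimal sketch of that pattern, not the actual opensoundscape code: fill a NumPy array inside the loop and construct the DataFrame once afterwards. The clip times, class names, and the `detections` list are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs, for illustration only
clip_starts = np.arange(0.0, 10.0, 2.0)  # five 2-second clips
clip_ends = clip_starts + 2.0
classes = ["a", "b"]
detections = [(0, "a"), (1, "a"), (1, "b")]  # (clip position, class) pairs

# Fill a plain array inside the loop: no DataFrame rows or columns are added here
labels = np.zeros((len(clip_starts), len(classes)), dtype=int)
for clip_pos, cls in detections:
    labels[clip_pos, classes.index(cls)] = 1

# Build the DataFrame once, after the loop
index = pd.MultiIndex.from_arrays(
    [clip_starts, clip_ends], names=["start_time", "end_time"]
)
clip_df = pd.DataFrame(labels, index=index, columns=classes)
```

Because the frame is created in a single call, nothing is inserted into it incrementally, which is the usual trigger for the fragmentation warning.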

sammlapp commented 5 months ago

That was not the only slow part, turns out. Requires more investigation

louisfh commented 3 weeks ago

The slow part of BoxedAnnotations.one_hot_clip_labels is actually its call to BoxedAnnotations.one_hot_labels_like.

louisfh commented 3 weeks ago

I think the problem is that for each time interval (row) in the desired clip_df, the current workflow evaluates the overlap with every annotation in the annotations dataframe. For example, if there are 10^2 annotations in the annotations_df and you want to output the 'one-hot' labels for 10^4 time intervals, there are 10^4 * 10^2 = 10^6 calls to the function `overlap([time_window_start, time_window_end], [annotation_start, annotation_end])`.

A better implementation would be to iterate through the annotations, find which rows in clip_df each one overlaps, and then set only those rows.

# pseudocode
output_df = make_clip_df(filename, clip_duration)
rows_to_be_changed = []
for annotation in annotations.df:  # one annotation (row) at a time
    overlaps = find_overlapping_time_windows(annotation, output_df)
    rows_to_be_changed.append(overlaps)

for row in rows_to_be_changed:
    output_df.loc[row.index] = row.values

sammlapp commented 3 weeks ago

nice find! keep in mind that annotations are for a single class and there can be multiple classes annotated for one clip.
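A rough sketch of the per-annotation idea that keeps one column per class, so several classes can be labeled on the same clip. It is not the library's implementation. It assumes an annotations DataFrame with columns `start_time`, `end_time`, and `annotation`, clips of equal duration sorted by start time, and a hypothetical `min_overlap` threshold in seconds; the function name is made up.

```python
import numpy as np
import pandas as pd

def one_hot_labels_sketch(annotations, clip_starts, clip_duration, classes, min_overlap=0.25):
    """Label clips per class by visiting each annotation once (illustrative only)."""
    clip_starts = np.asarray(clip_starts, dtype=float)  # assumed sorted ascending
    clip_ends = clip_starts + clip_duration
    class_col = {c: i for i, c in enumerate(classes)}
    labels = np.zeros((len(clip_starts), len(classes)), dtype=int)

    for ann in annotations.itertuples(index=False):
        if ann.annotation not in class_col:
            continue
        # Binary search for the only clips that can overlap this annotation:
        # first clip whose end is after the annotation start,
        # up to (not including) the first clip starting at or after its end
        first = np.searchsorted(clip_ends, ann.start_time, side="right")
        last = np.searchsorted(clip_starts, ann.end_time, side="left")
        candidates = np.arange(first, last)
        overlap = (
            np.minimum(clip_ends[candidates], ann.end_time)
            - np.maximum(clip_starts[candidates], ann.start_time)
        )
        # Set only this annotation's class column, leaving other classes untouched
        labels[candidates[overlap >= min_overlap], class_col[ann.annotation]] = 1

    index = pd.MultiIndex.from_arrays(
        [clip_starts, clip_ends], names=["start_time", "end_time"]
    )
    return pd.DataFrame(labels, index=index, columns=classes)

# Usage with made-up annotations
annotations = pd.DataFrame(
    {
        "start_time": [1.0, 3.0, 8.5],
        "end_time": [3.5, 4.2, 9.0],
        "annotation": ["a", "b", "a"],
    }
)
result = one_hot_labels_sketch(annotations, np.arange(0.0, 10.0, 2.0), 2.0, ["a", "b"])
```

With this layout each annotation touches only the few clips it can overlap, so the work scales roughly with the number of annotations times the number of clips each one spans, rather than with annotations times all clips.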

sammlapp commented 2 days ago

does this also resolve #412 ?