Open sammlapp opened 1 year ago
one_hot_clip_labels also raises
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Let's fix both at the same time.
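For reference, a minimal sketch of the two usual ways to avoid this warning. The table and its column names (start_time, end_time, annotation) are made up for illustration and are not taken from the actual one_hot_clip_labels code.

```python
import pandas as pd

# Hypothetical annotation table; column names are assumptions for illustration.
df = pd.DataFrame(
    {
        "start_time": [0.0, 2.0, 5.0],
        "end_time": [1.5, 4.0, 6.0],
        "annotation": ["a", "b", "a"],
    }
)

# Option 1: take an explicit copy of the subset, so later assignments
# clearly target a new, independent DataFrame.
subset = df[df["annotation"] == "a"].copy()
subset["duration"] = subset["end_time"] - subset["start_time"]

# Option 2: to modify the original, select rows and column in a single
# .loc call instead of chaining df[mask]["end_time"] = ...
mask = df["annotation"] == "a"
df.loc[mask, "end_time"] = df.loc[mask, "end_time"] + 0.5
```

Which option applies depends on whether the function is meant to modify the caller's dataframe or return a new one.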
This may be one of those places where we just need to avoid adding rows to a df inside a loop. Instead, we should use a single command to join dfs, or create a df from an array, after the for loop.
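A rough sketch of that pattern, with placeholder values: collect plain Python records inside the loop and build the dataframe once at the end. The clip length, class names, and labeling rule here are hypothetical and only stand in for the real per-clip logic.

```python
import numpy as np
import pandas as pd

clip_duration = 2.0  # hypothetical clip length in seconds
clip_starts = np.arange(0.0, 10.0, clip_duration)

# Collect one dict per clip inside the loop (cheap list appends) ...
rows = []
for start in clip_starts:
    rows.append(
        {
            "start_time": start,
            "end_time": start + clip_duration,
            "a": int(start < 4),  # placeholder rule, not the real label logic
            "b": 0,
        }
    )

# ... and create the DataFrame with a single call after the loop.
clip_df = pd.DataFrame(rows).set_index(["start_time", "end_time"])
```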
That was not the only slow part, turns out. Requires more investigation
The slow part of BoxedAnnotations.one_hot_clip_labels is actually the call to BoxedAnnotations.one_hot_labels_like.
This function is the slow one.
I think the problem is that for each time interval (row) in the desired clip_df, the current workflow evaluates the overlap with every annotation in the annotations dataframe. I.e., if there are 10^2 annotations in the annotations_df and you want to output the 'one-hot' labels for 10^4 time intervals, there are 10^4 * 10^2 calls to the function overlap([time_window_start, time_window_end], [annotation_start, annotation_end]).
A better implementation would be to iterate through the annotations, find out which rows in clip_df each one overlaps, and then only set those rows.
# pseudocode
output_df = make_clip_df(filename, clip_duration)
rows_to_be_changed = []
for annotation in annotations.df:
    overlaps = find_overlapping_time_windows(annotation, output_df)
    rows_to_be_changed.append(overlaps)
for row in rows_to_be_changed:
    output_df.loc[row.index] = row.values
nice find! keep in mind that annotations are for a single class and there can be multiple classes annotated for one clip.
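One possible way to do the per-annotation lookup while keeping one column per class is sketched below. It is not the existing implementation. The function name one_hot_labels_sketch is hypothetical, and the column names (start_time, end_time, annotation) are assumptions. It also assumes the clips are sorted by start time and all have the same duration, so their end times are sorted too, and it counts any nonzero overlap as a positive label.

```python
import numpy as np
import pandas as pd


def one_hot_labels_sketch(annotations, clip_df, classes):
    """Hypothetical helper: one 0/1 column per class, one row per clip."""
    clip_starts = clip_df["start_time"].to_numpy()
    clip_ends = clip_df["end_time"].to_numpy()
    class_index = {c: i for i, c in enumerate(classes)}
    labels = np.zeros((len(clip_df), len(classes)), dtype=int)

    for ann in annotations.itertuples(index=False):
        col = class_index.get(ann.annotation)
        if col is None:
            continue  # annotation for a class we are not labeling
        # First clip whose end is after the annotation start ...
        lo = np.searchsorted(clip_ends, ann.start_time, side="right")
        # ... up to (not including) the first clip starting at/after its end.
        hi = np.searchsorted(clip_starts, ann.end_time, side="left")
        labels[lo:hi, col] = 1  # only the overlapping rows are touched

    # Build the DataFrame once, after the loop.
    return pd.DataFrame(labels, index=clip_df.index, columns=classes)


# Small made-up example: five 2 s clips, three annotations, two classes.
clip_df = pd.DataFrame(
    {"start_time": [0.0, 2.0, 4.0, 6.0, 8.0], "end_time": [2.0, 4.0, 6.0, 8.0, 10.0]}
)
annotations = pd.DataFrame(
    {
        "start_time": [1.0, 3.0, 8.0],
        "end_time": [2.5, 3.5, 9.0],
        "annotation": ["a", "b", "a"],
    }
)
result = one_hot_labels_sketch(annotations, clip_df, classes=["a", "b"])
```

With binary search, each annotation costs roughly log(number of clips) plus the number of rows it actually overlaps, rather than one overlap() call per clip. A minimum-overlap threshold would need an extra check on the first and last clip in each range.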
does this also resolve #412 ?
Pandas gives PerformanceWarning about highly fragmented dataframe