tobac-project / tobac

Tracking and object-based analysis of clouds
BSD 3-Clause "New" or "Revised" License
100 stars 53 forks source link

renumbering of features in the segmentation output dataframe in RC_1.5.0 #296

Closed itinderjot closed 1 year ago

itinderjot commented 1 year ago

It seems like the segmentation output renumbers the features instead of retaining them from the tracking dataframe that I give it as one of the inputs. I've attached the segmentation output dataframe tracking_output_multithreshold.txt segmentation_output_df_threshold_5.txt for a timestep and tracking output for the same timestep.

w-k-jones commented 1 year ago

Thanks for reporting this @itinderjot

Could you post the code that produced this output? I haven't yet been able to reproduce this problem in some quick testing

itinderjot commented 1 year ago

Yes, sure! Here's the function I'm using and the tobac RC_1.5.0 segmentation.py module that I used: perform_3Dsegmentation_on_file.txt segmentation.txt

w-k-jones commented 1 year ago

Great thanks, I've found the problem! The problem is actually in tobac.utils.combine_tobac_feats renumbering the output features, but there's a feature to disable this. If you change the line in your function combine_tobac_feats to: return tobac.utils.combine_tobac_feats(list_of_feats, preserve_old_feat_nums="original_feature") it should preserve the original feature numbers from the tracking dataframe in a new column called "original_feature", even if you only combine a subset of the segmentation steps.

Let me know whether this works, so we can see if this issue is resolved or not 🙂

w-k-jones commented 1 year ago

@freemansw1 the preserve_old_feat_nums was a little tricky to understand, it's possible that we should revisit this in future

JuliaKukulies commented 1 year ago

It would probably make sense to make it the default to preserve the feature numbers as this would be more intuitive and I guess the more common use case. What do you think @w-k-jones @freemansw1 ?

freemansw1 commented 1 year ago

@freemansw1 the preserve_old_feat_nums was a little tricky to understand, it's possible that we should revisit this in future

Agreed.

It would probably make sense to make it the default to preserve the feature numbers as this would be more intuitive and I guess the more common use case. What do you think @w-k-jones @freemansw1 ?

I will push back on this a bit; if you break apart feature detection by time, preserving the feature numbers could (will) result in non-unique feature numbers. I'd be happy with a compromise where the feature numbers are preserved by default, but an error is raised (and output not given) if there is a non-unique feature number.

Honestly, the long-term best practice would probably be to switch segmentation (or at least give the option to switch) from using the feature number to using idx, which along with frame is probably a better identifier (and both are theoretically unique).

JuliaKukulies commented 1 year ago

Ah OK yes that is a good point @freemansw1 - I haven't thought about the uniqueness, although I actually had this issue in my personal code and did not realize it right away!

I just think that the main reason to use this feature is probably when the dataset is too large to run the feature detection and segmentation all in one go. This means it is likely that your segmentation output has the old feature numbers.

Just clarifying that I do think it is definitely good to have the option to turn this on and off. By default I meant to instead of having None (no preserve) as the default parameter, we could have the preservation turned but throw a warning? An error makes less sense to me because then it would not be possible anymore to preserve the old feature numbers. Imagine you run your feature detection and segmentation on monthly data for several years, you definitely get a lot of non-unique numbers.

But yeah making it the default and an option to turn that would mean that the user cannot define the name of the new column name. So I see why you implemented it the way you did and thinking some more about it maybe it is best to leave it and just clarify in the documentation.

And totally agree with your suggestion to use frame and idx in the segmentation in the long term!