Closed by mesarcik 1 year ago
Model | Outlier | F1-Score |
---|---|---|
Resnet18 | oscillating_tile | 0.0229 |
Resnet18 | real_high_noise | 0.4138 |
Resnet18 | electric_fence | 0.0292 |
Resnet18 | data_loss | 0.1576 |
Resnet18 | lightning | 0.1273 |
Resnet18 | strong_radio_emitter | 0.5069 |
Resnet18 | solar_storm | 0.0696 |
Class | Occurrence after subsampling | # Samples |
---|---|---|
oscillating_tile | 0.01 | 4 |
real_high_noise | 0.13 | 52 |
electric_fence | 0.01 | 4 |
data_loss | 0.06 | 24 |
lightning | 0.05 | 20 |
strong_radio_emitter | 0.2 | 80 |
solar_storm | 0.02 | 8 |
normal | 0.52 | 400 |
Class | oscillating_tile | high_noise | fence | data_loss | lightning | radio_source | solar | normal |
---|---|---|---|---|---|---|---|---|
oscillating_tile | - | 2 | 0 | 2 | 0 | 0 | 0 | 0 |
high_noise | 1 | - | 3 | 9 | 14 | 15 | 1 | 150 |
fence | 1 | 7 | - | 0 | 5 | 0 | 0 | 11 |
data_loss | 8 | 0 | 0 | - | 1 | 2 | 0 | 9 |
lightning | 0 | 3 | 1 | 0 | - | 0 | 0 | 0 |
radio_source | 0 | 20 | 0 | 11 | 9 | - | 0 | 23 |
solar | 0 | 0 | 0 | 0 | 1 | 0 | - | 0 |
normal | 89 | 5 | 5 | 2 | 38 | 0 | 0 | - |
Two things are very clear from this confusion matrix: 1) the high-noise-element class is poorly labelled, with the vast majority of its misclassifications coming from the normal class, and 2) RFI in the normal class is being classified as oscillating tile.
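Confusion matrices like the one above can be produced directly with scikit-learn; the labels and predictions below are purely illustrative, not taken from the actual experiment:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical per-sample true labels and model predictions.
y_true = ["normal", "normal", "high_noise", "fence", "normal"]
y_pred = ["high_noise", "normal", "normal", "fence", "high_noise"]
labels = ["normal", "high_noise", "fence"]  # fixes row/column order

cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)  # rows = true class, columns = predicted class
```

Passing `labels=` explicitly keeps the class ordering stable across runs, which makes tables like the one above reproducible.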
Interestingly, as a follow-up experiment I removed the high_noise_element class from the SSL detector and obtained perfect detection!
Additionally, I ran some experiments with the supervised method excluding the high-noise class, and the normal-detection results increased by roughly 30%:
Model | AUPRC | F1 |
---|---|---|
Supervised (no high noise) | 0.9478 | 0.9335 |
SSL (no high noise) | 0.9693 | 0.9236 |
Supervised (high noise) | 0.9479 | 0.9024 |
SSL (high noise) | 0.9447 | 0.9135 |
**Note:** I evaluate with the `!= normal` class; I expected this to give the same result as `== normal_class`, but it does not.
It seems that our unsupervised detector does not do as well as expected.
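The `!= normal` binarisation can be sketched as follows; `f1_score` and `average_precision_score` (which computes AUPRC) are standard scikit-learn, while the labels, scores, and 0.5 threshold are made-up illustrations:

```python
import numpy as np
from sklearn.metrics import f1_score, average_precision_score

# Hypothetical per-sample class labels and anomaly scores.
labels = np.array(["normal", "lightning", "normal", "data_loss", "normal"])
scores = np.array([0.1, 0.9, 0.3, 0.8, 0.2])  # higher = more anomalous

# Binarise: anything that is not "normal" counts as anomalous.
y_true = (labels != "normal").astype(int)
y_pred = (scores > 0.5).astype(int)

print(f1_score(y_true, y_pred))                 # F1 on the anomalous class
print(average_precision_score(y_true, scores))  # AUPRC
```

If `== normal_class` is binarised instead, the positive class flips, so precision and recall (and hence F1) are computed over a different class; the two evaluations only coincide when the metric is symmetric in the classes.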
[x] Refactor training with the new dataloader structure.
Below you can see what happens when we use the unsupervised mask during the multi-class classification problem.
Generally speaking, the unsupervised detector misclassifies around 70 normal samples and 70 anomalous samples (about a 5% error rate).
This means that when we use it to mask the classifier's output, we get a decrease of a few percent in per-class detection.
The question is, does the unsupervised method make the system more useful?
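A minimal sketch of the masking step described above, assuming the unsupervised detector emits a boolean anomaly flag per sample (the function name and example data are hypothetical):

```python
import numpy as np

def mask_predictions(class_preds, detector_flags, normal_label="normal"):
    """Override the multi-class prediction with the normal label wherever
    the unsupervised detector did not flag the sample as anomalous."""
    return np.where(detector_flags, class_preds, normal_label)

class_preds = np.array(["lightning", "data_loss", "solar_storm"])
detector_flags = np.array([True, False, True])  # detector calls sample 1 normal

print(mask_predictions(class_preds, detector_flags))
```

With a ~5% detector error rate, each false negative here silently converts a correctly classified anomaly into `normal`, which is consistent with the few-percent per-class drop noted above.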
High noise element:
- 1st order: high-SNR feature
- 3rd order: low SNR

Strong radio source:
Data loss:
Class | # samples | osc | 1st noise | 3rd noise | 1st data | 3rd data | lightning | galactic | sidelobes | solar | normal |
---|---|---|---|---|---|---|---|---|---|---|---|
oscillating tile | 9 | - | 0 | 3 | 2 | 0 | 0 | 2 | 2 | 0 | 0 |
1st high noise | 24 | 1 | - | 9 | 0 | 1 | 6 | 0 | 2 | 1 | 4 |
3rd high noise | 24 | 1 | 2 | - | 1 | 0 | 8 | 0 | 2 | 1 | 9 |
1st data loss | 49 | 0 | 0 | 0 | - | 49 | 0 | 0 | 0 | 0 | 0 |
3rd data loss | 24 | 0 | 0 | 6 | 4 | - | 0 | 0 | 0 | 0 | 14 |
lightning | 24 | 0 | 12 | 9 | 0 | 0 | - | 0 | 0 | 0 | 3 |
galactic plane | 34 | 0 | 0 | 1 | 1 | 0 | 0 | - | 12 | 0 | 20 |
source in sidelobes | 49 | 4 | 0 | 0 | 0 | 27 | 0 | 8 | - | 0 | 10 |
solar storm | 49 | 0 | 0 | 46 | 0 | 0 | 3 | 0 | 0 | - | 0 |
Model | Normal F1 | Anomalous F1 |
---|---|---|
Supervised | 0.91885 ± 0.01669 | 0.85912 ± 0.01717 |
SSL | 0.93704 ± 0 | 0.88096 ± 0 |
Note: the F1 score has been replaced by the F2 score, since it weights recall more heavily and is therefore more sensitive to missed anomalies.
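The difference between the two metrics is easy to see with scikit-learn's `fbeta_score` (the toy labels below are illustrative):

```python
from sklearn.metrics import fbeta_score

y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0]  # two missed anomalies

# F1 weights precision and recall equally; F2 weights recall twice as
# much, so missed anomalies are penalised more heavily.
f1 = fbeta_score(y_true, y_pred, beta=1)
f2 = fbeta_score(y_true, y_pred, beta=2)
print(f1, f2)  # here precision = 1.0 but recall = 0.5, so F2 < F1
```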
OOD Class name | Supervised | Supervised + SSL | Supervised + SSL (random Resnet weights) | Supervised + dists |
---|---|---|---|---|
oscillating_tile | 0.272 | 0.643 | 0.384 | 0.531 |
first_order_high_noise | 0.58 | 0.417 | 0.434 | 0.49 |
third_order_high_noise | 0.473 | 0.567 | 0.447 | 0.212 |
first_order_data_loss | 0.103 | 0.73 | 0.145 | 0.932 |
third_order_data_loss | 0.105 | 0.236 | 0.127 | 0.235 |
lightning | 0.49 | 0.868 | 0.251 | 0.653 |
galactic_plane | 0.491 | 0.592 | 0.363 | 0.325 |
source_in_sidelobes | 0.724 | 0.699 | 0.634 | 0.417 |
solar_storm | 0.837 | 0.895 | 0.448 | 0.774 |
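One common way to build a distance-based OOD score like the "Supervised + dists" column is to measure each test sample's distance to its nearest training embeddings; this is a generic sketch under that assumption, not the exact method used here, and the feature dimensions and thresholds are made up:

```python
import numpy as np

def knn_ood_score(train_feats, test_feats, k=1):
    """OOD score = mean distance to the k nearest training embeddings;
    larger distances suggest the sample is out-of-distribution."""
    # Pairwise Euclidean distances, shape (n_test, n_train).
    d = np.linalg.norm(test_feats[:, None, :] - train_feats[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, :k].mean(axis=1)

rng = np.random.default_rng(0)
train = rng.normal(0, 1, (100, 8))   # in-distribution embeddings
in_dist = rng.normal(0, 1, (5, 8))   # held-out in-distribution samples
ood = rng.normal(6, 1, (5, 8))       # shifted cluster, acting as OOD

scores = knn_ood_score(train, np.vstack([in_dist, ood]))
print(scores[:5].mean() < scores[5:].mean())  # OOD samples score higher
```

Running the same scoring on embeddings from randomly initialised Resnet weights would be expected to degrade the separation, consistent with the "random Resnet weights" column above.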
Class | # Samples (exclusive) | # Samples (inclusive) |
---|---|---|
oscillating_tile | 50 | 61 |
first_order_high_noise | 65 | 75 |
third_order_high_noise | 111 | 153 |
first_order_data_loss | 168 | 169 |
third_order_data_loss | 209 | 342 |
lightning | 323 | 402 |
galactic_plane | 249 | 590 |
source_in_sidelobes | 165 | 456 |
solar_storm | 147 | 147 |
normal | 7413 | n/a |
third order high noise
Description
We have made progress on dataset creation, self-supervised anomaly detection, and supervised anomaly detection. However, several issues need to be addressed before this work is ready for publication.
Model fine-tuning
Things to try:
Outcome:
Finish results for URSI abstract:
Label the last few examples in the dataset:
Supervised model: