Predict frequency band and polarisation for each patch
Class | AUROC | AUPRC | F1 |
---|---|---|---|
oscillating_tile | 0.8274 | 0.8877 | 0.8777 |
electric_fence | 0.2860 | 0.4553 | 0.6686 |
data_loss | 0.5732 | 0.6357 | 0.6985 |
lightning | 0.3430 | 0.4748 | 0.6447 |
strong_radio_emitter | 0.6264 | 0.2634 | 0.3708 |
1) Trying to determine the effects of multi-class/single-class detection
2) The differences in performance between representations learnt from a ResNet and a VAE, with and without training
3) The effect of the evaluation metric on performance measurement
`reconstruct_distances` is working correctly.

Anomaly Type | AUROC | AUPRC | F1 |
---|---|---|---|
oscillating_tile | 0.1715 | 0.6866 | 0.8780 |
electric_fence | 0.8257 | 0.9395 | 0.8852 |
data_loss | 0.4460 | 0.6216 | 0.7984 |
lightning | 0.5778 | 0.6737 | 0.8147 |
strong_radio_emitter | 0.3396 | 0.5647 | 0.7899 |
solar_storm | 0.9457 | 0.9018 | 0.9800 |
Make sure `intergrate()` is reassembling all patches correctly (a minimal reassembly sketch follows the table below).

Anomaly | Pixel-mean | VAE-dist | Res-dist | Freq-dev | Supervised | Location prediction (patch size 64x64) |
---|---|---|---|---|---|---|
oscillating_tile | 0.1891 | 0.5793 | 0.7972 | 0.7296 | 0.4669 | 0.71428 |
electric_fence | 0.2914 | 0.2988 | 0.3799 | 0.3647 | 0.4669 | - |
data_loss | 0.4867 | 0.6491 | 0.6335 | 0.7034 | 0.4669 | 0.6691 |
lightning | 0.6714 | 0.6515 | 0.6595 | 0.6568 | 0.4816 | 0.7062 |
strong_radio_emitter | 0.7112 | 0.7689 | 0.8151 | 0.7647 | 0.9252 | 0.8575 |
solar_storm | 0.7989 | 0.9355 | 0.5860 | 0.9295 | 0.5498 | 0.8211 |
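For reference, here is a minimal sketch of what reassembling non-overlapping patches back into a full spectrogram could look like; this is my own illustration (the function name and array shapes are assumptions), not the actual `intergrate()` implementation:

```python
import numpy as np

def reassemble_patches(patches: np.ndarray, n_rows: int, n_cols: int) -> np.ndarray:
    """Stitch non-overlapping patches back into the original spectrogram.

    patches is assumed to have shape (n_rows * n_cols, p, p), ordered
    row-major, i.e. in the same order they were extracted.
    """
    _, p, _ = patches.shape
    out = np.zeros((n_rows * p, n_cols * p), dtype=patches.dtype)
    for idx, patch in enumerate(patches):
        r, c = divmod(idx, n_cols)
        out[r * p:(r + 1) * p, c * p:(c + 1) * p] = patch
    return out

# Sanity check: splitting and reassembling should round-trip exactly.
spec = np.random.rand(256, 256)
p = 64
patches = np.array([spec[i:i + p, j:j + p]
                    for i in range(0, 256, p) for j in range(0, 256, p)])
assert np.allclose(reassemble_patches(patches, 256 // p, 256 // p), spec)
```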
The radio spectrograms are quite different from natural images.
For small patch sizes I think it's pretty hard to detect neighbours.
Below is an example for a patch size of 64.
Here is the same for a patch size of 32.
It's quite clear that for these examples it is very hard to determine which neighbour matches each query.
I have now increased the patch size and it seems that the position prediction is now working.
I now need to change the frequency encoder to work for varying patch sizes.
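For context, a minimal sketch of the neighbour-position pretext task described above; the `PositionClassifier` name, the 8-way relative-position labelling, and the PyTorch framing are all assumptions on my part, not the actual implementation:

```python
import torch
import torch.nn as nn

class PositionClassifier(nn.Module):
    """Predict the relative position (one of 8 neighbours) of a patch pair."""
    def __init__(self, backbone: nn.Module, embed_dim: int, n_positions: int = 8):
        super().__init__()
        self.backbone = backbone              # shared encoder for both patches
        self.head = nn.Sequential(
            nn.Linear(2 * embed_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_positions),      # 8-way relative-position logits
        )

    def forward(self, query, neighbour):
        z_q = self.backbone(query)            # (B, embed_dim)
        z_n = self.backbone(neighbour)        # (B, embed_dim)
        return self.head(torch.cat([z_q, z_n], dim=1))

# Toy usage: a flattening encoder stands in for the real backbone.
patch_size, embed_dim = 64, 32
backbone = nn.Sequential(nn.Flatten(), nn.Linear(patch_size * patch_size, embed_dim))
model = PositionClassifier(backbone, embed_dim)
query = torch.randn(4, 1, patch_size, patch_size)   # fake spectrogram patches
neigh = torch.randn(4, 1, patch_size, patch_size)
logits = model(query, neigh)                         # (4, 8)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 8, (4,)))
```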
Anomaly | Location prediction (original training data) | Location prediction (modified training data) | VAE |
---|---|---|---|
oscillating_tile | 0.71428 | 0.91612 | 0.90123 |
electric_fence | 0.2885 | 0.90196 | 0.8956 |
data_loss | 0.6763 | 0.95384 | 0.953846 |
lightning | 0.71225 | 0.976870 | 0.97297 |
strong_radio_emitter | 0.83727 | 0.98795 | 0.987991 |
solar_storm | 0.9006 | 1.0 | 0.97350 |
There are a few possible reasons for this:

1) The training set and testing set of non-anomalous data are vastly different (badly labelled)
2) We are overfitting to the training data, so that unseen, non-anomalous data is incorrectly represented
A note on the difference in performance between the VAE and the location predictor
Anomaly | Location prediction | VAE |
---|---|---|
oscillating_tile | 0.9006 | 0.7515 |
electric_fence | 0.5664 | 0.4554 |
data_loss | 0.8761 | 0.78758 |
lightning | 0.8539 | 0.77378 |
strong_radio_emitter | 0.9449 | 0.8861 |
solar_storm | 1.0 | 0.92358 |
Anomaly | Untrained | Trained |
---|---|---|
oscillating_tile | 0.791367 | 0.698795 |
data_loss | 0.672432 | 0.714545 |
lightning | 0.673568 | 0.695421 |
strong_radio_emitter | 0.838039 | 0.844560 |
solar_storm | 0.262452 | 0.860681 |
Parameter | Description |
---|---|
c | clip amount of the training data |
p | patch size |
lambda | regularisation scaling factor |
l | embedding dimension |
m | number of layers of MLP |
j | jitter amount |
r | roll amount |
M | backbone network (resnet15, resnet50, ViT) |
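As an illustration, a configuration along the lines of the table above could be expressed as follows; the argument names and default values are placeholders, not the actual settings:

```python
import argparse

# Hypothetical hyperparameter definitions mirroring the table above;
# the defaults are illustrative placeholders only.
parser = argparse.ArgumentParser(description="SSL anomaly-detection training")
parser.add_argument("--clip", "-c", type=float, default=1.0, help="clip amount of the training data")
parser.add_argument("--patch_size", "-p", type=int, default=64, help="patch size")
parser.add_argument("--lambda_reg", type=float, default=0.1, help="regularisation scaling factor")
parser.add_argument("--latent_dim", "-l", type=int, default=128, help="embedding dimension")
parser.add_argument("--mlp_layers", "-m", type=int, default=2, help="number of layers of the MLP")
parser.add_argument("--jitter", "-j", type=int, default=4, help="jitter amount")
parser.add_argument("--roll", "-r", type=int, default=8, help="roll amount")
parser.add_argument("--backbone", "-M", choices=["resnet15", "resnet50", "vit"],
                    default="resnet50", help="backbone network")
args = parser.parse_args()
```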
The oscillating tile class is bringing the overall average down; if we remove it, then it appears that the randomly initialised model performs slightly better in some cases 🤔
Note: the KNN-based result is evaluated on all the test data, whereas the fine-tuned models are evaluated on the remaining test data (a KNN evaluation sketch follows the table below).
Class | SSL + KNN | SSL + Fine-tuning | Random init + Fine-tuning |
---|---|---|---|
oscillating_tile | 0.7596 | 0.5454 | 0.4687 |
data_loss | 0.6100 | 0.77419 | 0.76546 |
lightning | 0.8022 | 0.8316 | 0.6655 |
strong_radio_emitter | 0.8414 | 0.8210 | 0.8283 |
solar_storm | 0.78287 | 0.9897 | 0.9847 |
mean | 0.7606 | 0.7964 | 0.7464 |
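For context, a minimal sketch of the KNN-on-embeddings evaluation referred to in the note above; the value of k and the function names are assumptions, not the actual setup:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import roc_auc_score

def knn_anomaly_scores(train_emb: np.ndarray, test_emb: np.ndarray, k: int = 5) -> np.ndarray:
    """Anomaly score = mean distance to the k nearest normal training embeddings."""
    nbrs = NearestNeighbors(n_neighbors=k).fit(train_emb)
    dists, _ = nbrs.kneighbors(test_emb)
    return dists.mean(axis=1)

# Toy usage with random embeddings standing in for the SSL backbone output.
rng = np.random.default_rng(0)
train_emb = rng.normal(size=(500, 128))       # embeddings of normal training data
test_emb = rng.normal(size=(200, 128))        # embeddings of all test data
test_labels = rng.integers(0, 2, size=200)    # 1 = anomalous, 0 = normal
auroc = roc_auc_score(test_labels, knn_anomaly_scores(train_emb, test_emb))
```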
Class | SSL + KNN | SSL + Fine-tuning | Random init + Fine-tuning |
---|---|---|---|
oscillating_tile | 0.7826 | | |
data_loss | 0.7104 | | |
lightning | 0.8089 | | |
strong_radio_emitter | 0.9133 | | |
solar_storm | 0.9898 | | |
mean | 0.8410 | | |
Class | # Misclassified samples |
---|---|
oscillating_tile | 2 |
data_loss | 91 |
lightning | 54 |
strong_radio_emitter | 13 |
solar_storm | 0 |
Modify the `data.py` file so that the test `data_loader` is able to hold all anomalous classes, and make the `eval.py` class evaluate each class separately as well as the multi-class detection scenario; at the moment `eval.py` does not load each anomalous class separately.
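A minimal sketch of what that per-class evaluation could look like; the class list, the `scores`/`labels` arrays, and the choice of AUROC are assumptions rather than the actual `eval.py` interface:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

ANOMALY_CLASSES = ["oscillating_tile", "data_loss", "lightning",
                   "strong_radio_emitter", "solar_storm"]

def evaluate(scores: np.ndarray, labels: np.ndarray) -> dict:
    """Per-class AUROC plus a multi-class (any-anomaly) score.

    scores : anomaly score per test sample (higher = more anomalous)
    labels : string label per test sample, "normal" for non-anomalous data
    """
    results = {}
    normal_mask = labels == "normal"
    for cls in ANOMALY_CLASSES:
        cls_mask = labels == cls
        if cls_mask.sum() == 0:
            continue  # class absent from this test split
        # Evaluate each anomaly class against the normal data only.
        mask = normal_mask | cls_mask
        results[cls] = roc_auc_score(cls_mask[mask].astype(int), scores[mask])
    # Multi-class detection: every anomalous class counts as positive.
    results["multi_class"] = roc_auc_score((~normal_mask).astype(int), scores)
    return results
```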
I saw a trend after cleaning about 1000 samples from the normal data class: the performance of our models dropped.
In order to validate this, I ran an experiment to check what the performance difference is when using less training data (shown below).
It is clear that less training data results in lower performance.
Furthermore, if we look at the convergence of the position-based loss of these models, they are not converging on the new iteration of the dataset (the previous validation loss was 0.95; it is currently between 0.75 and 0.8).
[ ] What about Augmentation?
Class | # Samples | % Contamination |
---|---|---|
training data (normal) | 2533 | - |
test data (normal) | 800 | - |
Total data | ~6500 | - |
data_loss | 413 | 6% |
electric_fence | 62 | 1% |
lightning | 327 | 5% |
oscillating_tile | 57 | 1% |
real_high_noise | 869 | 13% |
solar_storm | 147 | 2% |
strong_radio_emitter | 1334 | 20% |
Name | # Samples |
---|---|
1 Subband observations | ~2000 |
Unknown data (not normal, but no characterised anomaly) | ~1500 |
Normal | ~3500 |
Anomalies | ~3000 |
Unlabelled | ~2000 |
I think we need to subsample the anomalous classes to obtain reasonable anomaly-normal splits.
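A minimal sketch of how that subsampling could look; the contamination cap and the DataFrame layout are assumptions:

```python
import pandas as pd

def subsample_anomalies(df: pd.DataFrame, max_contamination: float = 0.05,
                        seed: int = 42) -> pd.DataFrame:
    """Cap each anomalous class at a fraction of the normal data.

    df is assumed to have a 'label' column where normal samples are
    labelled 'normal'.
    """
    normal = df[df.label == "normal"]
    budget = int(max_contamination * len(normal))  # per-class sample budget
    parts = [normal]
    for cls, group in df[df.label != "normal"].groupby("label"):
        parts.append(group.sample(n=min(len(group), budget), random_state=seed))
    return pd.concat(parts).sample(frac=1, random_state=seed)  # shuffle rows
```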
There is a clear trend in
Dataset information
Dataset overview
1) Note: these values are changing as more corrections are made to each class.
2) Note: the class labels are not necessarily correct; for example, what I labelled as strong radio emitter is an A-team source in the sidelobes.
Labelling interface/the way things are plotted
I make use of Jorrit's code to produce the "Adder Plots"
We use only the magnitude spectrum, but for all 4 polarisations.
Each plot is normalised based on the 1st and 99th percentiles across a given SAP.
To improve the dynamic range, a 3rd-degree polynomial is fit to each time slice (1 subband) and the data is then divided through by the polynomial to reduce the dynamic range issues.
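A minimal numpy sketch of that normalisation, assuming the spectrogram has shape (time, frequency) and the polynomial is fit along the frequency axis of each time slice (both assumptions on my part):

```python
import numpy as np

def normalise_spectrogram(spec: np.ndarray) -> np.ndarray:
    """Percentile clipping plus per-time-slice polynomial detrending.

    spec is assumed to have shape (time, frequency), magnitude values only.
    """
    # Clip to the 1st and 99th percentiles (computed over the whole plot here;
    # in the labelling code this is done across a given SAP).
    lo, hi = np.percentile(spec, [1, 99])
    spec = np.clip(spec, lo, hi)

    # Fit a 3rd-degree polynomial to each time slice and divide it out
    # to flatten the bandpass and reduce the dynamic range.
    x = np.arange(spec.shape[1])
    out = np.empty_like(spec, dtype=float)
    for t in range(spec.shape[0]):
        coeffs = np.polyfit(x, spec[t], deg=3)
        trend = np.polyval(coeffs, x)
        out[t] = spec[t] / np.maximum(trend, 1e-12)  # guard against division by zero
    return out
```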
NOTE: the code was written in Python 2.7 and there is a bug in it such that the station numbers do not correspond to the correct baselines.
Example:
The domain that we labelled in (with polynomial normalisation)
Unprocessed data
Potential directions
Option 1:
Option 2:
Option 3: