packing-box / docker-packing-box

Docker image gathering packers and tools for making datasets of packed executables and training machine learning models for packing detection

GNU General Public License v3.0

Implement KMeans visualization #29

Closed by smarbal 1 year ago

smarbal commented 1 year ago

Implemented KMeans visualization, which produces a scatter plot (see 6b82502b2a78723970a6197d5389f27c4259f214). It isn't optimal, since it recomputes the features for all the data even though this was already done during the training phase, but I couldn't find another way to do it.

This pull request also contains small fixes following packing-box/docker-packing-box#28. Some fixes still need to be made for that issue. The image I produced used my temporary fix and didn't take the last commit into account.

smarbal commented 1 year ago

@dhondta I'll look into it, but I was getting an error at L43 because it couldn't return anything. I'll provide the traceback soon.

dhondta commented 1 year ago

@smarbal this is not the point! I forgot to add the line that sets the protected attribute; it should be line 49:

        self._source = p
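
For readers unfamiliar with the pattern, here is a minimal sketch of why that line matters, assuming the attribute is exposed through a property. The class name `Dataset` and the use of `pathlib.Path` are hypothetical stand-ins, not the project's actual code.

```python
from pathlib import Path


class Dataset:
    """Hypothetical stand-in for the class under discussion."""

    def __init__(self, source=None):
        self._source = None           # protected attribute backing the property
        if source is not None:
            self.source = source      # goes through the setter below

    @property
    def source(self):
        # Without the assignment in the setter, this would always return None
        return self._source

    @source.setter
    def source(self, value):
        p = Path(value)
        # Store on the protected attribute; assigning to self.source here
        # would call this setter again and recurse forever
        self._source = p


d = Dataset("experiments/dataset")
print(d.source)
```

If the setter never stores the value, the getter has nothing to return, which would be consistent with the error reported at L43.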

smarbal commented 1 year ago

@dhondta The change has been reversed.

smarbal commented 1 year ago

I've also added self.source = p on line 49 and can push the commit if you want. It solves issue #28.

dhondta commented 1 year ago

@smarbal I just noticed something:

    X = PCA(20).fit_transform(X)
    X = TSNE(n_components=2).fit_transform(X)

Why do you use a PCA with 20 components followed by a t-SNE with 2 dimensions? Why not use a PCA with 2 components directly?

smarbal commented 1 year ago

@dhondta I was getting mixed results on my first tries, both with a PCA with 2 components directly and with a t-SNE with 2 components directly. t-SNE is often recommended over PCA for visualization, but it wasn't performing well here, probably because of the high dimensionality of the data. I found multiple sources that therefore first use PCA as a pre-processing step to reduce the dimensionality and then apply t-SNE for visualization.
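
The two-stage reduction described above can be sketched as follows. This is only an illustration on synthetic data: `reduce_for_plot`, `plot_clusters`, the blob dataset and the output filename are assumptions, not the code in lib.learning.visualization.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def reduce_for_plot(X, pca_components=20, random_state=0):
    """Compress with PCA first, then embed in 2-D with t-SNE."""
    # PCA cannot keep more components than min(n_samples, n_features)
    n = min(pca_components, X.shape[0], X.shape[1])
    X = PCA(n_components=n, random_state=random_state).fit_transform(X)
    return TSNE(n_components=2, random_state=random_state).fit_transform(X)


def plot_clusters(X_2d, labels, path="kmeans.png"):
    """Save a scatter plot coloured by cluster label (hypothetical helper)."""
    import matplotlib
    matplotlib.use("Agg")             # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, s=10)
    ax.set_title("KMeans clusters (PCA then t-SNE)")
    fig.savefig(path)
    plt.close(fig)


# Synthetic stand-in for the real feature matrix: 150 samples, 50 features
X, _ = make_blobs(n_samples=150, n_features=50, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
X_2d = reduce_for_plot(X)
```

Calling `plot_clusters(X_2d, labels)` then writes the scatter plot. Note that the clustering is fitted on the full feature matrix, while the reduction is used only for display.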

A configuration file, or options in the visualization tool, to fine-tune these parameters would probably be useful.
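
As a rough idea of what such options could look like, here is a sketch using `argparse`. The option names (`--pca-components`, `--n-components`, `--perplexity`) and their defaults are hypothetical and do not reflect the tool's actual interface.

```python
import argparse


def build_parser():
    """Command-line options for the dimensionality reduction (all hypothetical)."""
    parser = argparse.ArgumentParser(description="KMeans visualization options")
    parser.add_argument("--pca-components", type=int, default=20,
                        help="components kept by the PCA pre-processing step "
                             "(0 skips PCA)")
    parser.add_argument("--n-components", type=int, default=2, choices=(2, 3),
                        help="dimensionality of the final t-SNE embedding")
    parser.add_argument("--perplexity", type=float, default=30.0,
                        help="t-SNE perplexity; must be lower than the "
                             "number of samples")
    return parser


def reduction_kwargs(args):
    """Split the parsed options into keyword arguments per reduction step."""
    return {
        "pca": {"n_components": args.pca_components},
        "tsne": {"n_components": args.n_components,
                 "perplexity": args.perplexity},
    }


# Parse an explicit argument list so the example does not depend on sys.argv
args = build_parser().parse_args(["--pca-components", "10", "--perplexity", "15"])
kwargs = reduction_kwargs(args)
print(kwargs)
```

The same dictionary could just as well be loaded from a configuration file; the command-line form is shown only because it needs nothing beyond the standard library.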

dhondta commented 1 year ago

@smarbal I refactored lib.learning.visualization to parametrize some options of the dimensionality reduction. Can you retest and tell me if it works as intended?

smarbal commented 1 year ago

@dhondta It does seem to work as intended. Note that the resulting image was previously saved in /experiments, whereas it now gets saved in the root directory (directly in /mnt/share).