onclave / NSGA-II

An implementation of NSGA-II in Java
MIT License

Where do i add the dataset in the code #7

Closed RJoshlan closed 3 years ago

RJoshlan commented 3 years ago

Where do I add the dataset in the code? It's a DDoS attack dataset used for feature selection. I also get an "Exception in thread "main" java.lang.NoClassDefFoundError: org/jfree/data/xy/XYDataset" even though I have added the jfree and jcommon jars to the project.

onclave commented 3 years ago

For the jFree error, refer to issue #8.

onclave commented 3 years ago

I shall provide detailed documentation on how to use external datasets with this library shortly. I am working on it.

RJoshlan commented 3 years ago

thanks

onclave commented 3 years ago

Hello @RJoshlan, refer to the documentation here under the Getting Started section to understand how you can use your own custom datasets with the library. Let me know if you face any issues.

RJoshlan commented 3 years ago

I am using this dataset (CICIDS2017), and I am not sure how to read it using GeneticCodeProducer. I tried, but the code doesn't even compile. Can you help me with this? I don't know where I am going wrong. I also tried permutation-based encoding, and it still doesn't work.

```java
public static GeneticCodeProducer geneticCodeProducerFromDataset(String path) {
    return (length) -> {
        List<BooleanAllele> geneticCode = new ArrayList<>();

        try {
            DataSource dataSource = new DataSource(path);
            // Loading the dataset
            Instances getData = dataSource.getDataSet();
            //length = getData.numAttributes();
            String geneFormat = "%0" + calculateGeneSize(path) + "d";
            length = getData.size();

            while (geneticCode.size() < length) {
                int data = ThreadLocalRandom.current().nextInt(1, getData.size());

                String gene = String.format(geneFormat, returnBinaryValueFromInt(getData.get(data).numAttributes()));

                for (char alleleChar : gene.toCharArray()) {
                    geneticCode.add(new BooleanAllele(returnBooleanValueFromChar(alleleChar)));
                }
            }
        } catch (Exception e1) {
            e1.printStackTrace();
        }

        return geneticCode;
    };
}
```

onclave commented 3 years ago

@RJoshlan I shall need more information about your work before I can help you. Firstly, provide me with the dataset you are working with so that I can take a look into it. Next, give me a very brief idea about how you want to encode your chromosomes with your dataset. Third, let me know what kind of encoding you want to use with your chromosomes.

I see you are using BooleanAllele, so I assume you have tried binary encoding. Do keep in mind that with binary encoding you usually do not encode your dataset directly into the chromosome. Instead, you keep a reference to it.

RJoshlan commented 3 years ago

@onclave Thanks for replying.

This is the dataset that I am using. It has 81 variables and 25000 instances. (The original has around 200,000 instances and 81 variables, but I've uploaded a smaller one because of file size.) Dataset.zip

The chromosome encoding that I am trying to achieve depends directly on the dataset, e.g. producing values which represent the dataset variable indices from 1 to 80, where the chromosome length might be, for example, 6 alleles chosen from the 80 variables.

Also, you were right that I used binary encoding, but I am getting the logic wrong when trying to keep the dataset as a reference for the chromosomes, which is why I contacted you for help.

Thanks.

onclave commented 3 years ago

If I may make a guess, you basically have a 2D dataset with 81 columns (attributes) and 25k rows (samples). You would probably want to create a population out of this. Since I don't know what your work is and what you are trying to achieve, I shall take an example problem out of it and explain how to solve that using this library and then you can use that knowledge to see how that fits to your problem set.

Problem: Let's say, considering samples, you want to do feature selection among the 81 attributes trying to select 5 marker attributes.

Solution:

Each of your chromosomes shall be binary encoded with length 81. In the beginning, randomly generate a population of N chromosomes. The genetic code of each chromosome represents a candidate solution. An index with allele value 1 is considered a selected attribute, and an index with allele value 0 is considered not selected. This is how you keep a reference to your dataset with the chromosome.
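As a rough illustration of the encoding described above, here is a minimal plain-Java sketch. It deliberately uses a `boolean[]` as a stand-in for a chromosome rather than the library's own classes, and the names `BinaryPopulationSketch`, `randomChromosome` and `randomPopulation` are hypothetical, chosen only for this example.

```java
import java.util.Random;

/** Plain-Java stand-in for a binary-encoded population (not the library's own classes). */
final class BinaryPopulationSketch {

    /** Creates one chromosome: a mask with one position per dataset attribute. */
    static boolean[] randomChromosome(int attributeCount, Random rng) {
        boolean[] mask = new boolean[attributeCount];
        for (int i = 0; i < attributeCount; i++) {
            mask[i] = rng.nextBoolean(); // true = attribute selected, false = not selected
        }
        return mask;
    }

    /** Creates N random chromosomes, each one a candidate attribute subset. */
    static boolean[][] randomPopulation(int populationSize, int attributeCount, Random rng) {
        boolean[][] population = new boolean[populationSize][];
        for (int i = 0; i < populationSize; i++) {
            population[i] = randomChromosome(attributeCount, rng);
        }
        return population;
    }

    public static void main(String[] args) {
        // 81 attributes as in the dataset discussed here; the population size of 100 is an arbitrary choice.
        boolean[][] population = randomPopulation(100, 81, new Random());
        System.out.println("chromosomes: " + population.length + ", length: " + population[0].length);
    }
}
```

Note that the dataset itself never appears in this code: the mask only records which column indices are switched on.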

Prepare your own objective functions against your dataset. They can be maximization problems or minimization problems. This library considers all objective functions to be maximization problems. Hence, for any minimization problem, take its inverse.
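One way to turn a minimization objective into a maximization one is sketched below. The thread does not say whether "inverse" means negation or reciprocal, so the sketch assumes negation, which also works when the cost can be zero. The class and method names are hypothetical, and `countSelected` is only a toy cost function.

```java
import java.util.function.ToDoubleFunction;

/** Sketch: adapting a cost (lower is better) to a maximization setting by negating it. */
final class MinimizationAdapterSketch {

    /** Wraps a cost function so that a lower cost yields a higher value. */
    static <T> ToDoubleFunction<T> asMaximization(ToDoubleFunction<T> cost) {
        return candidate -> -cost.applyAsDouble(candidate);
    }

    /** Toy cost: the number of selected attributes (fewer is better). */
    static double countSelected(boolean[] mask) {
        int count = 0;
        for (boolean selected : mask) {
            if (selected) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        ToDoubleFunction<boolean[]> fitness = asMaximization(MinimizationAdapterSketch::countSelected);
        System.out.println(fitness.applyAsDouble(new boolean[] {true, false, true}));
    }
}
```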

For each chromosome, based on its genetic code, prepare a subset of your dataset containing only those attributes whose allele is "1". Again, this is how you keep a reference to your dataset within your NSGA-II code. NSGA-II will run the objective functions for you, and the objective functions will work with your dataset to provide objective values, or "fitness", for your chromosomes. NSGA-II will then use these values for each chromosome to perform non-dominated sorting, rank assignment and crowding-distance assignment. After G generations, NSGA-II will return the Pareto front.
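Non-dominated sorting rests on the Pareto dominance test. The sketch below shows that test for maximization objectives in plain Java. It is an illustration of the general definition, not the library's internal code, and `DominanceSketch` is a hypothetical name.

```java
/** Sketch of the Pareto dominance test for maximization objectives. */
final class DominanceSketch {

    /**
     * Returns true if a dominates b: a is at least as good as b in every
     * objective and strictly better in at least one.
     */
    static boolean dominates(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("objective counts differ");
        }
        boolean strictlyBetter = false;
        for (int i = 0; i < a.length; i++) {
            if (a[i] < b[i]) return false;      // worse in one objective: no dominance
            if (a[i] > b[i]) strictlyBetter = true;
        }
        return strictlyBetter;
    }

    public static void main(String[] args) {
        System.out.println(dominates(new double[] {2, 3}, new double[] {1, 3}));
        System.out.println(dominates(new double[] {2, 1}, new double[] {1, 3}));
    }
}
```

Chromosomes that no other chromosome dominates form the first front.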

All this will be managed by NSGA-II and you do not have to actually change any code within the library. All you have to do is to write your own objective function and provide it to NSGA-II. You usually do not need to directly feed your dataset to the GeneticCodeProducer.

Each objective function takes a chromosome. So, given a chromosome, you write your own logic for how its genetic code is used to prepare a subset of your original dataset, and for which operations to perform on that subset in order to return a double value.
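The shape of such an objective can be sketched in plain Java as follows. The dataset is assumed to be a `double[][]` already loaded in memory, the chromosome is represented by a `boolean[]` mask instead of the library's chromosome class, and `meanVariance` is only a placeholder score. All names here are hypothetical, and wiring this into the library's objective function type is left out.

```java
/** Sketch: scoring a chromosome by the dataset columns it selects. */
final class SubsetObjectiveSketch {

    /** Keeps only the columns whose mask position is true (mask length must match the row width). */
    static double[][] project(double[][] data, boolean[] mask) {
        int kept = 0;
        for (boolean selected : mask) {
            if (selected) kept++;
        }
        double[][] subset = new double[data.length][kept];
        for (int r = 0; r < data.length; r++) {
            int c = 0;
            for (int a = 0; a < mask.length; a++) {
                if (mask[a]) subset[r][c++] = data[r][a];
            }
        }
        return subset;
    }

    /** Placeholder objective: mean per-column variance of the selected attributes (0 if none). */
    static double meanVariance(double[][] data, boolean[] mask) {
        double[][] subset = project(data, mask);
        if (subset.length == 0 || subset[0].length == 0) return 0.0;
        int cols = subset[0].length;
        double total = 0.0;
        for (int c = 0; c < cols; c++) {
            double mean = 0.0;
            for (double[] row : subset) mean += row[c];
            mean /= subset.length;
            double variance = 0.0;
            for (double[] row : subset) variance += (row[c] - mean) * (row[c] - mean);
            total += variance / subset.length;
        }
        return total / cols;
    }

    public static void main(String[] args) {
        double[][] data = {{1, 10, 5}, {3, 10, 5}};
        System.out.println(meanVariance(data, new boolean[] {true, false, true}));
    }
}
```

In a real feature-selection setup the placeholder score would be replaced by something meaningful for the task, such as classifier accuracy on the projected data.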

Once you have your Pareto Front, you use your own logic to select one chromosome as your final solution. This is not part of the NSGA-II package.
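How to pick a single solution is up to you. One simple possibility, sketched below, is to score each member of the front by a weighted sum of its objective values and take the best one. The weights and the names `FrontSelectionSketch` and `pickByWeightedSum` are assumptions made for this example, and the objective values are assumed to be on comparable scales.

```java
/** Sketch: choosing one member of a Pareto front by a weighted sum of objectives. */
final class FrontSelectionSketch {

    /** Returns the index of the front member with the largest weighted sum. */
    static int pickByWeightedSum(double[][] frontObjectives, double[] weights) {
        if (frontObjectives.length == 0) {
            throw new IllegalArgumentException("empty front");
        }
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < frontObjectives.length; i++) {
            double score = 0.0;
            for (int j = 0; j < weights.length; j++) {
                score += weights[j] * frontObjectives[i][j];
            }
            if (score > bestScore) { // ties keep the earlier member
                bestScore = score;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] front = {{0.9, 0.1}, {0.6, 0.6}, {0.1, 0.9}};
        System.out.println(pickByWeightedSum(front, new double[] {0.5, 0.5}));
    }
}
```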

Once you have your selected solution, you use your own logic to select 5 best markers as your resultant biomarkers. This is not part of the NSGA-II package.
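If the chosen chromosome selects more than 5 attributes, one possible way to narrow it down is to rank the selected attributes by a per-attribute score and keep the top 5. The sketch below assumes such scores already exist (for example, computed separately by some relevance measure). The `scores` array and the name `TopMarkersSketch` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Sketch: picking the k best-scoring attributes among those a chromosome selects. */
final class TopMarkersSketch {

    /** Returns up to k selected attribute indices, ordered by descending score. */
    static List<Integer> topSelected(boolean[] mask, double[] scores, int k) {
        List<Integer> selected = new ArrayList<>();
        for (int i = 0; i < mask.length; i++) {
            if (mask[i]) selected.add(i);
        }
        // Highest score first; the sort is stable, so ties keep index order.
        selected.sort(Comparator.comparingDouble((Integer i) -> scores[i]).reversed());
        return new ArrayList<>(selected.subList(0, Math.min(k, selected.size())));
    }

    public static void main(String[] args) {
        boolean[] mask = {true, false, true, true};
        double[] scores = {0.2, 0.9, 0.8, 0.5};
        System.out.println(topSelected(mask, scores, 2));
    }
}
```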

I hope this is explanatory enough to understand how to use your own dataset and work with this package.

RJoshlan commented 3 years ago

Thanks, it is clearer now. I do have the objective functions, but I was getting it wrong when trying to encode using the dataset.

Just a point - you mentioned

"You usually do not need to directly feed your dataset to the GeneticCodeProducer".

What do you mean by this?

onclave commented 3 years ago

@RJoshlan It means that when you use binary encoding on your chromosomes, you rarely put your dataset values directly into a chromosome's genetic code. In the example above, the data from the dataset was never fed directly into the NSGA-II algorithm. Instead, a reference was kept, and the only place where the dataset interfaced with the algorithm was at the objective function level.

onclave commented 3 years ago

I'm closing this issue since I hope I was able to resolve your issue. Reopen it if you have more queries.

RJoshlan commented 3 years ago

@onclave Thanks for your help.