Parallel fuzzing, etc. - Githubissues

mariadb-RoelVandePaar commented 2 years ago

I have a working fuzzing instance on MariaDB. When I try and parallel fuzz, it does not work as each slave gets it's own SHM_ENV_VARwhereas the started database server offcourse only has one attached __AFL_SHM_ID. How can one parallel fuzz? is there a way to use the single instrumented server with multiple AFL instancs executing queries via shared memory? Note that for the database server this should be perfectly fine, i.e. Squirrel can be modified CREATE DATABASE a1000;, CREATE DATABASE b1000; etc - i.e. the server can easily execute multiple queries at the same time against different databases with the same number but differentiated another way, in this example with a, b etc. for different parallel afl-fuzz fuzzer instances. AFLplusplus supports parallel fuzzing, but it seems Squirrel does not?

Changochen commented 2 years ago

Hi Roel, sorry for the late reply. And thanks for being interested in using Squirrel in MariaDB. Squirrel didn't take parallelism into design since there were not many good SQL fuzzers at that time so we just wanted to create a useful and practical one. It is far from perfect, but it is effective :P.

You are right, there's nothing stopping Squirrel from being parallel. However, we currently don't have plans for supporting parallel fuzzing. Both of the maintainers (me and @zr950624) are closed to graduation year and we need to focus on that. However, afl or aflpp style parallel fuzzing is not special but mostly about syncing their corpus among the fuzzers. So instead of modifying the architecture, maybe we can still spawn a bunch of servers, but create a shared folders for the fuzzers' output, and use afl-like corpus synchronization so that the fuzzing progress can be shared.

Hope this helps!

mariadb-RoelVandePaar commented 2 years ago

Hi @Changochen!

" it is effective " - Indeed it is :P - Thank you for making a great tool!

The answer clarifies/helps, thank you. Hope your graduations will be Blessed and go super well.

While I have you, I have two other questions and one FYI.

Question1: In the output window, we see a status like this: bit flips: infty/10.3k, 0/0, 0/0

And it is the only counter which increases.

I assume that this counter (while incorrectly named 'bit flips' in the GUI output) actually represents (and internally uses) your modified (IR-based) algo - correct?

Question 2: How hard would it be to modify s3team/Polyglot for SQL? Could you give me a few starting pointers?

Question 3: How to best vary the language files further and modify/update the files?

In https://github.com/s3team/Squirrel/tree/master/MariaDB/docker/init_lib we find strip_mysql and strip_trigger_collect.

One directory higher, we also find global_data_lib_mysql and safe_generate_type_mysql.

How are these files generated, and how could one generate additional ones? What special function do these serve?

FYI: Full scale automation framework around squirrel as part of mariadb-qa. Ref. https://github.com/mariadb-corporation/mariadb-qa/tree/master/squirrel (modified clone) https://github.com/mariadb-corporation/mariadb-qa/tree/master/fuzzer https://github.com/mariadb-corporation/mariadb-qa/blob/master/fuzzer/PROCEDURE https://github.com/mariadb-corporation/mariadb-qa/blob/master/fuzzer/SETUP (WIP, will be finished in next 1-2 weeks)

You may also be interested in the pquery/mariadb-qa framework (the latest iteration of which is part of mariadb-qa), which is an other excellent tool for testing as well as building/automating instances etc. - ref https://www.percona.com/blog/2015/03/17/free-mysql-qa-and-bash-linux-training-series/ (the later episodes about the framework, which is now in it's second iteration with MariaDB as linked here). And it's results (now includes some Squirrel found bugs!)

Feel free to let anyone interested reach out.

Thank you very much and have a great weekend!

zr950624 commented 2 years ago

Hi @mariadb-RoelVandePaar, Thanks for being interested in fuzzing with Squirrel. For your questions,

Question 1: You are right that we did not adopt any mutation strategy from AFL, we have our own mutator which works on IR. We didn't update that part of the code in GUI, so it might be misleading. You can check more details in our paper https://arxiv.org/pdf/2006.02398.pdf.
Question 2: Polyglot is designed for fuzzing general programming language processors, and It focuses more on the type system. But it can help us generate the frontend(parser, IR) for SQL. What I would recommend is, adopting the frontend generation part from Polyglot and keeping the mutation/fuzzing part of Squirrel.
Question 3: These files are our initial seed for SQL structure. To generate it, you can use our function parser to translate the SQL to AST, then use the method translate to make it to IR. After you have the IR, strip the identifier with extract_struct and translate it back to SQL with method to_string.

Also thanks for your information! We are glad to see Squirrel is practical to use in industry projects. Yongheng and I still work around fuzzing projects, so we are happy to look at the resource of the MariaDB-QA framework since it may inspire us for further projects.

Hope my answers help!

mariadb-RoelVandePaar commented 2 years ago

Thank you very much @zr950624, and thank you for your input, it definitely helps.

"We are glad to see Squirrel is practical to use in industry projects" - yes, it most definitely is being used well.

We have found a substantial amount of bugs with the tool you developed, and we are only just starting. I also see in the white paper that this was an intention of the project (sharing with the industry) - which is great - thank you to you, @Changochen and the rest of the team!

We use reducer.sh to reduce the testcases, and other pquery/mariadb-qa framework tools to generate full reports for all versions etc. An example of a bug here: https://jira.mariadb.org/browse/MDEV-29017

As for question 3, let me see if I understand this correctly.

So I would take some initial SQL I have, then use the parser function to translate this input SQL to AST. Where and how would the information be stored please?

Then use the translate (this is all in C++ I assume), to translate the AST to IR - again where is the info stored?

And finally use extract_struct to strip the 'identifier' (what is 'identifier' here please?), and translate it back to SQL with to_string. Could you explain why the SQL-AST-IR-SQL loop is necessary? Or is this the mutation in essence?

We have plenty of input SQL which could be used for mutation. In such a case, are the above steps still necessary, or could we simply insert the SQL as-is into the framework? And if so, how?

I have tried changing the grammar by setting up a completely new directory with new txt SQL files and pointing to the same with -i, but Squirrel still seems to be using some of the other files mentioned above, somehow (perhaps it is using global_data_lib_mysql, safe_generate_type_mysql, strip_mysql or strip_trigger_collect or the init_lib directory?)

My apologies, I am trying to better understand how to introduce a new grammar correctly into Squirrel.

mariadb-RoelVandePaar commented 2 years ago

For the last question, allow me show an example:

~/fuzzing/fuzzing_root/minimal$ ls
1.txt  10.txt  2.txt  3.txt  4.txt  5.txt  6.txt  7.txt  8.txt  9.txt

~/fuzzing/fuzzing_root/minimal$ cat *.txt
CREATE TABLE t AS SELECT 1;
INSERT INTO t VALUES (a);
DROP TABLE IF EXISTS t;
CREATE TABLE t (c INT) ENGINE=MyISAM;
CREATE TABLE t (c INT) ENGINE=InnoDB;
CREATE TABLE t (c INT) ENGINE=Aria;
CREATE TABLE t (c INT) ENGINE=MEMORY;
INSERT INTO t VALUES (1);
INSERT INTO t VALUES (0);
INSERT INTO t VALUES (NULL);

I created this example SQL and put it into ./minimal in fuzzing_root. Then I attempted to let Squirrel use this SQL by using -i ./minimal. However, on checking the fuzzer queue I see for example:

 CREATE TABLE v0 AS SELECT DISTINCT 99 , -128 , TRUE , FALSE GROUP BY 'x' , 'x' , NULL , 'x' , 'x' HAVING NULL = 1423377.000000 * 94585408.000000 ;
 CREATE TABLE IF NOT EXISTS v0 AS SELECT 5767244.000000 GROUP BY TRUE HAVING 30798334.000000 ;
 CREATE TABLE v0 AS SELECT DISTINCT TRUE , ( SELECT CASE WHEN 'x' THEN 49 END ) , 'x' ;

Which are queries clearly far away from the SQL given as input.

This SQL seems to come from init_lib - maybe init_lib is harcoded somewhere or similar in Squirrel?

How to change the grammar so only the ./minimal SQL is used (as an example)?

Changochen commented 2 years ago

Thanks @zr950624 for the answer. Let me add a few extra information.

Question 2: We don't expect it hard to adopt PolyGlot on SQL. You will need the grammar specification (BNF) of SQL and add a bit semantic annotation to it. You can see an example for C here. The semantic rules for SQL should be easier than other general purpose language. In our testing on SQLite, we don't see too much strength of PolyGlot over Squirrel. The main strength should be that you can easily modify the grammar to other SQL dialect. It is not that straight-forwards to do so in Squirrel.
Question 3: @zr950624 explained how to generate the files. I will explain a bit about what the files are for.
- init_lib: Squirrel maintains a list of interesting structures for every type of ASTs. Its mutation is all about combining SQL segments into new testcases. However, the initial corpus might not cover all syntactic structures of the grammar. So the init lib is to provide as many grammar structure as possible. As long as every grammar rule is covered by at least one SQL statement, the init lib is good enough. If I remember correctly, we just collect some random SQL files and try to cover as much grammar as possible. Since the SQL for Mariadb and MySQL is a bit similar, we also use some of the mysql init lib.
- global_data_lib_mysql are just the names of builtin SQL functions. They are used for filling in stripped SQL. For example, select X(Y), X can be filled with a function name.

Changochen commented 2 years ago

It is a bit embarrassing but for safe_generate_type_mysql, we need to further look at the code to remember what it is for. But everything under init_lib will be used for mutation. So the rule is, you cover as much grammar structure as possible in init_lib (used as the source for mutation), and use the likely-bug triggering ones as -i(the initial corpus).

Changochen commented 2 years ago

For the last question, allow me show an example:
~/fuzzing/fuzzing_root/minimal$ ls
1.txt  10.txt  2.txt  3.txt  4.txt  5.txt  6.txt  7.txt  8.txt  9.txt

~/fuzzing/fuzzing_root/minimal$ cat *.txt
CREATE TABLE t AS SELECT 1;
INSERT INTO t VALUES (a);
DROP TABLE IF EXISTS t;
CREATE TABLE t (c INT) ENGINE=MyISAM;
CREATE TABLE t (c INT) ENGINE=InnoDB;
CREATE TABLE t (c INT) ENGINE=Aria;
CREATE TABLE t (c INT) ENGINE=MEMORY;
INSERT INTO t VALUES (1);
INSERT INTO t VALUES (0);
INSERT INTO t VALUES (NULL);
I created this example SQL and put it into ./minimal in fuzzing_root. Then I attempted to let Squirrel use this SQL by using -i ./minimal. However, on checking the fuzzer queue I see for example:
 CREATE TABLE v0 AS SELECT DISTINCT 99 , -128 , TRUE , FALSE GROUP BY 'x' , 'x' , NULL , 'x' , 'x' HAVING NULL = 1423377.000000 * 94585408.000000 ;
 CREATE TABLE IF NOT EXISTS v0 AS SELECT 5767244.000000 GROUP BY TRUE HAVING 30798334.000000 ;
 CREATE TABLE v0 AS SELECT DISTINCT TRUE , ( SELECT CASE WHEN 'x' THEN 49 END ) , 'x' ;
Which are queries clearly far away from the SQL given as input.

This SQL seems to come from init_lib - maybe init_lib is harcoded somewhere or similar in Squirrel?

How to change the grammar so only the ./minimal SQL is used (as an example)?

Therefore, for this example, it is normal that the query from init_lib is used. If you want to only use the test cases in minimal, use it in both -i and as the init_lib.

Changochen commented 2 years ago

FYI: Full scale automation framework around squirrel as part of mariadb-qa. Ref. https://github.com/mariadb-corporation/mariadb-qa/tree/master/squirrel (modified clone) https://github.com/mariadb-corporation/mariadb-qa/tree/master/fuzzer https://github.com/mariadb-corporation/mariadb-qa/blob/master/fuzzer/PROCEDURE https://github.com/mariadb-corporation/mariadb-qa/blob/master/fuzzer/SETUP (WIP, will be finished in next 1-2 weeks)

You may also be interested in the pquery/mariadb-qa framework (the latest iteration of which is part of mariadb-qa), which is an other excellent tool for testing as well as building/automating instances etc. - ref https://www.percona.com/blog/2015/03/17/free-mysql-qa-and-bash-linux-training-series/ (the later episodes about the framework, which is now in it's second iteration with MariaDB as linked here). And it's results (now includes some Squirrel found bugs!)

That's AWESOME!!!! we will definitely check it out!!!

Changochen commented 2 years ago

Ok, safe_generate_type_mysql is not used at all (for some feature we finally abandon). So please ignore it LOL

mariadb-RoelVandePaar commented 2 years ago

Thank you so much @Changochen for the additional input. It makes a lot more sense now. One final question, you said "Question 3: @zr950624 explained how to generate the files" - which files would that be? The in-program mutation files (which I thus would not need to build to change the grammar if I understood correctly, or something else)?

If you don't mind could we leave this ticket open for a bit to see if additional questions arise? Thank you.

zr950624 commented 2 years ago

Thank you so much @Changochen for the additional input. It makes a lot more sense now. One final question, you said "Question 3: @zr950624 explained how to generate the files" - which files would that be? The in-program mutation files (which I thus would not need to build to change the grammar if I understood correctly, or something else)?

If you don't mind could we leave this ticket open for a bit to see if additional questions arise? Thank you.

What I explained is how I generated the init_lib. Specifically, how to convert select id ,name from student to select x, x from x, so we can put it into init_lib. If you modify the grammar, it is better to put the newly added grammar structure intoinit_lib.

mariadb-RoelVandePaar commented 2 years ago

Aha, understood. And now the syntax inside init_lib makes more sense too! Thank you very much.

@zr950624 is there perhaps an example script which takes all the steps parser>translate>extract_struct>to_string ?

zr950624 commented 2 years ago

Aha, understood. And now the syntax inside init_lib makes more sense too! Thank you very much.

@zr950624 is there perhaps an example script which takes all the steps parser>translate>extract_struct>to_string ?

We do not provide an example script at the present time. I give out the following code snippet to show the workflow. Note, I did not compile it, you may need to do some modifications in order to run it.

//// extract.cpp
/// ./extract filename
int main(int argc, char * argv[]){
    string Sql;
    ifstream Infile(argv[1]);
    unordered_set<string> SqlStructs;
    Mutator * Mutr = new Mutator();  

    while(getline(Infile, Sql)){
        if(Sql.empty()) continue;
        auto p = parser(Sql);
        if(p == NULL) continue;

        vector<IR *> IrVec;
        IR * Root = NULL;
        try{
            Root = p->translate(IrVec);
            p->deep_delete();
        }
        catch(...){
            continue;
        }

        string SqlStruct = Mutr->extract_struct(Root);
        if(SqlStructs.count(SqlStruct)){
            deep_delete(Root);
            continue;
        }

        SqlStructs.insert(SqlStruct);
        deep_delete(Root);
    }

    for(string &OutSql: SqlStructs)
        cout << OutSql << endl;

    return 0;
}

mariadb-RoelVandePaar commented 2 years ago

@zr950624 that is above and beyond. Thank you very much.

mariadb-RoelVandePaar commented 1 year ago

@zr950624 Hi again! Hope you are both doing well. I took the code, added headers:

#include <fstream>
#include <string>
#include <cstring>
#include <bits/stdc++.h>
#include "include/mutate.h"

And tried to compile, but get the following errors:

extract.cpp: In function ‘int main(int, char**)’:
extract.cpp:37:48: error: conversion from ‘void’ to non-scalar type ‘std::string’ {aka ‘std::__cxx11::basic_string<char>’} requested
    |         string SqlStruct = Mutr->extract_struct(Root);
      |                            ~~~~~~~~~~~~~~~~~~~~^~~~~~
extract.cpp:47:25: error: binding reference of type ‘std::string&’ {aka ‘std::__cxx11::basic_string<char>&’} to ‘const std::__cxx11::basic_string<char>’ discards qualifiers
    |     for(string &OutSql: SqlStructs)
      |

Do you know how to fix this? Thank you!

zr950624 commented 1 year ago

@zr950624 Hi again! Hope you are both doing well. I took the code, added headers:

#include <fstream>
#include <string>
#include <cstring>
#include <bits/stdc++.h>
#include "include/mutate.h"

And tried to compile, but get the following errors:

extract.cpp: In function ‘int main(int, char**)’:
extract.cpp:37:48: error: conversion from ‘void’ to non-scalar type ‘std::string’ {aka ‘std::__cxx11::basic_string<char>’} requested
    |         string SqlStruct = Mutr->extract_struct(Root);
      |                            ~~~~~~~~~~~~~~~~~~~~^~~~~~
extract.cpp:47:25: error: binding reference of type ‘std::string&’ {aka ‘std::__cxx11::basic_string<char>&’} to ‘const std::__cxx11::basic_string<char>’ discards qualifiers
    |     for(string &OutSql: SqlStructs)
      |

Do you know how to fix this? Thank you!

We do have 2 versions of extract_structdue to the previous messy implementation. I think what you want is extract_struct in srcs/internal/sqlite/include/mutator.h.

mariadb-RoelVandePaar commented 1 year ago

@Changochen Hi! I updated the include to mutator.h from sqlite. However, the outcome is the same:

extract.cpp: In function ‘int main(int, char**)’:
extract.cpp:37:48: error: conversion from ‘void’ to non-scalar type ‘std::string’ {aka ‘std::__cxx11::basic_string<char>’} requested
   37 |         string SqlStruct = Mutr->extract_struct(Root);
      |                            ~~~~~~~~~~~~~~~~~~~~^~~~~~
extract.cpp:47:25: error: binding reference of type ‘std::string&’ {aka ‘std::__cxx11::basic_string<char>&’} to ‘const std::__cxx11::basic_string<char>’ discards qualifiers
   47 |     for(string &OutSql: SqlStructs)
      |                         ^~~~~~~~~~

The compile command is:

g++ -o extract extract.cpp -fpermissive

What am I missing here?

mariadb-RoelVandePaar commented 1 year ago

@Changochen Hi :)

zr950624 commented 1 year ago

@Changochen Hi! I updated the include to mutator.h from sqlite. However, the outcome is the same:

extract.cpp: In function ‘int main(int, char**)’:
extract.cpp:37:48: error: conversion from ‘void’ to non-scalar type ‘std::string’ {aka ‘std::__cxx11::basic_string<char>’} requested
   37 |         string SqlStruct = Mutr->extract_struct(Root);
      |                            ~~~~~~~~~~~~~~~~~~~~^~~~~~
extract.cpp:47:25: error: binding reference of type ‘std::string&’ {aka ‘std::__cxx11::basic_string<char>&’} to ‘const std::__cxx11::basic_string<char>’ discards qualifiers
   47 |     for(string &OutSql: SqlStructs)
      |                         ^~~~~~~~~~

The compile command is:

g++ -o extract extract.cpp -fpermissive

What am I missing here?

Hi, sorry for the late reply. We are busy with our course-related work since we reach the end of the semester. Could you please share your full code (with the compile directory path as well) instead of only the error message?

mariadb-RoelVandePaar commented 1 year ago

Hi @Changochen ! Sure, no worries and thank you. Full code:

#include <fstream>
#include <string>
#include <cstring>
#include <bits/stdc++.h>
#include "include/mutator.h"

using namespace std;

int main(int argc, char * argv[]){
    string Sql;
    ifstream Infile(argv[1]);
    unordered_set<string> SqlStructs;
    Mutator * Mutr = new Mutator();  

    while(getline(Infile, Sql)){
        if(Sql.empty()) continue;
        auto p = parser(Sql);
        if(p == NULL) continue;

        vector<IR *> IrVec;
        IR * Root = NULL;
        try{
            Root = p->translate(IrVec);
            p->deep_delete();
        }
        catch(...){
            continue;
        }

        string SqlStruct = Mutr->extract_struct(Root);
        if(SqlStructs.count(SqlStruct)){
            deep_delete(Root);
            continue;
        }

        SqlStructs.insert(SqlStruct);
        deep_delete(Root);
    }

    for(string &OutSql: SqlStructs)
        cout << OutSql << endl;

    return 0;
}

And compile command (version 9.4.0):

g++ -o extract extract.cpp -fpermissive

And I manually copied in the required files to ./include and ./parser (i.e. it becomes 'standalone'):

$ ls include/
ast.h  define.h  mutator.h  utils.h
$ ls parser/
bison_parser.h  flex_lexer.h  parser_typedef.h

Thank you!

zr950624 commented 1 year ago

Can you check mutator.h for the return value of extract_struct method.

Changochen commented 1 year ago

Hi @mariadb-RoelVandePaar, sorry for the late reply since I don't check closed issues.

For your question, we have two versions of extract_struct as mentioned above.

The first one is https://github.com/s3team/Squirrel/blob/b9fc9b7785ac5f3368d5c172ce6f9d1c113c0499/srcs/internal/sqlite/include/mutator.h#L43

The second one is https://github.com/s3team/Squirrel/blob/b9fc9b7785ac5f3368d5c172ce6f9d1c113c0499/srcs/internal/mysql/include/mutator.h#L65

I think you are using the wrong one according to the errors you get.

Would you mind double-checking?

If you want to use the return void version, then you can call Root->to_string() afterward to get an std::string.

s3team / Squirrel

Parallel fuzzing, etc. #11