Inc/Exc priority rules precedence

zette commented 9 years ago

It seems that right now the includes (path, regex, extensions) are processed always before exclusions. This leads to situation that the data that we explicitely wanted to backup, are not processed because of following exclusions. Ie.: rule 1: max_file_size=100Mb rule 2: include_ext=ost ->outlook profile larger than 100Mb rule 3: include_ext=pst ->outlook archive larger than 100Mb

It would be great if the first limit could have been overriden by explicitly set include rules. In general when includes/excludes hierarchy is build it could be build in order by which the rules where put in configuration files, or order of incexc files included, giving the priority for later rules.

Of course rules logic could also be nice :) (max_file_size=100Mb and not (file_ext=ost or file_ext=pst))

grke commented 9 years ago

Hello,

Thanks for your idea, but I don't like the rules hierarchy thing for these reasons:

a) It is very complicated to implement (for various reasons that I won't go into). b) It would be difficult for people to understand. It is not obvious what will be different if rule 1 was after rule 2, for example. c) Upgrading from previous versions will result in unexpected behaviour since the ordering of options would now matter. d) I think you haven't understood what 'include_ext' is supposed to do. Using 'include_ext' means that only files with the given extensions will be considered for backing up. And you don't want to back them up if they are larger than 100Mb, but you do want them to be backed up(!)

The logic idea seems better. I guess it could look something like this: include = C:/ exclude_logic = (file_size>100Mb and not (file_ext=ost or file_ext=pst)) exclude_logic = (file_size>200Mb and file_ext=ost) exclude_logic = (file_size>300Mb and file_ext=pst) This also looks complicated to implement, and I don't imagine that I will get around to it any time soon. Maybe in burp-3.

juodumas commented 8 years ago

Posting my use case for include groups (per hierarchies rules) and two suggestions about how they could be handled. Slightly rewritten from the variant I posted in #414. Please tell me what you think about the suggestions when you get a chance.

Use Case

# Hierarchy 1
include = c:/users/me
include_ext = doc
include_ext = xls

# Hierarchy 2 (don't want the above include_ext to apply to this hierarchy)
include = c:/app/app.exe
include = c:/app/data
exclude_regex = /bak/$

Solution 1 (implement `include_regex`)

include = c:/users/me
include = c:/app
include_regex = ^c:/users/me/.+\.(xls|doc)$|^c:/app
exclude_regex = ^c:/app/.+/bak/$

It would work on files found by include/include_glob directives. This means that include=... or include_glob=... would be required to use include_regex. Should be rather simple to implement and fit a number of use cases. But real per hierarchy rules would still be needed, because complex regexes are too hard on the eyes.

Solution 2 (ini style categories)

[hierarchy 1]
include = c:/users/me
include_ext = doc
include_ext = xls

[hierarchy 2]
include = c:/app
include = c:/app/data
exclude_regex = /bak/$

This obviously doesn't fit well with current config files because old configs would need to be put under a [general] section. Workarounds:

Use a separate file for include groups and load it with a new directive like . [path], e.g. .hierarchies [path].
Use the same config file, but wrap include groups:
```
[hierarchy]
include = c:/app
[/hierarchy]
```

grke commented 8 years ago

I think that include_regex is impossible to implement efficiently.

juodumas commented 8 years ago

Per my suggestion, include_regex would work exactly like exclude_regex, except negated. It would not include paths described in the regex, but rather filter paths already included with include and include_glob. So it would be as efficient as exclude_regex.

grke commented 8 years ago

OK, there is another issue for include_regex, and I posted your idea there: https://github.com/grke/burp/issues/3

ziirish commented 6 years ago

Hi,

I've been prototyping a PoC for a few days. It's currently implemented in python but I tried to keep as "low-level" as I could (ie. no Object use or the like).

Here is what it looks like:

$ python parser2.py -f burp.c -s 10000 'file_size>20Kb or file_ext=c' '(file_size>20Kb and file_ext=so) or file_match=burp' 'file_ext=c and file_size==20Kb' 'file_match=[bB].*\.c and file_size<=10Kb' 
--------------------
Evaluating 'file_size>20Kb or file_ext=c' with filename: 'burp.c', size: 10000 -> True
--------------------
--------------------
Evaluating '(file_size>20Kb and file_ext=so) or file_match=burp' with filename: 'burp.c', size: 10000 -> True
--------------------
--------------------
Evaluating 'file_ext=c and file_size==20Kb' with filename: 'burp.c', size: 10000 -> False
--------------------
--------------------
Evaluating 'file_match=[bB].*\.c and file_size<=10Kb' with filename: 'burp.c', size: 10000 -> True
--------------------
$ python parser2.py -f Burp.c -s 10000 'file_size>20Kb or file_ext=c' '(file_size>20Kb and file_ext=so) or file_match=burp' 'file_ext=c and file_size==20Kb' 'file_match=[bB].*\.c and file_size<=10Kb'
--------------------
Evaluating 'file_size>20Kb or file_ext=c' with filename: 'Burp.c', size: 10000 -> True
--------------------
--------------------
Evaluating '(file_size>20Kb and file_ext=so) or file_match=burp' with filename: 'Burp.c', size: 10000 -> False
--------------------
--------------------
Evaluating 'file_ext=c and file_size==20Kb' with filename: 'Burp.c', size: 10000 -> False
--------------------
--------------------
Evaluating 'file_match=[bB].*\.c and file_size<=10Kb' with filename: 'Burp.c', size: 10000 -> True
--------------------

Do you think of any other function/criteria to implement?

Here are the implementation details:

tokenize the expression => convert file_size>20Kb or file_ext=c into a list: [False, or_func, True]
recursively looking for triplets value1-func-value2 and then evaluate them like `func(value1, value2) until we don't have anything to look for anymore

The downside of my implementation is we have to evaluate the expression for every file!

This work has been heavily inspired by this stackoverflow reply.

parser2.txt

ziirish commented 6 years ago

I added support for negations:

$ python parser2.py -f nyan.so -s 200000000 'file_ext=so and not file_match=nyan' 'file_ext=so and not not file_match=nyan' 'file_ext=so and not file_match=burp' '(file_size>100Mb and not (file_ext=ost or file_ext=pst))'
--------------------
Evaluating 'file_ext=so and not file_match=nyan' with filename: 'nyan.so', size: 200000000 -> False
--------------------
--------------------
Evaluating 'file_ext=so and not not file_match=nyan' with filename: 'nyan.so', size: 200000000 -> True
--------------------
--------------------
Evaluating 'file_ext=so and not file_match=burp' with filename: 'nyan.so', size: 200000000 -> True
--------------------
--------------------
Evaluating '(file_size>100Mb and not (file_ext=ost or file_ext=pst))' with filename: 'nyan.so', size: 200000000 -> True
--------------------
$ python parser2.py -f nyan.pst -s 200000000 'file_ext=so and not file_match=nyan' 'file_ext=so and not not file_match=nyan' 'file_ext=so and not file_match=burp' '(file_size>100Mb and not (file_ext=ost or file_ext=pst))'
--------------------
Evaluating 'file_ext=so and not file_match=nyan' with filename: 'nyan.pst', size: 200000000 -> False
--------------------
--------------------
Evaluating 'file_ext=so and not not file_match=nyan' with filename: 'nyan.pst', size: 200000000 -> False
--------------------
--------------------
Evaluating 'file_ext=so and not file_match=burp' with filename: 'nyan.pst', size: 200000000 -> False
--------------------
--------------------
Evaluating '(file_size>100Mb and not (file_ext=ost or file_ext=pst))' with filename: 'nyan.pst', size: 200000000 -> False
--------------------

parser2.txt

ziirish commented 6 years ago

I started converting my PoC into C code integrated with burp, but I have a question regarding the evaluation of this feature. I'd like to implement both include_logic and exclude_logic options, but the include_logic part is more complicated than I thought.

The current matching code is like this:

static int my_send_file_w(struct asfd *asfd, struct FF_PKT *ff, bool top_level, struct conf **confs)
{
  if(!file_is_included(confs, ff->fname, top_level)) return 0;

  if(S_ISREG(ff->statp.st_mode) && !file_size_match(ff, confs)) return 0;

   /* rest of the code */
}

So I can easily add the is_logic_excluded check in the file_is_included function, but what about the is_logic_included?

My problem is with the file_size_match function because the logic expression can have the file_size check, or not. In case the expression has the file_size check, then we should/could bypass the file_size_match.

Do you have any idea on how to implement the include_logic function?

ziirish commented 6 years ago

I have a working implementation of the logic parser and evaluator:

$ ./burp -c /tmp/burp-server.conf
'file_size<10Kb or file_ext=so' with filename: burp.c, size: 10485760 => 0
'file_size>5Mb and file_ext=so' with filename: burp.c, size: 10485760 => 0
'file_size>5Mb and file_match=burp' with filename: burp.c, size: 10485760 => 1

Now I need to cleanup a bit the code (there are some memory leaks :( ) and then I'll write some unit tests.

I'm still looking for help on the include_logic implementation though.

ziirish commented 6 years ago

I have now fixed the memory leaks and some other issues I encountered while debugging those leaks.

Here are some more tests:

$ ./burp -c /tmp/burp-server.conf
------------------------------------
'file_size<10Kb or file_ext=so' with filename: burp.c, size: 10485760 => 0
'file_size>5Mb and file_ext=so' with filename: burp.c, size: 10485760 => 0
'file_size>5Mb and file_match=burp' with filename: burp.c, size: 10485760 => 1
'file_size<10Kb or file_ext=so' with filename: burp.c, size: 10485760 => 0
'file_size>=10Mb and (file_ext=ost or file_ext=pst)' with filename: burp.c, size: 10485760 => 0
'file_size>=10Mb and not (file_ext=ost or file_ext=pst)' with filename: burp.c, size: 10485760 => 1
'file_size<5Mb or (file_match=.*\..* and not (file_ext=c or file_ext=ost))' with filename: burp.c, size: 10485760 => 0
------------------------------------
------------------------------------
'file_size<10Kb or file_ext=so' with filename: mail.ost, size: 52428800 => 0
'file_size>5Mb and file_ext=so' with filename: mail.ost, size: 52428800 => 0
'file_size>5Mb and file_match=burp' with filename: mail.ost, size: 52428800 => 0
'file_size<10Kb or file_ext=so' with filename: mail.ost, size: 52428800 => 0
'file_size>=10Mb and (file_ext=ost or file_ext=pst)' with filename: mail.ost, size: 52428800 => 1
'file_size>=10Mb and not (file_ext=ost or file_ext=pst)' with filename: mail.ost, size: 52428800 => 0
'file_size<5Mb or (file_match=.*\..* and not (file_ext=c or file_ext=ost))' with filename: mail.ost, size: 52428800 => 0
------------------------------------
------------------------------------
'file_size<10Kb or file_ext=so' with filename: mail.log, size: 209715200 => 0
'file_size>5Mb and file_ext=so' with filename: mail.log, size: 209715200 => 0
'file_size>5Mb and file_match=burp' with filename: mail.log, size: 209715200 => 0
'file_size<10Kb or file_ext=so' with filename: mail.log, size: 209715200 => 0
'file_size>=10Mb and (file_ext=ost or file_ext=pst)' with filename: mail.log, size: 209715200 => 0
'file_size>=10Mb and not (file_ext=ost or file_ext=pst)' with filename: mail.log, size: 209715200 => 1
'file_size<5Mb or (file_match=.*\..* and not (file_ext=c or file_ext=ost))' with filename: mail.log, size: 209715200 => 1
------------------------------------

grke commented 6 years ago

Sorry, I haven't had a chance to look at this at all. Hopefully I can do it sometime soon.

grke commented 6 years ago

Looks very good!

I figured out something that is bugging me - I don't think the strsplit_w() and strreplace_w() functions belong in src/alloc.c, because that is for the lowest level allocating and free-ing stuff (malloc,realloc,strdup,etc) - I think they should be in a separate file.

I don't think you should need to create extra free_X() functions. free_p() doesn't NULL the pointer, and neither does free_c(), which makes me nervous. The one call to free_p() could just be a free_v((void **)ptr). I don't quite follow what free_c() is doing yet. ... Maybe the previous xfree_list() was broken? - it probably never had a unit test whilst it was in the glob_windows stuff.

grke commented 6 years ago

In case that was confusing - there are two main points: 1) only memory allocating system calls should be in alloc.c. 2) Any calls to 'free()' need to be in alloc.c, and they need to NULL the things they free so that those things can never be double-freed.

ziirish commented 6 years ago

I understand your point maybe they can be added to the handy.c file?
Something bothered me with your free_v function: if(!ptr || !*ptr) return;. This test prevented me to free my containers. I'll give it another try, but the free_v and free_w functions gave me some headaches.

ziirish commented 6 years ago

I moved the split/replace functions in handy and added two "_noescaped" variant to prevent replacing/splitting escaped chars as discussed in #754.

I did some other tests and was able to get rid of the free_p function, but I still need the free_c variant because of this:

99%: Checks: 351, Failures: 1, Errors: 0                                                                                                         
utest/test_alloc.c:16:F:Core:test_find:0: Assertion 'free_count==alloc_count' failed

==24699==
==24699== HEAP SUMMARY:
==24699==     in use at exit: 224 bytes in 3 blocks
==24699==   total heap usage: 1,057 allocs, 1,054 frees, 47,050 bytes allocated
==24699==
==24699== 224 bytes in 3 blocks are definitely lost in loss record 1 of 1
==24699==    at 0x4C2BBAF: malloc (vg_replace_malloc.c:299)
==24699==    by 0x10F62C: malloc_w (alloc.c:72)
==24699==    by 0x11E38E: charsplit_noescaped_w (handy.c:742)
==24699==    by 0x127860: create_token_list (find_logic.c:200)
==24699==    by 0x127860: parse_expression (find_logic.c:215)
==24699==    by 0x127860: eval_expression (find_logic.c:623)
==24699==    by 0x10F293: real_main (prog.c:472)
==24699==    by 0x10F293: main (prog.c:602)
==24699==
==24699== LEAK SUMMARY:
==24699==    definitely lost: 224 bytes in 3 blocks
==24699==    indirectly lost: 0 bytes in 0 blocks
==24699==      possibly lost: 0 bytes in 0 blocks
==24699==    still reachable: 0 bytes in 0 blocks
==24699==         suppressed: 0 bytes in 0 blocks
==24699==
==24699== For counts of detected and suppressed errors, rerun with: -v
==24699== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

Where charsplit_noescaped_w (handy.c:742) is if(!(ret=malloc_w((count+1)*sizeof(char *), func))) goto end; so your free_v function doesn't deal well with char **.

ziirish commented 6 years ago

While running some tests, I thought it could be nice to have a path_match test and a file_match test with the path_match acting on the full pathname (/home/user/file1) while the file_match would only evaluate the filename (file1).

I think the current implementation of the file_match (which would become path_match) is not really intuitive and make it hard to implement complex rules such as exclude all filenames matching this regex and being bigger than 50Mb

What do you think?

grke commented 6 years ago

diff --git a/src/handy.c b/src/handy.c
index ce86f3a3..108f2339 100644
--- a/src/handy.c
+++ b/src/handy.c
@@ -802,7 +802,7 @@ void free_list_w(char **list, size_t size)
                for(i=0; i<size; i++)
                        if(list[i]) free_w(&list[i]);
        }
-       free_c((void **)list);
+       free_v((void **)&list);
 }

 // Strip any trailing slashes (unless it is '/').

This seems to work for me, but I would prefer to pass the variable into free_list_w() as char ***list, so that the original ends up NULLed (in the thing that calls free_list_w()).

Like this:

free_patch.txt

grke commented 6 years ago

By the way, I haven't figured out how size can be <0 in free_list_w(). How is that possible?

grke commented 6 years ago

I think it is a good idea to be able to choose to match the actual filename or the full path.

ziirish commented 6 years ago

OK, thanks, I'll apply your patch.

The char ** returned by the split functions should be null terminated, so you can use the free_list_w with a size <0 so the functions iterate through the list until it finds NULL.

I actually ported this function from an old project of mine where I used to modify my list size without being able to properly update its length hence the "-1" trick.

ziirish commented 6 years ago

hm, ok I see the problem with the size<0, size_t is an unsigned int, so it is impossible to have it <0.

I guess I'll just remove that part since it is not useful for burp.

ziirish commented 6 years ago

The pull request is now up to date with all the fixes and I have also implemented the 'path_match' test.

grke / burp