Tagging feature to somehow mark interesting files

axxel commented 7 years ago

Somewhat related to #30 is the fact that the different users of the sample set have different needs. As I am working on RawSpeed, I am interested in a set of files that covers as many (distinct) camera models as possible, while being minimal at the same time. My current regression test with ASAN enabled takes a few minutes on the complete set, which is not really sexy already. What I do not care about is different ISO settings for the same model or even different files coming from the very same hardware just with a different branding. Different ISO samples on the other hand are very useful to come up with proper WhitePoint numbers.

Bottom line: Ideally I'd like to be able to somehow mark files as 'not interesting' for RawSpeed regression tests and then later download/sync only those that are interesting (see #25). The question remains how I would find out which are interesting and which are not.

A good first guess would be to have only one from each "make / model / mode".

I could envision some kind of automatism using code coverage analysis and then adding files and one that does not increase the coverage any more gets to be ignored in the future.
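
A minimal sketch of that coverage idea, assuming the per-file coverage has already been collected somehow (for example as sets of "file:line" strings from a coverage run). The function name, the sample file names and the coverage entries are all made up for illustration; nothing here is an existing RawSpeed or r.p.u tool. It walks the files greedily and ignores every file that adds no new coverage:

```python
from typing import Dict, List, Set


def minimize_by_coverage(coverage: Dict[str, Set[str]]) -> List[str]:
    """Greedily keep only the files that add coverage not seen so far."""
    covered: Set[str] = set()
    kept: List[str] = []
    # Visit the files with the largest coverage first (ties broken by name),
    # so that smaller, redundant files are more likely to be skipped.
    for name in sorted(coverage, key=lambda n: (-len(coverage[n]), n)):
        new_items = coverage[name] - covered
        if new_items:
            kept.append(name)
            covered |= new_items
    return kept


# Hypothetical coverage data: file name -> set of covered "source:line" items.
sample = {
    "a.cr2": {"cr2.cpp:10", "cr2.cpp:11", "common.cpp:5"},
    "b.cr2": {"cr2.cpp:10", "common.cpp:5"},
    "c.nef": {"nef.cpp:3", "common.cpp:5"},
}

if __name__ == "__main__":
    print(minimize_by_coverage(sample))
```

A greedy pass like this is not guaranteed to find the smallest possible set, but it is cheap and keeps the total coverage of the full set.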

Not sure if all that would be worth it, it would be just nice if I had it (right now) ;).

LebedevRI commented 7 years ago

> What I do not care about is different ISO settings for the same model ... Different ISO samples on the other hand are very useful to come up with proper WhitePoint numbers.

We do not take such iso/shutter/aperture sets anyway.

> A good first guess would be to have only one from each "make / model / mode".

We are currently trying to keep one raw per "$make $model $mode", where $mode can contain raw compression, bitness, aspect ratio. Which is what you seem to ask for.

axxel commented 7 years ago

> We do not take such iso/shutter/aperture sets anyway.

But you sometimes need a complete iso-set and ask people if they could provide one. It is likely someone would use this new tool to provide them to you, right? You could of course delete them later again... or keep them, since you have them but somehow tag them. Since I don't really care about them, I am fine with getting rid of them.

> We are currently trying to keep one raw per "$make $model $mode", where $mode can contain raw compression, bitness, aspect ratio. Which is what you seem to ask for.

Absolutely. Have you tried to find out if all the files in the current set are unique regarding this 'key'?
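
Such a check could be sketched roughly as follows, assuming each sample is available as a record with `make`, `model`, `mode` and `file` fields. Those field names, the normalization (trim and lowercase) and the sample data are assumptions for illustration, not the actual r.p.u metadata layout:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str, str]


def find_key_duplicates(samples: Iterable[Dict[str, str]]) -> Dict[Key, List[str]]:
    """Group samples by (make, model, mode) and return keys with more than one file."""
    groups: Dict[Key, List[str]] = defaultdict(list)
    for sample in samples:
        # Normalize case and whitespace so "Canon " and "canon" count as the same make.
        key = tuple(sample[field].strip().lower() for field in ("make", "model", "mode"))
        groups[key].append(sample["file"])
    return {key: sorted(files) for key, files in groups.items() if len(files) > 1}


# Hypothetical records; real metadata would come from the archive itself.
records = [
    {"make": "Canon", "model": "EOS 5D", "mode": "raw", "file": "one.cr2"},
    {"make": "canon", "model": "EOS 5D ", "mode": "raw", "file": "two.cr2"},
    {"make": "Canon", "model": "EOS 5D", "mode": "sraw1", "file": "three.cr2"},
]

if __name__ == "__main__":
    for key, files in find_key_duplicates(records).items():
        print(key, files)
```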

LebedevRI commented 7 years ago

> But you sometimes need a complete iso-set and ask people if they could provide one.

Yes.

> It is likely someone would use this new tool to provide them to you, right?

No, I'm quite sure that it won't happen. That is just out of the question. Since the beginning, the goal of r.p.u was to have unique samples. And different iso/aperture/etc. is not a criterion that is taken into account. So the whole r.p.u archive should be just fine.

> You could of course delete them later again... or keep them, since you have them but somehow tag them. Since I don't really care about them, I am fine with getting rid of them.

Right now, in case of such duplicates, just one is picked and verified, and the rest are deleted.

> Absolutely. Have you tried to find out if all the files in the current set are unique regarding this 'key'?

Well, I'm pretty sure that there were no duplicates in the samples imported from the rawsamples.ch set, and there certainly are no duplicates in the samples that were uploaded to us and that we verified. So no, I did not try, but it should be good as-is.

> I could envision some kind of automatism using code coverage analysis and then adding files and one that does not increase the coverage any more gets to be ignored in the future.

BTW, just in case you are not aware of its existence, there is afl-cmin: https://github.com/mirrorer/afl/blob/master/afl-cmin But I would be surprised if it works with such big files. At least afl itself did not, last time I checked.

http://llvm.org/docs/LibFuzzer.html should work better, but I did not really check. Quoting its documentation:

> If you have a large corpus (either generated by fuzzing or acquired by other means)
> you may want to minimize it while still preserving the full coverage.
> One way to do that is to use the -merge=1 flag:

mkdir NEW_CORPUS_DIR  # Store minimized corpus here.
./my_fuzzer -merge=1 NEW_CORPUS_DIR FULL_CORPUS_DIR

RawSpeed fuzzing is a somewhat far-away target...

andabata commented 7 years ago

All in all, I think that adding a tagging feature + custom download is a bit out of the current scope. If this needs to be implemented, it would also mean that some kind of user administration is needed. And apart from admins, it's not really something I want to make right now.

patdavid commented 7 years ago

I agree with @andabata. Though, it might be a fun exercise in storing tagged hashes on the client side with localstorage... :)

pixlsus / raw

Tagging feature to somehow mark interesting files #32