marco-c / autowebcompat

Automatically detect web compatibility issues
Mozilla Public License 2.0

Label dataset #2

Open marco-c opened 6 years ago

marco-c commented 6 years ago

The labeling can be performed using the label.py script.

This script shows you a pair of screenshots at a time. Press 'y' to label them as compatible; 'd' to label them as compatible with content differences (e.g. on a news site, two screenshots can be compatible even though they show different stories, simply because the stories shown depend on when the screenshot was taken and not on the browser); 'n' to label them as not compatible; 'RETURN' to skip them (in case you are not sure yet); or 'ESCAPE' to end the current labeling session and store the results so far.

More details about the three-label system are available in the documentation at https://github.com/marco-c/autowebcompat#labeling.
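
The key bindings described above can be sketched as a small dispatch function. This is only an illustration of the labeling flow, not the code in label.py: the names `KEY_TO_LABEL` and `handle_key`, the string key names, and the dictionary used to hold labels are all assumptions.

```python
# Hypothetical mapping from a pressed key to a stored label.
# These names are illustrative and are not taken from label.py.
KEY_TO_LABEL = {
    "y": "y",  # compatible
    "d": "d",  # compatible, with content differences
    "n": "n",  # not compatible
}


def handle_key(key, pair_id, labels):
    """Update `labels` for one keypress and return the next action.

    Returns "next" to move to the following pair, "quit" to end the
    session, or "ignore" for keys that have no meaning here.
    """
    if key in KEY_TO_LABEL:
        labels[pair_id] = KEY_TO_LABEL[key]
        return "next"
    if key == "RETURN":
        return "next"  # skipped: no label is stored for this pair
    if key == "ESCAPE":
        return "quit"  # the caller is expected to save `labels` now
    return "ignore"
```

Skipped pairs simply never get an entry, so a later session could presumably pick them up again.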

iamvc7 commented 6 years ago

@marco-c A CNN mostly learns patterns in the image (edges, corners, and their correlations). From example 2, it is evident that it will be difficult for a NN to handle such adversarial cases and classify both screenshots as compatible.

To better separate Y+D from N, or even Y from D+N, I think we could focus on finding ROIs (attention-based) and feeding those patches to the NN. This could be our fallback if nothing works well after the training step you suggested.
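
As a rough illustration of the "find a ROI, then feed the patch" idea, the sketch below swaps the attention-based approach for something much simpler: the bounding box of the pixels that differ between two equally sized grayscale screenshots. The function names and the nested-list image representation are assumptions made for the example.

```python
def diff_bounding_box(img_a, img_b, threshold=0):
    """Return (top, left, bottom, right) of the region where two equally
    sized grayscale images differ, or None if they match everywhere.

    Images are lists of rows of integer pixel values; bottom and right
    are exclusive.
    """
    # Rows that contain at least one differing pixel.
    rows = [r for r, (row_a, row_b) in enumerate(zip(img_a, img_b))
            if any(abs(a - b) > threshold for a, b in zip(row_a, row_b))]
    if not rows:
        return None
    # Columns that contain at least one differing pixel in those rows.
    cols = [c for c in range(len(img_a[0]))
            if any(abs(img_a[r][c] - img_b[r][c]) > threshold for r in rows)]
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def crop(img, box):
    """Cut the patch described by `box` out of `img`."""
    top, left, bottom, right = box
    return [row[left:right] for row in img[top:bottom]]
```

The cropped patches from both screenshots could then be passed to the network instead of the full images. A plain pixel difference will also flag harmless content changes, which is presumably where a learned attention mechanism would do better.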

nok commented 6 years ago

At the beginning, I would start with screenshots based on identical page sources (same content), so only Y vs D+N. Furthermore, I would try to normalise the device settings to bring the page rendered in Firefox closer to the one rendered in Chrome. And maybe we could remove the system look-and-feel elements by injecting a small script before the screenshot is taken.
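
A minimal sketch of the script-injection idea, assuming the crawler drives the browser through a Selenium WebDriver (whose `execute_script` and `save_screenshot` methods are used here). The function name, the constant, and the choice to hide scrollbars are assumptions; which look-and-feel elements actually need neutralising would have to be worked out per platform.

```python
# Illustrative script: hides the page scrollbars, which each browser and
# platform draws differently. The choice of what to hide is an assumption.
HIDE_SCROLLBARS_JS = "document.documentElement.style.overflow = 'hidden';"


def take_normalised_screenshot(driver, path):
    """Inject the script, then save a screenshot to `path`.

    `driver` is expected to be a Selenium WebDriver, or any object that
    exposes execute_script and save_screenshot.
    """
    driver.execute_script(HIDE_SCROLLBARS_JS)
    return driver.save_screenshot(path)
```

Running the same injected script in both browsers keeps the normalisation step identical on each side of the comparison.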

Shashi456 commented 6 years ago

@marco-c I'd like to label parts of our dataset. How do you suggest I go about doing that? As far as I've seen, there is no script that merges labels from the label_persons directory into the actual labels directory.

sagarvijaygupta commented 6 years ago

@Shashi456 I think you are talking about generate_labels.py.

Shashi456 commented 6 years ago

@sagarvijaygupta Oh, I thought it wasn't updated for the new files :P. Regardless, shouldn't we spend some time labeling the dataset? We may need it this summer.

marco-c commented 6 years ago

> @marco-c I'd like to label parts of our dataset. How do you suggest I go about doing that? As far as I've seen, there is no script that merges labels from the label_persons directory into the actual labels directory.

The script hasn't been updated yet to deal with bounding boxes, but you can already start labeling and pushing your labels file to the repo. Then, once the script is done, we will combine your labels with the labels from other people.
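
One way to combine labels from several people is a majority vote per screenshot pair. The sketch below is an assumption about how such a merge could work, not a description of generate_labels.py: the function name, the in-memory dictionaries, and the tie-handling rule are hypothetical, and bounding boxes are not handled.

```python
from collections import Counter


def merge_labels(per_person):
    """Combine labels from several people by majority vote.

    `per_person` maps a person's name to a {pair_id: label} dict.
    Pairs whose top label is tied are left out so they can be reviewed.
    """
    # Count the votes each label received for every pair.
    votes = {}
    for labels in per_person.values():
        for pair_id, label in labels.items():
            votes.setdefault(pair_id, Counter())[label] += 1

    # Keep only pairs with a single, unambiguous winner.
    merged = {}
    for pair_id, counter in votes.items():
        ranked = counter.most_common(2)
        if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
            merged[pair_id] = ranked[0][0]
    return merged
```

Leaving tied pairs out, rather than picking a winner arbitrarily, keeps disagreements visible until someone resolves them.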

sdv4 commented 6 years ago

I am running label.py on my Mac, and I am finding that it is slow or unresponsive on non-y images. For instance, it takes a long time from when I try to draw a bounding box to when it shows up, and for the 'T', the resize arrow, and the move arrow to show up. Clicking on any of them causes everything to disappear until I release the mouse, plus a couple of seconds.

Is this a problem that anyone else has come up against?

marco-c commented 6 years ago

It could be a Mac issue, I think nobody has tested it on a Mac yet. Could you try in a Linux VM?

sdv4 commented 6 years ago

@marco-c I am not having that problem on the Linux VM, so I can label a lot faster now. A couple of questions:

sdv4 commented 6 years ago

Also, how would you label a pair of images when they show the same page except that one is in English and the other in Italian?

sagarvijaygupta commented 6 years ago

@sdv4 you can refer to #220 until it is merged. Those screenshots were labeled by @marco. For the last one, you should mark them as incompatible and draw the bounding box on the Italian side.

marco-c commented 6 years ago

> Getting my labels into the main repo: Should I open a PR for a new branch off of my forked master that is the same as the upstream master, except that it includes my new labels?

Yes! You can open a PR that says "Add some labels from Shane Sims".

marco-c commented 6 years ago

Are the other two questions answered by #220?

sagarvijaygupta commented 6 years ago

@marco-c For the scroll one we have marked them as incompatible in screenshots, and for the Italian one, in #220 we mark bounding boxes on the Italian side as an incompatibility.

marco-c commented 6 years ago

> For the scroll one we have marked them as incompatible in screenshots

IIRC I marked them as compatible, didn't I?

marco-c commented 6 years ago

No, maybe not. They should be incompatible (e.g. if clicking on a button causes a scroll in one browser, it should cause a scroll in the other browser too).

sagarvijaygupta commented 6 years ago

https://github.com/marco-c/autowebcompat/blob/b18eae0999b6389b1cc84153f43b077eddccd9d8/collect.py#L162

And if this script works differently on the two browsers, should that also be an incompatibility?

marco-c commented 6 years ago

> And if this script works differently on the two browsers, should that also be an incompatibility?

It shouldn't, but it's hard to tell whether it was this script that failed or something else. Maybe we should just assume this always works.

sagarvijaygupta commented 6 years ago

Okay!

Shashi456 commented 6 years ago

@marco-c @sagarvijaygupta While I was labeling the dataset, one of the major themes that popped up was that Chrome had a scrollbar. Almost all pairs with a scrollbar are very similar, but the scrollbar adds a shift that makes the overlay look incompatible.

Should we update the crawler options for Chrome to remove the scrollbar, or add a note about this to the labeling guide?

sagarvijaygupta commented 6 years ago

@Shashi456 it is already removed from the crawler.