montera34 / pageonex

PageOneX. Analyzing front pages
http://pageonex.com
GNU Affero General Public License v3.0

Export Area Raw Data has a bug in area_y2 for some of the areas #224

Open numeroteca opened 5 years ago

numeroteca commented 5 years ago

I'm looking for a pattern to locate the problem in the generated JSON. Not all areas are affected.

The value is set as area_y2: ha.areas.first.y2, see https://github.com/montera34/pageonex/blob/master/app/models/threadx.rb#L315

numeroteca commented 3 years ago

Areas are defined by the coordinates of the first corner (top left: area_x1, area_y1) and the coordinates of the opposite corner (bottom right: area_x2, area_y2). In addition, the width and height of each area are calculated and included in the downloadable JSON file (area_width and area_height).

This bug can be seen in this file: http://pageonex.com/numeroteca/corrupcion-spain-enero-2013/raw.json Area 299 has an area_y2 value greater than the height of the image (which is 1049).

Screenshot from 2021-09-14 01-11-37

The height of the rectangle, area_height, is calculated correctly according to the rectangle in the thread, but computing it as area_y2 - area_y1 gives a wrong number.
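A minimal sketch of that consistency check, assuming the raw JSON has been flattened into an array of objects carrying the area_y1, area_y2 and area_height fields. The function name findInconsistentAreas, the id field, the imageHeight parameter, the tolerance and the sample values are all hypothetical, chosen only for illustration; they are not taken from the real export.

```javascript
// Flag areas whose corner coordinates disagree with the exported height.
// Assumption: each area is a plain object with numeric area_y1, area_y2
// and area_height fields; `id` is a hypothetical label for reporting.
function findInconsistentAreas(areas, imageHeight, tolerance = 1) {
  return areas
    .map((a) => ({
      id: a.id,
      heightFromCorners: a.area_y2 - a.area_y1,
      exportedHeight: a.area_height,
      beyondImage: a.area_y2 > imageHeight,
    }))
    .filter(
      (r) =>
        r.beyondImage ||
        Math.abs(r.heightFromCorners - r.exportedHeight) > tolerance
    );
}

// Illustrative values only, not taken from the real export.
const sampleAreas = [
  { id: "a", area_y1: 100, area_y2: 300, area_height: 200 },
  { id: "b", area_y1: 400, area_y2: 1200, area_height: 400 },
];
const suspicious = findInconsistentAreas(sampleAreas, 1049);
console.log(suspicious.map((r) => r.id)); // only "b" is flagged
```

The small tolerance is there so that rounding differences are not reported as bugs.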

I've looked at the 4-year-old compilation of colorcorrupcion threads, and the behavior does not seem to follow a recognizable pattern or correlate with date, month, year or newspaper size. There are buggy area_y2 values in all newspapers, dates and topics.

The diff column shows the difference between height_new, the height calculated from the raw JSON file data, and area_height, the height calculated by Ruby: diff = height_new - area_height. The area_height value seems to be correct when checked visually in a thread.

I've searched for y2 in the repository https://github.com/montera34/pageonex/search?p=1&q=y2 and I don't see anything odd. I am looking for a wrong calculation at the point where the data are exported to the raw JSON areas file, which seems to be defined here: https://github.com/montera34/pageonex/blob/master/app/models/threadx.rb#L300.

Screenshot from 2021-09-14 01-25-46

Looking now at the ratio height_new / area_height, I see values clustering around 1, 2, 3 and 4, most strongly at 2, whereas the expected behaviour would be for all of them to be 1:

Screenshot from 2021-09-14 01-38-42
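The ratio analysis can be sketched as follows, assuming each area is a plain object with numeric area_y1, area_y2 and area_height fields and that height_new is area_y2 - area_y1. The function name ratioHistogram and the sample values are hypothetical and only illustrate the binning; they do not reproduce the real data.

```javascript
// Count how many areas fall near each integer ratio height_new / area_height.
// Assumption: areas are plain objects with numeric area_y1, area_y2 and
// area_height; areas with a zero or missing area_height are skipped.
function ratioHistogram(areas) {
  const counts = new Map();
  for (const a of areas) {
    if (!a.area_height) continue; // avoid division by zero
    const heightNew = a.area_y2 - a.area_y1;
    const bin = Math.round(heightNew / a.area_height);
    counts.set(bin, (counts.get(bin) || 0) + 1);
  }
  return counts;
}

// Illustrative values only, not taken from the real export.
const sampleRatios = ratioHistogram([
  { area_y1: 0, area_y2: 100, area_height: 100 }, // ratio 1
  { area_y1: 50, area_y2: 255, area_height: 100 }, // ratio 2.05
  { area_y1: 10, area_y2: 205, area_height: 100 }, // ratio 1.95
  { area_y1: 0, area_y2: 310, area_height: 100 }, // ratio 3.1
]);
console.log([...sampleRatios.entries()]);
```

If the export were consistent, every area would land in bin 1.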

What could be the reason for that? What calculation artifact generates this?

I replicated the analysis for the width using area_x2. Although there are differences between the calculated width and the one in the raw JSON areas export file, they are not that important.

PS: I see what may be a typo at https://github.com/montera34/pageonex/blob/b5da84415a87a9c5e49b5728aed0aa498106e83f/app/assets/javascripts/coding.js#L195 inside the function enableDragging.

ha.width = ui.size.width;
ha.x2 = ha.x1 + ha.width;
ha.height = ui.size.height;
ha.y2 = ha.y2 + ha.height;

In the last line I'd expect to see ha.y2 = ha.y1 + ha.height; but I guess this has no influence on the discussion above.
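To make the difference between the two versions of that line concrete, here is a small sketch that applies each one to a single resize. It assumes ha is a plain object with x1, y1, x2, y2, width and height, and that size mimics the ui.size object used in the quoted lines. The function names and the numbers are hypothetical. The sketch only compares the two assignments in isolation; it does not show whether this code path affects the exported area_y2.

```javascript
// Compare the quoted update with the expected one for a single resize.
// Assumption: `ha` is a plain object with x1, y1, x2, y2, width, height;
// `size` mimics the ui.size object passed by the resize handler.
function resizeAsQuoted(ha, size) {
  const out = { ...ha };
  out.width = size.width;
  out.x2 = out.x1 + out.width;
  out.height = size.height;
  out.y2 = out.y2 + out.height; // as quoted: adds to the previous y2
  return out;
}

function resizeAsExpected(ha, size) {
  const out = { ...ha };
  out.width = size.width;
  out.x2 = out.x1 + out.width;
  out.height = size.height;
  out.y2 = out.y1 + out.height; // expected: offset from the top edge
  return out;
}

// Illustrative values only.
const before = { x1: 10, y1: 100, x2: 110, y2: 300, width: 100, height: 200 };
const newSize = { width: 120, height: 250 };
const quoted = resizeAsQuoted(before, newSize);
const expected = resizeAsExpected(before, newSize);
console.log(quoted.y2 - quoted.y1, quoted.height); // 450 250
console.log(expected.y2 - expected.y1, expected.height); // 250 250
```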