juliema / label_reconciliations

Code for reconciling multiple transcriptions for a label
MIT License

Support for polygon data type #78

Open denslowm opened 1 year ago

denslowm commented 1 year ago

In support of LightningBug and others using polygon data from the Zooniverse I am requesting support for polygon type data.

This has two general areas that need attention

  1. Support for polygon type data in unreconciled outputs
  2. Support for polygon type data in reconciled outputs

Reconciliation outputs for polygon will require some scoping out and discussion in order to understand what will work best for end users.

cc @PmasonFF

PmasonFF commented 1 year ago

LightningBug uses a looping structure to reuse a drawing task (polygon) once for each label in the image (normally two to six labels). This, and the polygon tool itself, is going to complicate matters.

Firstly, we now know that in the new FEM projects, looping over drawing tools causes significant issues - basically it collects all the drawings into one group in the order they are drawn, with no certain correlation to the label they refer to (since parts of any loop can be redone in any order using the back button, because persist annotations is on by default). There are no plans to change this on the Zooniverse end; see the discussion here: https://www.zooniverse.org/talk/18/2881417 This means any development we make here is likely for PFE projects only and will be made obsolete quickly.

Secondly, the reuse of tasks introduces complications in labeling the flattened responses, even in a PFE project with persist annotations off. Each usage has to be parsed based on the order it was done, and linked back to a specific label based on responses to other tasks done in each loop, such as the label number. If I understand correctly, reconcile.py is currently not set up to deal with looping through any of the supported drawing tools - currently drawing outputs are labeled by the task number, with no allowance for multiple uses of that task.
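To make the order-based parsing concrete, here is a minimal sketch of flattening a looped drawing task. The task keys T1/T2 and the annotation layout are illustrative assumptions, not the actual LightningBug export format, and pairing by order is only trustworthy in a PFE project with persist annotations off, as described above:

```python
import json

# Hypothetical annotation list from one classification, following the usual
# Zooniverse "task"/"value" convention; the real looped export will differ.
annotations = json.loads("""
[
  {"task": "T1", "value": "Label 1"},
  {"task": "T2", "value": [{"x": 10, "y": 10}, {"x": 20, "y": 10}, {"x": 15, "y": 20}]},
  {"task": "T1", "value": "Label 2"},
  {"task": "T2", "value": [{"x": 30, "y": 30}, {"x": 40, "y": 30}, {"x": 35, "y": 40}]}
]
""")

# Pair each reuse of the drawing task (T2) with the most recent label
# response (T1), relying purely on the order in which tasks were done.
flattened = []
current_label = None
for ann in annotations:
    if ann["task"] == "T1":
        current_label = ann["value"]
    elif ann["task"] == "T2":
        flattened.append({"label": current_label, "polygon": ann["value"]})
```

Each flattened row then carries its own label, which is exactly the linkage that breaks down once the back button can reorder the loop.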

Personally (and your mileage may vary) I think NfN should consider NOT adding flattening of drawing tools inside reconcile.py at all - keep reconcile.py as simple, clean reconciliation of question and transcription tasks. Flattening of drawing tools tends to become very project/workflow specific - for example, the Bee Bonanza measurement workflows require flattening of line drawings. We already have reconcile.py sorting lines into those meant for capturing scale and those meant for measurement - using regex on the specific task labeling used in Bee Bonanza, and very specific to that project. Further, only the lengths of the scaled measurement lines are reconciled, using MEAN, which is vulnerable to outliers. The code as it sits would be useless in a workflow where the line location meant something (as the polygons in LightningBug will be used). Rather, I think these specific workflows/tool uses should be flattened in more project-specific scripts, with reconcile.py used to reconcile any transcription/question tasks on the flattened file using the existing .csv capability.

And then there are the issues of flattening polygon data outputs. Polygons are drawn from a starting point in line segments until the figure is closed (with a double click or by clicking on the starting point). There is no control over the point chosen to start, so one cannot simply aggregate by taking the mean of the points in order (which could also run clockwise or counterclockwise). Aggregating polygons, even those drawn precisely around defined areas, requires a clustering algorithm such as DBSCAN. All the points for each polygon drawn by each volunteer are fed to the clustering algorithm, and hopefully clusters are found for each vertex, with outliers ending up as "noise".
If the volunteer can draw multiple polygons, this clustering can result in vertices of one polygon clustering with vertices of some other polygon - this requires some means of keeping the individual polygons' points separate and matching up which polygon is which between volunteers. This can be handled two ways: either the workflow forces an order on the polygon drawings (currently not possible with FEM), or each polygon is reduced in a way that lets it be separated from the others - for instance by calculating the centroids (or centers of mass) of all the polygons and clustering those, so that polygons marking the same thing cluster together and can be separated across volunteers.
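As a rough illustration of the centroid idea, here is a stdlib-only sketch that reduces each polygon to the mean of its vertices and greedily groups polygons whose centroids fall within a distance threshold. This is a stand-in for a real DBSCAN run, and the `eps` value and the vertex-mean shortcut are illustrative assumptions, not project code:

```python
import math

def centroid_mean(polygon):
    # Quick stand-in: mean of the vertices. Fine for separating
    # well-spaced polygons, though not a true center of mass.
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def group_polygons(polygons, eps=50.0):
    """Greedy, DBSCAN-like grouping of polygons by centroid distance.
    Each group should collect the versions of one label drawn by
    different volunteers, keeping each polygon's vertices intact."""
    groups = []  # list of (representative centroid, [member polygons])
    for poly in polygons:
        c = centroid_mean(poly)
        for rep, members in groups:
            if math.dist(c, rep) <= eps:
                members.append(poly)
                break
        else:
            groups.append((c, [poly]))
    return groups
```

With the polygons grouped this way, the vertex-level clustering can then be run separately inside each group, which avoids vertices of one label contaminating another's clusters.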

Then there are some issues specific to the Lightning bug project other than the looping issue.
The subjects for this project are composite images of 15 views of the sample and its multiple labels from different angles. Given the goals of the project, the desired output is the actual consensus polygon that best outlines each label, in the view which shows it best. Various volunteers may choose different views for any particular label based on their opinion of which view is best. Even after separating out individual labels, we may be faced with points that cluster into the vertices of multiple polygons in different views - depending on the clustering algorithm, some points may simply end up as noise, and the issue can perhaps be handled by increasing the retirement limit and accepting only the most popular version of each label's polygon.

Another issue is the fact that the various labels in these subjects are vertically stacked, and the best view is then at an oblique angle where one can see the most of the label below another. This causes two problems. Firstly, the "far edge" vertices of the lower label may coincide quite closely with the "near edge" vertices of the upper label - requiring that we know exactly which points to cluster with which. Secondly, the volunteer may draw a four-point polygon outlining just the part of the label that can be seen, but more likely they will draw the polygon where they assume the edge of the label is when one corner of it is hidden under an upper label. Or worse, they will use six or more points to outline partially obscured labels. This is going to complicate the clustering and impact the accuracy of the consensus polygons.

I do not feel these issues are terminal, but I do think it would be better to resolve them in a customized script for the project, and use reconcile.py for reconciling the transcription part after the appropriate label transcriptions have been aggregated.

denslowm commented 1 year ago

Thanks @PmasonFF. There is a lot to unpack here, and I fear that most of it is beyond my technical abilities / understanding of the data structure.

We may have to wait for Rafe to weigh in when he is able.

PmasonFF commented 1 year ago

More issues specific to the workflow for Lightning bugs:

As it stands you can give two labels the same number - the looping workflow does not have any means to prevent this. You could possibly avoid the looping and have many more tasks asking the same questions in the same order, except the label in question would be indicated in the task questions (example: Which view shows the second label from the top best?). The downside is that the group of tasks would have to be repeated as many times as there are possible labels in the worst case (at least six times from what I have seen). This would make flattening easier, since each instance of the tasks would have a new task number!

There is nothing in the polygon drawing tool that prevents "intersecting" polygons - i.e. one can criss-cross line segments so a rectangle looks like two triangles pointed at each other. I think this would really mess with the AI, and I know it would bomb a centroid calculation if we ever had to use that to cluster polygons. Again, there is no way to prevent this in the workflow.
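For what it's worth, a criss-crossed polygon can at least be detected after the fact. This sketch flags "bow-tie" polygons by testing non-adjacent edges for proper crossings (collinear overlaps are not handled; this is an illustration, not part of reconcile.py):

```python
def _segments_cross(a, b, c, d):
    # Proper-intersection test via orientation signs; adjacent edges,
    # which legitimately share a vertex, are excluded by the caller.
    def orient(p, q, r):
        v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
        return (v > 0) - (v < 0)
    return (orient(a, b, c) != orient(a, b, d)
            and orient(c, d, a) != orient(c, d, b)
            and orient(a, b, c) != 0 and orient(a, b, d) != 0)

def is_self_intersecting(polygon):
    """Return True if any two non-adjacent edges of the closed polygon
    (a list of (x, y) vertices) properly cross each other."""
    n = len(polygon)
    edges = [(polygon[i], polygon[(i + 1) % n]) for i in range(n)]
    for i in range(n):
        for j in range(i + 2, n):
            if i == 0 and j == n - 1:
                continue  # first and last edges share the start vertex
            if _segments_cross(*edges[i], *edges[j]):
                return True
    return False
```

Flagged polygons could simply be dropped before clustering, which is cheaper than trying to repair them.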

I am beginning to really detest this polygon tool! I do have code that finds the centroid for non-intersecting polygons if we ever need it.

def find_centroid(polygon):
    """Return the centroid (center of mass) of a non-self-intersecting
    polygon, given as a list of (x, y) vertices, using a signed-area
    triangle fan anchored at the first vertex."""
    p0 = polygon[0]  # any arbitrary vertex can be used as the anchor
    centroid = [0.0, 0.0]
    area = 0.0
    for j in range(len(polygon)):  # step through polygon defining triangles (p0, pj, pj-1)
        p1 = polygon[j]
        p2 = polygon[j - 1]
        f = ((p1[0] - p0[0]) * (p2[1] - p0[1]) - (p2[0] - p0[0]) * (p1[1] - p0[1]))/2  # triangle area
        area += f  # total area
        centroid[0] += 2/3 * ((p1[0] + p2[0])/2 - p0[0]) * f  # sum center of mass for triangle areas about p0
        centroid[1] += 2/3 * ((p1[1] + p2[1])/2 - p0[1]) * f
    centroid[0] = centroid[0] / area + p0[0]  # find final moment arm wrt p0, and translate to origin
    centroid[1] = centroid[1] / area + p0[1]
    return centroid

and it is simple enough to convert the dictionary format for the polygons returned by Zooniverse into a list of vertices - example:

polygon_dict_a = [{"x": 2494.43, "y": 2377.52}, {"x": 2658.78, "y": 2313.61}, {"x": 2699.1, "y": 2360.79},
                  {"x": 2533.24, "y": 2426.98}]
polygon_a = [(pt['x'], pt['y']) for pt in polygon_dict_a]
print(find_centroid(polygon_a))
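As a quick sanity check of the routine above, a unit square should give (0.5, 0.5) regardless of which vertex the fan is anchored at (the function is repeated here only so the snippet runs on its own):

```python
def find_centroid(polygon):
    # Signed-area triangle fan anchored at the first vertex (same
    # algorithm as above, copied for self-containment).
    p0 = polygon[0]
    centroid = [0.0, 0.0]
    area = 0.0
    for j in range(len(polygon)):
        p1 = polygon[j]
        p2 = polygon[j - 1]
        f = ((p1[0] - p0[0]) * (p2[1] - p0[1]) - (p2[0] - p0[0]) * (p1[1] - p0[1])) / 2
        area += f
        centroid[0] += 2 / 3 * ((p1[0] + p2[0]) / 2 - p0[0]) * f
        centroid[1] += 2 / 3 * ((p1[1] + p2[1]) / 2 - p0[1]) * f
    centroid[0] = centroid[0] / area + p0[0]
    centroid[1] = centroid[1] / area + p0[1]
    return centroid

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(find_centroid(square))  # -> [0.5, 0.5]
```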

rafelafrance commented 1 year ago

There's a lot to take in here.

You are a volunteer, and this project is actually quite hard. Additionally, I've done something similar before, so it may make more sense for me to do this.

Let me start with: the reconciler is anything but clean. It's more of a set of contained messes... but I digress.

I have some experience working on the label finder expedition. That one had people drawing rectangles around labels and identifying the type of each label, and I encountered many of the same issues you mention. It was easier given that it was always rectangles and not general quadrilaterals, but overlapping boxes and people drawing rectangles in weird ways did happen. Some of the issues I encountered:

I wound up throwing out a lot of data. In hindsight, I threw out too much. My mistake was treating everything as a box rather than as a set of vertices.

I can dig up my old code, clean it up (it really really needs it), adapt it to general quads, and put it in the reconciler.

PmasonFF commented 1 year ago

OK, keep in mind polygons vs rectangles - there are eight ways to draw the exact same four-vertex polygon (four points one can start at, and two directions to go), whereas with rectangles there was only one way to draw any given rectangle - it makes aggregation much more of a pain.
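Those eight drawings can be collapsed to a single canonical ordering when the coordinates match exactly - useful for reasoning about the problem, though real volunteer polygons will never share exact vertices and still need clustering. A sketch:

```python
def canonical_form(polygon):
    """Reduce a polygon's vertex list to a canonical ordering so that the
    'eight ways' of drawing the same quad (4 start points x 2 directions)
    all compare equal: try both directions, rotate each so it starts at
    its lexicographically smallest vertex, and keep the smaller tuple."""
    candidates = []
    for pts in (list(polygon), list(reversed(polygon))):
        start = pts.index(min(pts))  # rotate to the smallest vertex
        candidates.append(tuple(pts[start:] + pts[:start]))
    return min(candidates)
```

For example, a square started at opposite corners and drawn in opposite directions reduces to the same tuple, so identical drawings can be deduplicated before any clustering.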

Can I ask, why the reluctance to do a script specific to this project? Why do you want it inside reconcile.py? The only part of reconcile.py that applies to this workflow is the reconciliation of the transcripts; all the clustering and sorting out of the looping will be unique to this workflow and unlikely to ever be used the same way again.

PmasonFF commented 1 year ago

BTW, I am a volunteer, but I have done data analysis for well over 30 projects, including several with point, line, circle, ellipse, and rectangle drawing tools. I have contributed code and other assistance to another 30 or more projects. I find the issues are never in the coding, but in handling uncertainty in the results. I think it is safe to say that the large majority of the several thousand hours I have put in have been directed at finding ways to compensate for the inventiveness of classifiers and their ability to find unique ways to mess up the data :) Unfortunately that is an integral part of citizen science.