isi-vista/adam: Abduction to Demonstrate an Articulate Machine

Linking scene images to linguistic output #1092

Closed by lichtefeld 2 years ago

lichtefeld commented 2 years ago

Currently there's no great way to link an object in the scene RGB image to the linguistic output produced by ADAM. We have a few signals we can track to coordinate this, and several proposals have been made for solving the problem. I'm aiming to summarize the problem, the signals available to address it, and their corresponding solutions. @gracemcclurg, it would be great if you'd review the information here, pick a solution, and provide a more explicit implementation plan for me to comment on. After I've commented on the implementation, it'll be one of your next tasks to implement.

Problem

Take a look at the following image. Both objects are described as 'cube'. While color could be used here to link each object's description to its location in the image, that signal fails in a scene where both objects are similar in color.

[Screenshot: scene with two objects, both described as 'cube' (Screen Shot 2022-01-27 at 5 28 23 PM)]

Possible Solutions

The first option is a square bounding box around each object with either a) a label, b) color alignment, or both. This is the semi-standard overlay approach. We should have (or can confirm with ASU is available) the raw stroke extraction information, which is based on the pixels in the image, to align to each object's center. We can then use some of the stroke information to rebuild a square bounding box, as in the sketch below.
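
As a rough illustration, here is a minimal sketch of the bounding-box overlay. It assumes each object's strokes arrive as a list of `(n_points, 2)` arrays of `(x, y)` pixel coordinates; the actual format of ASU's stroke extraction output may differ, and `stroke_bounding_box` / `overlay_labeled_boxes` are hypothetical names, not existing ADAM functions.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

def stroke_bounding_box(strokes):
    """Compute an axis-aligned box (x_min, y_min, width, height) from a
    list of stroke point arrays, each of shape (n_points, 2)."""
    points = np.concatenate(strokes, axis=0)
    x_min, y_min = points.min(axis=0)
    x_max, y_max = points.max(axis=0)
    return x_min, y_min, x_max - x_min, y_max - y_min

def overlay_labeled_boxes(image, objects):
    """Draw a labeled, color-coded box per object over the scene image.

    `objects` is assumed to be a list of (label, strokes, color) tuples,
    with `strokes` in the format described above.
    """
    fig, ax = plt.subplots()
    ax.imshow(image)
    for label, strokes, color in objects:
        x, y, w, h = stroke_bounding_box(strokes)
        ax.add_patch(Rectangle((x, y), w, h, fill=False,
                               edgecolor=color, linewidth=2))
        ax.text(x, y - 5, label, color=color, fontsize=10, weight="bold")
    ax.axis("off")
    return fig
```

Taking the min/max over all stroke points gives the tightest axis-aligned box, so the overlay needs nothing beyond the stroke coordinates themselves.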

The second option is similar, but rather than drawing a square bounding box we overlay the extracted strokes themselves, all normalized to a single color that is unique to each object. This also makes the stroke feature extraction clearer on the primary display; switching to a per-object view is a further enhancement that shows the results of normalization. A sketch of this variant follows.
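
A minimal sketch of the stroke-overlay variant, under the same assumed stroke format as above; `overlay_strokes` is likewise a hypothetical name, and the color palette is arbitrary.

```python
import itertools
import matplotlib.pyplot as plt

def overlay_strokes(image, objects):
    """Render each object's extracted strokes in a single per-object color.

    `objects` is assumed to map an object label to its list of
    (n_points, 2) stroke arrays.
    """
    colors = itertools.cycle(["red", "cyan", "yellow", "lime", "magenta"])
    fig, ax = plt.subplots()
    ax.imshow(image)
    for (label, strokes), color in zip(objects.items(), colors):
        for stroke in strokes:
            ax.plot(stroke[:, 0], stroke[:, 1], color=color, linewidth=2)
        # Anchor the label near the first point of the first stroke.
        x0, y0 = strokes[0][0]
        ax.text(x0, y0 - 5, label, color=color, fontsize=10, weight="bold")
    ax.axis("off")
    return fig
```

Because every stroke of an object is drawn in that object's unique color, the description-to-location link survives even when the objects themselves are similarly colored.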