isi-vista / adam

Abduction to Demonstrate an Articulate Machine
MIT License

Online Decode Pipeline for Objects #1155

Closed lichtefeld closed 1 year ago

lichtefeld commented 2 years ago

The following is my current outline of the components that need to be implemented for a decode pipeline. First I've listed how I think the live demo should be structured at runtime, and then each needed component has its own heading below with further details.

Demo Outline

The input is a PNG image, which needs to have some form of segmentation applied before stroke preprocessing can occur. To avoid making multiple command-line calls in the live demo, my thought is that the demo script should just invoke a subprocess for preprocessing and wait for it to complete before continuing on to ADAM decode. Once stroke preprocessing has occurred, the scene can be loaded into ADAM and decoded as normal into a specific directory for live demos. Once 'complete' (or a similar message) has appeared in the terminal, I'll refresh the available scenes in the UI and select the newly decoded scene.
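A minimal sketch of that flow, assuming hypothetical entry points (preprocess_image.py and adam_decode.py) and flags that stand in for whatever the real scripts end up being:

```python
import subprocess
import sys
from pathlib import Path


def run_demo(image_path: Path, demo_dir: Path) -> None:
    # Step 1: segmentation + stroke preprocessing, run as a blocking subprocess
    # so we only continue once the preprocessed scene exists on disk.
    subprocess.run(
        [sys.executable, "preprocess_image.py", str(image_path), "--output-dir", str(demo_dir)],
        check=True,
    )
    # Step 2: ADAM decode over the freshly preprocessed scene, writing into the
    # live-demo directory the UI reads from.
    subprocess.run(
        [sys.executable, "adam_decode.py", "--scene-dir", str(demo_dir)],
        check=True,
    )
    print("complete")  # the cue to refresh the available scenes in the UI


if __name__ == "__main__":
    run_demo(Path(sys.argv[1]), Path(sys.argv[2]))
```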

Demo Script

Ideally this demo runtime environment can load a pickled ADAM learner for decoding, so we can train a learner offline and not need to run any training prior to the demo. This would allow us to potentially invoke the entire pipeline for each image to be processed, but I think being able to load the learner configuration once and then pass in files to process would be better for a demo. This requires the components detailed below:

Preprocessing

The preprocessing work is in progress for integration by @spigo900. An ideal goal is a single Python script entry point for preprocessing an individual image rather than an entire curriculum at once.

Segmentation

@spigo900 noted that the current preprocessing requires a segmentation file. We won't have that from just a PNG, so what is the easiest way to acquire one? Alternatively, can stroke extraction run successfully directly on the PNG images?

Decode

We'll need to have a trained ADAM learner for live decode. It may be nice to have models trained with different amounts of data to investigate any notable differences in decode output, but mostly I think we should use a model trained on essentially all available training data for decode.

Testing

I plan to take ~3-5 images of a chair & table from my apartment as basic test images. I can look through the other objects we have to see what else I could easily acquire an image of. This will hopefully allow us to test this implementation well before the live demo.

@spigo900 @sidharth-sundar @marjorief -- Any comments or concerns with this general layout?

spigo900 commented 2 years ago

General thoughts

I think driving most of the work using a script is probably a good idea. Preprocessing has an annoying number of steps at the moment (strokes->train->decode train->decode test). So it sounds like this script would validate the path inputs, run the preprocessing steps, then run ADAM decode? I think that makes sense. The fewer commands we have to enter live, the better. 🙂

Preprocessing

Segmentation

On segmentation, I'm not sure of the easiest way, though I found a little info. I did a quick search but didn't find (probably for lack of the right terms) an obvious "plug and play" model setup. From my search, though, it sounds like the search term for what this model does is either "semantic segmentation" (which labels by type of object, e.g. all balls in the scene are marked the same) or "instance segmentation" (which labels by instance)? I don't know what we're using now, but given the files are named semantic_(stuff).png, I'd guess semantic segmentation. A complication is that it looks like some such models rely on a separate object detection step (to "find the bounding boxes"), which would make things more annoying.

This leads to some questions for @shengcheng and @blakeharrison-ai. IIRC @blakeharrison-ai and @shengcheng do the segmentation (for the semantic_(stuff).png images) using a model that's integrated with the simulator -- is that true? Assuming it is, how easy would it be to run the same segmentation setup/model outside of the simulator on a raw PNG image? Also, is "semantic segmentation" the right "search term" for a model that does what this step is doing, or is there more to it?

Missing strokes

For some images we might not end up with any strokes, hence no decode and no features.yaml. I think as long as we have a features.yaml, even if it's trivial, ADAM shouldn't crash when decoding. As part of #1151 I'll need to fix the modified GNN code to always save at least a trivial features.yaml -- as things stand it is not doing that.

Speed

Speed-wise, I think we're okay for a small-scale demo. Stroke extraction is relatively slow and may be the slowest part right now, but probably doable for small numbers of images. On my local machine it takes ~5 seconds to extract strokes from one image. I haven't checked GNN decode speed locally yet. Hazarding a guess, I think we could handle about 50 images in ~5 minutes or 10 images in 1 minute.

Larger scales (>50 images) would be harder to do live. I don't know that we need to worry about it given there probably won't be time to go over 50 samples in detail in a demo regardless.

If necessary, though, I think we could solve the "lots of samples, live" issue by extracting/decoding in parallel with the display. If we can do (partial) display in tandem with decode then this might be fine, although we'd have to check how workable that is. The GNN decode saves its output "live," and I believe ADAM saves its decodes that way too. The first complication is that ADAM would probably need to wait on stroke extraction to complete -- as is, it would see "no features here" and crash. This is solvable with some trouble to plumb through the logic for configuring the retry time/max retries, or however we set up the coordination. The second complication is that we still have to worry about missing strokes. Finally, I'm not sure whether the UI can currently read the output in a partially complete state: if we don't have info.yaml, or if the number in it doesn't match the current number of sample directories, that might cause problems for the UI.
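A rough sketch of that retry logic, assuming a hypothetical helper, a per-scene feature file named features.yaml, and an arbitrary interval and retry cap:

```python
import time
from pathlib import Path


def wait_for_features(situation_dir: Path, retry_seconds: float = 2.0, max_retries: int = 30) -> bool:
    """Poll until stroke extraction has written the scene's feature file, or give up."""
    for _ in range(max_retries):
        if (situation_dir / "features.yaml").exists():
            return True
        time.sleep(retry_seconds)
    return False
```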

Minor issue: Curriculum name & loading

One minor issue we'll want to address is "ADAM (phase3_load_from_disk) refuses to load curriculum names we haven't predefined in the code." So we'll have to either change that or define in advance a "whatever DARPA gives us at the demo" curriculum name to run on. Either should be easy to do though.

Questions

Setting up the UI to load new data automatically -- that sounds like a good idea. Does this require changes to the backend?

When you reference new learners, I assume that's referring to the experimental & observational affordance learners?

spigo900 commented 2 years ago

Actually, update on semantic segmentation -- it looks like torchvision already comes with some pretrained semantic segmentation models. So assuming that is the thing we want, we should have some easy options there.

ETA: on second thought, it looks like the selection of objects those models recognize does not quite match our own selection of objects, which might be an issue.
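For reference, a minimal sketch of running one of torchvision's pretrained semantic segmentation models on a single image; the model choice and preprocessing here are illustrative, and older torchvision releases take pretrained=True instead of weights=:

```python
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models.segmentation import deeplabv3_resnet50

# Pretrained on a fixed class set (COCO/VOC categories), which does not fully
# overlap with our object vocabulary -- see the caveat above.
model = deeplabv3_resnet50(weights="DEFAULT")
model.eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("chair.png").convert("RGB")
batch = preprocess(image).unsqueeze(0)

with torch.no_grad():
    output = model(batch)["out"][0]  # (num_classes, H, W) per-pixel class scores
segmentation = output.argmax(0)      # per-pixel class IDs
```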

sidharth-sundar commented 2 years ago

I think as a whole this is a good and feasible idea, and I'll provide more specific comments + ask some clarifications for my sake below.

Demo Script

I think the idea of a single entry point for testing data makes sense, as well as your points on loading the pickled model and training it preemptively. Just as a clarification, you mentioned the following above:

Once the 'complete' (or similar message) has appeared in the terminal I'll just refresh the available scenes in the UI and select the newly decoded scene.

Part of the checklist includes allowing the UI to detect new data, so which of the following are we aiming to do?

  1. Keep the UI as is, and just refresh the page when we know the demo script has fully executed
  2. Have the UI present some notification that new data is available once the demo script terminates, then manually refresh the UI (this is what @lichtefeld suggests, from my understanding)
  3. Same as (2), but instead of manually refreshing, the UI can detect and load new data on the spot after the script terminates.
  4. If I'm understanding what @spigo900 proposed correctly, have a UI which provides a partial display that updates as new observations become available/as the script progresses in parallel.

I'd personally lean towards (2) as a shorter-term solution if we have the time to implement it, since it seems much neater than (1), and towards (4) if the partial visuals provide substantial insight into how ADAM recognizes objects.

Potential Issue: Formatting for UI

Besides the above, I think there are some formatting complications to work out. Specifically, the current pipeline requires that all training and testing data is presented in the following format in an input curriculum: train/situation_0, train/situation_1, ..., test/situation_0, test/situation_1, .... Then, once parsed, the individual situation directories are preserved and moved to a user-specified destination folder, along with an associated post_decode.yaml for each situation directory. From my understanding of the UI, it preloads all existing data, and then the user specifies a situation number (0, 1, 2, ...) to load from the available testing data. The UI then searches for the directory data/learners/${learner_type}/${curriculum_name}/test_curriculums/${test_curriculum_name}/situation_${num}. These four inserted variables -- learner_type, curriculum_name, test_curriculum_name, num -- are the only things the UI user can specify. Without matching that folder format, the UI won't be able to properly detect and present the data. Since the current plan is to invoke the script via command line for each individual input .png file, we'd need a way to determine those four variables. The first three can probably either be pre-specified in the script or provided via some config file, but currently we don't have a way to track num between script calls.

I think this ties in to @spigo900's mention of the UI not reading data in a partially complete state. Specifically,

If we don't have info.yaml or if the number doesn't match the current number of sample directories that might cause problems for the UI.

So we'll probably have to keep track of the number of sample directories with an incrementally changing info.yaml file, and in the demo script, we'll have to repeatedly load it to determine the next situation number.

In hindsight after writing this, this shouldn't be too big of a concern. The biggest complication here is that rather than testing all data in a single run, we test data in separate runs.
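As an alternative (or sanity check) to reading info.yaml, the demo script could determine the next situation number by scanning the existing situation_N directories. A rough sketch of such a helper, hypothetical and assuming the test-curriculum layout described above:

```python
import re
from pathlib import Path


def next_situation_num(test_curriculum_dir: Path) -> int:
    """Return the next free situation index under .../test_curriculums/<name>/."""
    pattern = re.compile(r"situation_(\d+)")
    existing = [
        int(match.group(1))
        for child in test_curriculum_dir.iterdir()
        if child.is_dir() and (match := pattern.fullmatch(child.name))
    ]
    return max(existing, default=-1) + 1
```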

Preprocessing

Since I haven't worked on the preprocessing myself, I don't have too much to comment on with regards to segmentation. However, I think we could have a problem if an object is missing strokes entirely.

Potential Issue: Missing Strokes

From running the ADAM experimentation script and altering some test-data feature.yaml files, I couldn't get the ADAM scripts to get past the decoding phase, since the pipeline expects some stroke extraction output to be there. I might be altering the feature files incorrectly, in which case this point is moot, but from my understanding, even with a trivial feature.yaml, we'll need to change the extraction script to appropriately handle parsing feature files without stroke extraction.

Testing

In terms of other objects we can take pictures of, there are some around the office we could get (mug, cup, floor, and paper come to mind).

Questions

The biggest question has less to do with the pipeline itself and more to do with the input data. This will probably be answered in evaluating this pipeline, but is there a significant difference between running on simulated object data and running on real-life object data?

For instance, with floor, the simulation has a rectangular, flat shape in an empty space, but if one were to take a picture of a floor in the real world, the floor may span the whole picture, in which case I'm not sure how exactly stroke extraction would function.

Also consider window. In the simulation, the picture is of a window entirely by itself, but a picture of a window might also show what's on the other side of the window, adding noise to the image.

spigo900 commented 2 years ago

Noting briefly here that I'm asking @sidharth-sundar to handle (semi-?)live reloading in the UI: "A configuration with the UI that can detect new available data for display." I don't think I have more details to add about that task.

ETA: Re: Sid's comment above, I think solution (2) is fine?

spigo900 commented 2 years ago

@sidharth-sundar brings up a good point about structure. We probably want to assume in the script that you want to put the output where the backend/UI can find it. I think we can make the learner type, train curriculum, and test curriculum optional arguments with some reasonable defaults -- for learner type, for example, we'll probably only ever use the simulated integrated learner, and I think we can assume a predetermined test curriculum name, say m6_objects_live_demo. As for num -- I think we can just take all of the PNGs at once in some arbitrary order and leave it implicit.
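For instance, a rough sketch of those arguments; the --learner-type and --train-curriculum defaults below are placeholders, and only m6_objects_live_demo comes from the suggestion above:

```python
import argparse
from pathlib import Path

parser = argparse.ArgumentParser(description="Live-demo decode driver.")
parser.add_argument("images", nargs="+", type=Path,
                    help="Input PNG files, processed in the order given (num stays implicit)")
parser.add_argument("--learner-type", default="simulated-integrated-learner",
                    help="placeholder default; in practice likely always the simulated integrated learner")
parser.add_argument("--train-curriculum", default="phase3-train",
                    help="placeholder default for the offline training curriculum")
parser.add_argument("--test-curriculum", default="m6_objects_live_demo",
                    help="predetermined test curriculum name for the demo")
args = parser.parse_args()
```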

Re: strokes, I had been thinking of outputting objects: [] when missing strokes. @sidharth-sundar, have you found this to cause problems?

sidharth-sundar commented 2 years ago

Re: strokes, I had been thinking of outputting objects: [] when missing strokes. @sidharth-sundar, have you found this to cause problems?

Ah, I didn't know the appropriate output formatting. It runs fine with objects: []
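For the record, a minimal sketch of writing that trivial feature file (the thread uses both feature.yaml and features.yaml as the name; whichever the pipeline actually expects applies):

```python
import yaml
from pathlib import Path


def write_trivial_features(situation_dir: Path) -> None:
    # Empty object list so downstream decode sees a valid file instead of a missing one.
    (situation_dir / "feature.yaml").write_text(yaml.safe_dump({"objects": []}))
```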

lichtefeld commented 1 year ago

Closed by #1159