sneheshs / NATSGD_Comm2LTL

NatSGD Dataset Human Communication to LTL Benchmarking (https://www.snehesh.com/natsgd/)
MIT License

About experiment setting and metrics #1

Closed Xianqi-Zhang closed 7 months ago

Xianqi-Zhang commented 7 months ago

Hi, thank you for sharing. The simulation environment and multimodal data are really cool, but I have some questions about the experiments.

  1. For Fig. 5 (the learning framework), it seems that the environment state is not used (only human gesture and audio are given as input). Does this mean that in your problem formulation and experiments, the proposed NatSGD is not used?
  2. Other works often set up tasks and use success rate as a metric. But in this paper, the problem formulation is "converts a pair of speech and gestures into an LTL formula", and JaqSim and SpotScore are used as metrics. Why is it set up like this?
  3. Are the evaluation criteria set appropriately? (a) First, it may be better to have more redundant actions that can still complete the task than fewer actions that cannot complete the task. (b) Second, the sequence of actions may be very important for completing a task, but this does not seem to be reflected in the evaluation metrics.

Thank you for any reply.

Best Regards.

sneheshs commented 7 months ago

Thank you for the questions. Please see our response inline.

Hi, thank you for sharing. The simulation environment and multimodal data are really cool, but I have some questions about the experiments.

  1. For Fig. 5 (the learning framework), it seems that the environment state is not used (only human gesture and audio are given as input). Does this mean that in your problem formulation and experiments, the proposed NatSGD is not used?

The NATSGD dataset and this paper demonstrate its usefulness and offer opportunities to the Human-Robot Interaction community for future work on many other tasks, of which using the environment state is one possibility.

RE “environment state is not used”: Many possible downstream tasks can be built on NATSGD, and using the environment state along with human gestures and audio is one of them. In this paper, Figure 5 shows one very specific benchmark task (as indicated by the title of Section IV, MULTIMODAL HUMAN TASK UNDERSTANDING): using speech and gesture to generate linear temporal logic (LTL). This expands on prior work that translates speech alone into LTL. The focus here is for the robot to generate all the subtasks in a particular order to accomplish the task based on the human command. This is a problem of translating multimodal commands into a logic formulation, which the current benchmark accomplishes.

RE “NatSGD is not used”: The NATSGD dataset was collected using the NATSGD simulator, and participants interacted with a teleoperated robot. As stated in Appendix B (lab setup), robots are slow, and it is very challenging even for state-of-the-art robots to perform complicated tasks such as cutting onions. The goal of this benchmark task was to collect data on natural human commands for task understanding. The NATSGD dataset therefore consists of human commands in the form of speech and gestures, which are used in the problem formulation, experiments, and benchmark tasks.

  2. Other works often set up tasks and use success rate as a metric. But in this paper, the problem formulation is "converts a pair of speech and gestures into an LTL formula", and JaqSim and SpotScore are used as metrics. Why is it set up like this?

This is the problem formulation for our benchmark task, MULTIMODAL HUMAN TASK UNDERSTANDING, explained in Section IV. Success-rate metrics do not directly match this benchmark task. The NatSGD dataset considers real-world-level challenging tasks, which include a set of subtasks that can be temporally ordered in different ways. Therefore, the core aspect of the task-understanding problem is to predict a proper temporal organization of all necessary subtasks given a human's multimodal command (a pair of speech and gestures). Since we use Linear Temporal Logic (LTL) to represent the temporal organization of those subtasks, the core objective of evaluation becomes the performance of predicting the LTL formula from a pair of speech-gesture commands. Looking at previous works on LTL generation or prediction, we found that Jaccard Similarity [49, 50] and Spot Score [47, 48] are reasonable evaluation metrics. Please see Section V.A of the paper for more details regarding the Jaccard Similarity and Spot Scores.
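For intuition, here is a rough Python sketch of a token-level Jaccard similarity between a predicted and a ground-truth LTL formula. The exact tokenization and normalization used in the paper may differ, and the formulas below are simplified placeholders:

```python
def jaccard_similarity(pred_ltl: str, gt_ltl: str) -> float:
    """Jaccard similarity between the token sets of two LTL formulas.

    Tokens are whitespace/parenthesis-delimited symbols (operators and
    atomic propositions); the paper's exact tokenization may differ.
    """
    def tokens(formula: str) -> set:
        return set(formula.replace("(", " ").replace(")", " ").split())

    pred, gt = tokens(pred_ltl), tokens(gt_ltl)
    if not pred and not gt:
        return 1.0
    return len(pred & gt) / len(pred | gt)


# A prediction that only swaps the pick-up order shares all tokens with
# the ground truth, so its token-level Jaccard similarity is 1.0.
print(jaccard_similarity(
    "X ( G ( C_Knife U Knife ) & G ( C_Tomato U Tomato ) )",
    "X ( G ( C_Tomato U Tomato ) & G ( C_Knife U Knife ) )",
))
```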

  3. Are the evaluation criteria set appropriately? (a) First, it may be better to have more redundant actions that can still complete the task than fewer actions that cannot complete the task.

Can you please clarify? We are not sure we understand your question. What do you mean by redundant actions?

(b) Second, the sequence of actions may be very important for completing a task, but this does not seem to be reflected in the evaluation metrics.

Do you mean a sequence of subtasks? We use the SPOT score, which measures the equivalency of logical formulas: it uses the logical operators to mathematically compute whether two LTL formulas are equivalent. A complete LTL formula captures the importance of the order of each subtask. For example, when cutting an onion, the order of picking up the knife or the onion may not matter (assuming two hands are used, which is the case in this dataset), so an OR operator ‘|’ is used. If the order does matter, the NEXT operator ‘X’ together with the AND operator ‘&’ is used. So the LTL formula for cutting a tomato might look something like this:

X (
  G (
    X ( G ( C_Tomato U Tomato ) & G ( Tomato_OnTopOf_CB & Tomato_CloseTo_CB ) )
    |
    X ( G ( C_Knife U Knife ) & G ( C_Knife U Knife_FarFrom_CT ) )
  )
  & G ( C_Knife U Knife_OnTopOf_Tomato )
  & G ( ( C_Knife & C_Tomato ) U Tomato_Pieces )
)

In this case, SPOT considers picking up the knife first and then the tomato to be equivalent to picking up the tomato first and then the knife. However, if cutting the tomato appears before the picking up, SPOT considers the formulas not equivalent, so we use this method to mark a prediction as incorrect when the required order is not maintained.
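To make this concrete, here is a minimal sketch using SPOT's Python bindings; it assumes `spot.formula` and `spot.are_equivalent` are available in your SPOT installation, and the propositions (pick_knife, pick_tomato, cut) are simplified placeholders rather than the dataset's actual annotations:

```python
import spot  # SPOT's Python bindings

# Two phrasings that only swap the pick-up order of the conjuncts are
# logically equivalent, so SPOT treats the prediction as matching.
f1 = spot.formula("(F pick_knife & F pick_tomato) & X (F cut)")
f2 = spot.formula("(F pick_tomato & F pick_knife) & X (F cut)")
print(spot.are_equivalent(f1, f2))   # expected: True

# Moving 'cut' before the pick-ups changes the temporal structure, so the
# formulas are no longer equivalent and the prediction is marked incorrect.
f3 = spot.formula("F cut & X (F pick_knife & F pick_tomato)")
print(spot.are_equivalent(f1, f3))   # expected: False
```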

For more details on SPOT, please see “Duret-Lutz, A., Poitrenaud, D.: SPOT: an Extensible Model Checking Library using Transition-based Generalized Büchi Automata. In: Proc. of MASCOTS’04, pp. 76–83. IEEE Computer Society Press (Oct 2004)”.


Xianqi-Zhang commented 7 months ago

Thank you very much for your reply.