mayacakmak / se2

Control interfaces for manipulating SE2 configurations
BSD 2-Clause "Simplified" License

Create MTurk HIT #15

Closed mayacakmak closed 3 years ago

mayacakmak commented 4 years ago

Basic instruction, link to study, place to enter completion code.

mayacakmak commented 4 years ago

https://requester.mturk.com/create/projects/1636568

mayacakmak commented 4 years ago

Almost ready to launch...

[Screenshot: 2020-08-27, 9:57 PM]
mayacakmak commented 4 years ago

Looks like all issues are addressed, I'll go ahead and publish the first small batch (we can call it pilot ;)). Posting this message as sort of a time stamp for when the 'real' data begins (because I just tested a few things that probably got logged). I'll get the list of completion codes tomorrow so you can verify the data @KaviMD. Fingers crossed!

mayacakmak commented 4 years ago

Hmm, only two participants so far. I wonder if we're not paying enough ($2 per HIT). Average Time per Assignment: 37 minutes 20 seconds, which is longer than I expected.

kavidey commented 4 years ago

Yeah, that is a bit weird. Is it possible that MTurk is out of date or something? I looked at Firebase last night before you posted the HIT, and there were only 7 user IDs under /users; now there are 22. Of those 22, only 9 filled out the questionnaire (I checked the dates on the questionnaires and they were all from today), so maybe not everyone who fills out the questionnaire submits the completion code?

There are also several users who never completed a cycle or the questionnaire (or completed multiple cycles but not the questionnaire). I might add some code to log the URL of the page whenever a new session is started, so we can see if users are getting stuck or reloading a specific page.

mayacakmak commented 4 years ago

Oh you're right.. it had logged me out and was showing the screen from last night I guess. Here are the responses.

[Screenshot of responses: 2020-08-28, 11:52 AM]
mayacakmak commented 4 years ago

The rest might be people checking out the study and deciding whether they want to do it..

mayacakmak commented 4 years ago

Now: Average Time per Assignment: 34 minutes 12 seconds. Still longer than I thought, but it's good if people are spending time on the questionnaire especially.

Next we might need to figure out "approval criteria" and "study inclusion criteria" based on what we see in the data.

kavidey commented 4 years ago

I went through the data for those users. The cycles look good, and when I ran the data analysis it also had decent results. The distribution across interfaces was not uniform, though (because some people tried it out but never finished):

- dgb7CWy7rSNWIAZHXEYDRAt3O2b2: arrow.press/release
- 6qW2fw5bT5hLvsAUO7MaR82ZqOu1: panel.click
- 3gz7ZfC2YVWEL83csaHrnHLqet42: arrow.press/release
- 7TIOuYmg42MtgXKvf7YcfCYX9l52: targetdrag.press/release
- 403NovdIENcvmKY9PoakqivpyP53: drag.click
- pfZvthWm1iY4Cm11co2LgBEnimj1: targetdrag.click
- OEoHdni4GFXI6iKVbbW1pOGNXps2: arrow.click
- J2eQ4D9j9wVgj0oNOpZLWCenWnh1: arrow.click
- iTccwNLgo2OCqpG2BR565G1iUtA2: drag.press/release

Two interfaces were skipped, and two were done twice
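
A quick sanity check of that distribution, as a small Python sketch using the assignment list above (the set of nine interfaces is the one used throughout the study):

```python
from collections import Counter

# Interface assignments from the pilot list above.
assignments = [
    "arrow.press/release", "panel.click", "arrow.press/release",
    "targetdrag.press/release", "drag.click", "targetdrag.click",
    "arrow.click", "arrow.click", "drag.press/release",
]

# All nine interfaces in the study (see the time stats later in this thread).
all_interfaces = {
    "arrow.click", "arrow.press/release", "drag.click", "drag.press/release",
    "target.click", "targetdrag.click", "targetdrag.press/release",
    "panel.click", "panel.press/release",
}

counts = Counter(assignments)
print("Counts:", dict(counts))
print("Skipped:", sorted(all_interfaces - counts.keys()))
print("Done more than once:", [i for i, c in counts.items() if c > 1])
```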

There was a bug in logging that I just fixed: in between tests, while the "I'm ready!" button was still on screen, the eeLogger interval was unnecessarily logging poses to an empty cycle in the database.

Given that the cycles looked good in the replay window, and that the data processing was able to run correctly, I think that everything is working and ready for an actual test.

The only thing that we might want to change is how we pick interfaces (we could switch to just randomly picking, though that might have the same uneven distribution as going through them one by one).

mayacakmak commented 4 years ago

Nice, thanks for checking! Ah yes, we didn't think about the uneven distribution problem. I think we can just run a certain number this way and then change the distribution based on what we get, until we have evened them out. Or alternatively, @KaviMD, perhaps implement something in instructions.html that samples the next interface based on how many more tests are needed for each interface--we can update those numbers manually as we verify data (such that if we have enough data for an interface, that number will be zero and the interface will not appear anymore).

Nice catch on the bug! Does that make the previous data invalid or is it something we need to clean up manually?

LMK if you're editing anything and when I should release the next round of HITs.

kavidey commented 4 years ago

I can definitely get started on an update to instructions.html for sampling interfaces based on how many more tests we need.

From what I can tell, all of the data we collected previously is still valid. The bug only resulted in us collecting extra EE pose data, which doesn't affect the data analysis or the admin replay interface. (I also ran the data processing script on it and the results look reasonable: https://github.com/mayacakmak/se2/blob/master/data-processing/data_processing.export.ipynb).

kavidey commented 4 years ago

I implemented some code that tracks how many runs of each interface still need to be done, are in progress, or are completed.

When a user loads instructions.html, it automatically selects the interface that has been completed the fewest number of times. When selecting an interface, the number of users currently working on that interface is subtracted from how many more times that interface needs to be completed, and the interface with the largest remaining number is picked. That user is then added to an "in progress" list, where they stay until they submit the questionnaire or until a specific amount of time has passed and we assume that they have given up.
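
The real implementation is client-side code in instructions.html backed by the Firebase state counts, but the selection rule itself boils down to something like this Python sketch (the dictionaries and the cutoff value are stand-ins, not our actual data structures):

```python
import time

GIVE_UP_AFTER_S = 30 * 60  # assumed cutoff; see the timing discussion below

def pick_interface(needed, in_progress, now=None):
    """Pick the interface with the most outstanding completions.

    needed:      {interface_name: completions still needed}
    in_progress: {interface_name: [session start timestamps]}
    """
    now = time.time() if now is None else now
    remaining = {}
    for name, need in needed.items():
        # Only count users whose session started recently enough that we
        # assume they are still working on the study.
        active = sum(1 for t in in_progress.get(name, [])
                     if now - t < GIVE_UP_AFTER_S)
        remaining[name] = need - active
    # The interface with the largest outstanding count wins.
    return max(remaining, key=remaining.get)

# Hypothetical example:
needed = {"arrow.click": 3, "panel.click": 5, "drag.click": 5}
in_progress = {"panel.click": [time.time() - 60]}
print(pick_interface(needed, in_progress))  # -> drag.click
```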

While trying to calculate how long that time should be, I found something a bit weird with the MTurk timing. MTurk reported the average time as roughly 34 minutes. By looking at session start and questionnaire submit times, I generated an estimate for how long each session took, and my estimate came out to 24 minutes. These are the exact timings in minutes from each user:

| User | MTurk (min) | Us (min) |
| --- | --- | --- |
| 3gz7ZfC2YVWEL83csaHrnHLqet42 | 29 | 24 |
| 403NovdIENcvmKY9PoakqivpyP53 | 34 | 23 |
| 6qW2fw5bT5hLvsAUO7MaR82ZqOu1 | 44 | 12 |
| 7TIOuYmg42MtgXKvf7YcfCYX9l52 | 22 | 21 |
| J2eQ4D9j9wVgj0oNOpZLWCenWnh1 | 43 | 31 |
| OEoHdni4GFXI6iKVbbW1pOGNXps2 | 34 | 27 |
| dgb7CWy7rSNWIAZHXEYDRAt3O2b2 | 28 | 23 |
| iTccwNLgo2OCqpG2BR565G1iUtA2 | 30 | 26 |
| pfZvthWm1iY4Cm11co2LgBEnimj1 | 39 | 36 |
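
A minimal sketch of the comparison, using the numbers from the table above:

```python
from statistics import mean

# Times in minutes, in the same user order as the table above.
mturk = [29, 34, 44, 22, 43, 34, 28, 30, 39]
ours = [24, 23, 12, 21, 31, 27, 23, 26, 36]

diffs = [m - o for m, o in zip(mturk, ours)]
print("MTurk mean: %.1f min" % mean(mturk))  # 33.7
print("Our mean:   %.1f min" % mean(ours))   # 24.8
print("Per-user difference (min):", diffs)   # [5, 11, 32, 1, 12, 7, 5, 4, 3]
```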

I'm not sure why there is an average difference of 10 minutes between the time I measured and what MTurk measured. It's possible that some users didn't submit the completion code immediately. (I'm not sure exactly how MTurk measures the time spent, so there could easily be other factors I'm not thinking of.)

It seems to me like the correct cutoff time (to figure out when a user has given up) is around 30 minutes, but I definitely find it a bit weird that the times we measured don't match up with the ones from MTurk. Right now, the algorithm does not decrease the number of times that an interface needs to be completed when a questionnaire is submitted, but I can easily add that. (I'm not sure if we want to update that manually as we review results or automatically)

mayacakmak commented 4 years ago

Thanks @KaviMD! This solution sounds great. Let's go ahead and have it automatically reduce the number when a questionnaire is submitted. Make the number high enough so it doesn't hit zero if many people are "trying out" the study in parallel. We can then manually increase the numbers, as many as needed, if we decide to throw out some data. Let me know when it's ready and I'll start a new batch (let's say of ~100).

Hmm, good catch on the timing discrepancy. I wonder if Turkers do this on purpose to inflate their 'time on task' on the MTurk side. Not all participants did this (some differ by just 1-3 minutes). We should probably be suspicious of ones who spend a very small amount of time (as measured by us), e.g. see if their open-ended answers are intelligible.

kavidey commented 4 years ago

I updated the algorithm so it automatically reduces the count. I also set the count for each interface to 30 (that is 270 total, which should be more than enough if we are starting with a batch of 100): https://console.firebase.google.com/u/0/project/accessible-teleop/database/accessible-teleop/data/~2Fstate

The whole system for selecting interfaces is getting pretty complex, so I ran several tests locally to try to catch possible bugs. I didn't find anything in my testing, so I think everything is ready for the larger MTurk test.

mayacakmak commented 4 years ago

Published a batch of 100 now! fingers crossed

kavidey commented 4 years ago

It looks like the first set of results are coming in. There was a timing bug in the code that prevented about 80% of users from moving on from the testing page to the survey. I believe it is now fixed, but anyone who loaded the testing page before I was able to upload the fix might not be able to access the survey.

I have canceled the batch so that no additional money or time is wasted while we are verifying the fix. We can test with a smaller number of workers tomorrow and make sure that everything is correct.

kavidey commented 4 years ago

I'm going through the data and verifying everything (the .json file that I downloaded from Firebase was 4.5 million lines long, so my data analysis code is running very slowly, and the admin page for viewing cycles wouldn't even load).

There was a bug in questionnaire.html where we weren't collecting data from textareas or inputs (so only the checkboxes). That is now fixed; however, it means that for the 46 people we collected data from, we are missing age, open-ended feedback on the interfaces, and data on video game usage and mouse type. I'm sorry for not catching this earlier; when we ran the initial 10-person test, it looked like all of the survey data was there. I have put together a modified survey with just the open-ended questions that we could send to the workers with an additional bonus payment: https://mayacakmak.github.io/se2/questionnaire_open_ended

I also looked through the distribution across interfaces, and it looks okay. This is how many times each interface was completed: 4 4 5 7 6 5 4 9 2. I think the reason for the uneven distribution is that all 55 users started the study within half an hour, so the code for detecting whether users stopped never got a chance to run. I think the only real way to fix that would be to somehow ask MTurk to limit the speed at which workers can accept HITs. We could also manually start several smaller batches of 10 across a few hours.
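
If we ever post batches through the API instead of the requester web UI, spreading them out could be as simple as a throttled loop. This is only a sketch assuming boto3 and an already-created HIT type; the HIT type ID is a placeholder and the study URL is illustrative:

```python
import time
import boto3

client = boto3.client("mturk", region_name="us-east-1")

# Placeholders: real values would come from our requester account / study link.
HIT_TYPE_ID = "YOUR_HIT_TYPE_ID"
STUDY_URL = "https://mayacakmak.github.io/se2/"  # illustrative entry point

question_xml = """<ExternalQuestion
  xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>{}</ExternalURL>
  <FrameHeight>900</FrameHeight>
</ExternalQuestion>""".format(STUDY_URL)

# Release 10 assignments per hour instead of 100 all at once,
# so the interface-balancing code has a chance to run.
for _ in range(10):
    client.create_hit_with_hit_type(
        HITTypeId=HIT_TYPE_ID,
        MaxAssignments=10,
        LifetimeInSeconds=24 * 60 * 60,
        Question=question_xml,
    )
    time.sleep(60 * 60)
```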

The results of the data analysis look reasonable

Time Stats

| Interface | Mean | Standard Deviation | Min | Max |
| --- | --- | --- | --- | --- |
| targetdrag.click | 5.635 | 4.722 | 1.349 | 32.843 |
| targetdrag.press/release | 7.179 | 8.696 | 2.878 | 74.126 |
| arrow.press/release | 9.398 | 4.568 | 1.843 | 30.828 |
| target.click | 8.032 | 12.341 | 2.263 | 153.183 |
| drag.click | 11.315 | 12.570 | 2.341 | 143.070 |
| arrow.click | 16.951 | 10.232 | 2.528 | 63.502 |
| panel.click | 19.963 | 10.458 | 4.728 | 61.130 |
| drag.press/release | 8.012 | 4.715 | 1.612 | 37.228 |
| panel.press/release | 17.458 | 19.535 | 2.275 | 276.550 |
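
For reference, a minimal pandas sketch of the kind of per-interface aggregation behind a table like this (the CSV name and column names are assumptions about the export from the data-processing notebook):

```python
import pandas as pd

# Assumed export: one row per cycle, with the interface name and cycle time.
cycles = pd.read_csv("cycles.csv")

stats = cycles.groupby("interface")["cycle_time"].agg(
    Mean="mean", StdDev="std", Min="min", Max="max"
)
print(stats.round(3))
```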

Time vs Target Distance

[plot image: time vs. target distance, one subplot per interface]

Should I go ahead, approve all of the results from yesterday, and send a message to the workers with a $1 or $0.50 bonus payment and a link to the second survey? A bonus of $0.50 would mean spending an extra $23 ($1 would be $46). I also don't know if there are issues with it having been too long since they initially completed the survey. Sorry again about not catching this earlier.

mayacakmak commented 4 years ago

Sorry, I'm seeing this only now. Bummer about the open-ended answers, but again, good job catching that early. Did you end up sending the follow-up survey? I think that might not work well with the data association etc., or might be more work than we need; if you haven't tried it, we can instead just remove that data. We had budgeted for up to 50 people per condition, and we should probably get to at least 20 (so throwing out around 50 total from all conditions is fine in the bigger scheme of things).

The data is very interesting.. next time can you plot with the x-axes scaled the same for each subgraph?

kavidey commented 4 years ago

That's fine, I didn't end up sending the follow-up survey. In terms of data association, I figured that we could just ask workers to input their Worker ID in the questionnaire and then match that up with the IDs from the initial survey. I'm happy to just remove their data though.

These are the updated graphs (keep in mind that this is not taking into account flexible position, flexible rotation, or the orientation of the target):

[graph image]

This is the same graph but with smaller x-axis values to remove outliers:

[graph image]

I'm looking into a way to graph the euclidean distance from the target to the EE, but normalized by flexible position/rotation so we can clearly see its effects. Let me know if there are any other graphs that would be helpful!

kavidey commented 4 years ago

This correlation matrix is one way of looking at how the flexibility of the target impacts cycle times:

[correlation matrix image]

In this case flexibility is a normalized sum of flexible rotation and flexible position.
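
A sketch of how a combined flexibility metric and its correlation with cycle time could be computed (column names are assumptions; the real computation lives in the data-processing notebook):

```python
import pandas as pd

cycles = pd.read_csv("cycles.csv")  # assumed per-cycle export

# Normalize each flexibility measure to [0, 1] and average them, mirroring
# "a normalized sum of flexible rotation and flexible position".
for col in ["flexible_position", "flexible_rotation"]:
    lo, hi = cycles[col].min(), cycles[col].max()
    cycles[col + "_norm"] = (cycles[col] - lo) / ((hi - lo) or 1.0)

cycles["flexibility"] = (
    cycles["flexible_position_norm"] + cycles["flexible_rotation_norm"]
) / 2

print(cycles[["flexibility", "cycle_time"]].corr())
```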

I'm currently double checking all of the data from the previous study to make sure that everything is working, but we should be ready to run the next study fairly soon.

mayacakmak commented 4 years ago

Nice! Removing the outliers was a good idea, we can see the interesting patterns better in the second graph. Some quick observations/notes:

mayacakmak commented 4 years ago

Now looking at correlation matrices.. I agree it makes sense to look at correlation.. in fact, we'll probably want to plot the fitted line on the X-vs-time graphs and report correlation (how well the line is actually fitting) for each of them. Again, for now, I would not try to combine "xy" and "theta" metrics into one metric, but we could do that later if everything looks similar for the separate measures.

mayacakmak commented 4 years ago

Great, we can start a new round soon (let's try to hit 20 per condition) then look at these graphs with the cleaned up data.

kavidey commented 4 years ago

I have gone through the data and I think that everything is ready to run a new round.

We might not want to start all 180 at once though. Because MTurk starts everyone within the first 20-30 minutes after posting, little of the code that I wrote to balance interfaces has a chance to take effect (it can only count someone as having given up after 30 minutes). If there is a way to spread it out, either through MTurk or by starting multiple smaller rounds of 10 or 20, that might help even out the distribution.

We could also start 180 right now and then run a second round just to balance everything out.

mayacakmak commented 4 years ago

Great! I just released the next 100 HITs. Let's see how that goes and we can collect more in smaller batches after that.

kavidey commented 4 years ago

It looks like all 100 have been completed. Aside from 3 people who submitted their worker ID instead of their UID as their completion code, all the data is there. (I assume that all the data for those 3 users is still in our database; we just don't have an easy way to link the workers up with their data.)

In terms of the interface distribution, it looks really good: 11 11 12 10 11 12 11 11 12. That technically adds up to 101 instead of 100; I'm still looking into exactly what happened.

I am still updating and running the data analysis right now. I might create a new GitHub issue for that just to keep this one from getting too cluttered.

mayacakmak commented 4 years ago

Awesome! Could the extra 1 be a test that we ran, or someone who did everything but didn't submit? We can still approve all HITs I guess, but perhaps double check the data entries for unclaimed UIDs -- since they didn't pay attention to the instructions to get the code at the end of the survey, they might have missed other things.

So should I go ahead and release another batch (e.g. try to hit 20 or 25)? Will it be easy to combine the data?

kavidey commented 4 years ago

We should be good to release another batch. All of the data from the previous batch is still on firebase, so it should be extremely easy to combine them once we have the list of new UIDs.

Do we need to worry about the same users doing the study twice? I don't know if MTurk will handle that or not.

mayacakmak commented 4 years ago

Hmm, good point.. I think it ensures a unique worker ID per batch, but I'm not sure about across batches. Let's check about that with others.

kavidey commented 4 years ago

I don't know if there is an easier way to do it, but it seems like we could create a custom qualification that excludes anyone who has already completed the study: https://blog.mturk.com/tutorial-best-practices-for-managing-workers-in-follow-up-surveys-or-longitudinal-studies-4d0732a7319b
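
For reference, a rough sketch of that approach with boto3 (the qualification name and worker IDs are placeholders; the linked tutorial walks through the same idea via the requester UI):

```python
import boto3

client = boto3.client("mturk", region_name="us-east-1")

# 1. Create a qualification that marks workers who already did the study.
qual = client.create_qualification_type(
    Name="SE2 study - already participated",
    Description="Assigned to workers who completed a previous batch.",
    QualificationTypeStatus="Active",
)
qual_id = qual["QualificationType"]["QualificationTypeId"]

# 2. Assign it to everyone from the earlier batches.
previous_workers = ["WORKER_ID_1", "WORKER_ID_2"]  # placeholders
for worker_id in previous_workers:
    client.associate_qualification_with_worker(
        QualificationTypeId=qual_id,
        WorkerId=worker_id,
        IntegerValue=1,
        SendNotification=False,
    )

# 3. When creating the next batch, require that this qualification
#    does NOT exist for the worker.
exclude_previous = {
    "QualificationTypeId": qual_id,
    "Comparator": "DoesNotExist",
    "ActionsGuarded": "DiscoverPreviewAndAccept",
}
# ...passed as QualificationRequirements=[exclude_previous] when creating the HIT.
```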

mayacakmak commented 4 years ago

Yes, I see that's also what Amal and Nick did for their studies. I'll go ahead and do that for the next round.

mayacakmak commented 4 years ago

Noting this here, in case it explains an outlier: Ti70UEMLCFNHJi9Hhzojkm1iwCx2|was delayed from responding to one of the trials due to computer issue

mayacakmak commented 4 years ago

Another one, they might have watched the video for a different interface:

uzWPwecAL8gOsYysuNPjqiLeNrJ3|I watched the instructional video and then went on to do the study, but it did not load properly. So I clicked out of the app, clicked the link on the HIT page again, and this time it worked. But since I had already watched the instruction video on my previous attempt to make it work I skipped the video on the second time when it fully loaded. I just wanted you to know that I did watch the video in case it shows up in your analysis that I did not.

mayacakmak commented 4 years ago

Okay, just created the qualification and published a batch of 90 HITs for new workers!

kavidey commented 4 years ago

All 90 have been completed and processed. I uploaded the files to the same google drive folder: https://drive.google.com/drive/u/0/folders/1R89qRGApJjWt0EQz45dtSLZaQNc5mKGz

The distribution looks great: 22 21 22 21 22 21 21 21 21. There is a total of 192, which makes me wonder if there is something we're not noticing, because there has been exactly 1 extra completion in both batches.

When I tried to download the .json file so that I could process it and generate a new .csv file, Firebase lagged out. I think this was because of the amount of data we are storing. [screenshot] I was able to download the file manually using wget and the URL in my browser. When I tried to search through the file, just as a sanity check for some specific UIDs, VS Code crashed. I was able to run my json-to-csv script without too much trouble, and search through everything once it was in csv form.
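
For what it's worth, the json-to-csv step can also be done without ever opening the export in an editor; a minimal standalone sketch (the field names under each user record are assumptions about our schema):

```python
import csv
import json

# Load the full Firebase export. For a file this large, a streaming parser
# such as the ijson package would avoid holding everything in memory at once.
with open("accessible-teleop-export.json") as f:
    data = json.load(f)

with open("users.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["uid", "num_cycles", "has_questionnaire"])
    for uid, record in data.get("users", {}).items():
        writer.writerow([
            uid,
            len(record.get("cycles", {})),   # assumed field name
            "questionnaire" in record,       # assumed field name
        ])
```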