CM3Turk opened this issue 7 years ago
As mentioned on the home page of the site, the recommended field will not be moderated. It's the only purely subjective field in the review form and exists to give an immediate overview of the reviewer's own opinions and impressions of the HIT. People are free to use it in whatever manner they feel is the most useful to others. Imposing arbitrary rules on subjective fields is exactly the sort of thing we're trying to avoid.
As such, I don't feel unpaid screeners are really relevant here. If someone had a good experience with a HIT despite an unpaid screener, they should be free to give a positive recommendation. Likewise, if someone had a poor experience due to an unpaid screener, they should be free to give a negative recommendation.
What will end up happening with unpaid screeners is that frustrated people will rate $0 for the reward and X seconds for the time, unless there are rules saying not to do that, and I'd bet those rules will cause drama, considering that dinging pay is allowed on the old site. Recommended might as well be the fairness field and $/hr the pay field.
That's definitely a valid concern. I think this may be a bigger issue during the transition from legacy to 2.0 than a longer term thing.
It already states in the rules that data manipulation and falsifying data are prohibited, and people intentionally modifying reward and/or completion time values to suit their own narrative are doing both.
I don't think it's necessarily falsifying data, though. If someone goes through a whole HIT and gets no code at the end, even though the MTurk side explicitly says it needs a code, that's 10 minutes with no reward. Sure, it might be a broken HIT, but it might not be; there are requesters out there who do this intentionally.
It's no different with unpaid screeners. What's the difference between an unpaid 2-minute screener at the beginning of the HIT and a paid 2-minute screener that pays $0.01, broken out into a separate qual HIT? One's going to get rated at $0.30/hr and drag the score down; the other won't.
On the legacy site, the pay scale is unmoderated and the fairness scale is heavily moderated. On 2.0 it's the opposite. At a minimum it's going to cause confusion, and I don't think a blanket "no data manipulation" rule is going to cover it.
Just an example: https://turkopticon.ucsd.edu/reports?id=A1ODQTEISI4V1G
I guarantee you people are going to want to rate pay on the new site in this case.
I see your point and I understand why people may feel motivated to do that, but on the other hand, why should those who get screened out be able to affect the pay rate? Imagine a HIT that pays $25/hr for people who don't get screened out. If that rate gets artificially dragged down to $5/hr, then it's no longer accurate for anyone. Neither party received anywhere near $5/hr. Sure, perhaps the requester should structure the screening process differently, but they are actually paying very well and the data should accurately reflect that.
If reviewers are allowed to artificially manipulate data because they feel slighted, Turkopticon's data becomes increasingly inaccurate and unreliable. That destroys trust, and without trust, what good is a review system?
The reward section is not asking 'what is the amount you received on this HIT'. It states what the HIT pays, not what the reviewer actually received. It doesn't matter if someone was rejected or if they didn't finish the HIT for whatever reason. The reward does not change. Modifying the value so that it no longer represents the HIT, but rather the individual reviewer, is absolutely falsifying data, and it negatively affects data integrity. It's really no different from someone adding a review for a fake HIT that never existed.
There are other fields with which reviewers can express their dissatisfaction. In the new system, I imagine shady requesters will have very few recommendations (and possibly a high rejection and broken-HIT count where applicable).
I guess these are the very same issues that were faced on the legacy site. Most of the time, workers will just peek at the TO script for a quick overview of the HIT/requester and make a judgment from that without reading through the recent reviews. I was seeing conflicting reviews on the legacy site once people started rating the "fairness" in the red for unpaid screeners, and not necessarily for rejections, and I was unable to make a good judgment from the script after that rule was allowed. This is why I really like how the new TO script summarizes the requester by displaying fields for the number of rejections, broken HITs, and pay per hour. I can get a clearer judgment from that than from the legacy script, which only showed the red, yellow, and green scales. I knew that the "recommendation" field was purely subjective, and I wasn't surprised when other workers started arguing with me about why I recommended some HITs. But I tried to explain that people would know how well the HIT paid based solely on the displayed pay field, without relying on the recommendation.
I agree that no matter the outcome of the HIT, such as not passing a screener, receiving a rejection, or hitting a broken HIT, the "pay field" should always display the "accurate" listed payment for the HIT. Rating the pay at $0 because of these issues would technically be manipulation, though the person is not lying about how much "they" were compensated; I wouldn't say it is "intentional" manipulation. But maybe having TO members initial a digital agreement about certain rules might help the understanding, and they wouldn't be able to use the site until the agreement page is signed. Then possibly suspend or delete accounts if, after being given warnings, they consistently break the rules and hurt the reputation of the site.
I can see the argument both ways on this. What's shown on the requester review page isn't a flat-out objective "Average Reward"; it's "$/hr", which is already skewed by a subjective time. The API splits this up; how it's interpreted is up to whoever is reading the API.
As far as:
"The reward section is not asking 'what is the amount you received on this HIT'. It states what the HIT pays, not what the reviewer actually received. It doesn't matter if someone was rejected or if they didn't finish the HIT for whatever reason."
What about bonuses? In that case it flips things the other direction. I don't think there should be any difference between "what the HIT pays" and "what the reviewer actually received". If you throw bonuses into the mix, one person gets a $100 bonus and everyone else gets stuck with a survey that paid $0.25 and took 20 minutes. If a reviewer can't state they received $0 for an unpaid screener, should someone be able to say they were the lucky one to get paid that $100 bonus and skew the $/hr way up? By that logic, no: the reward field is strictly for what's stated on MTurk as the reward.
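To put rough numbers on that (purely illustrative, using the figures above): a $0.25 survey that takes 20 minutes works out to about $0.75/hr, but a single review reporting $100.25 for the same 20 minutes works out to roughly $300/hr, which would pull the average way up even though nobody else received anything close to that.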
Also, regarding:
Imagine a HIT that pays $25/hr for people who don't get screened out. If that rate gets artificially dragged down to $5/hr, then it's no longer accurate for anyone. Neither party received anywhere near $5/hr. Sure, perhaps the requester should structure the screening process differently, but they are actually paying very well and the data should accurately reflect that.
This is a fundamental thing. I don't think a HIT that pays $25/hr to one person, while paying 100 people $0/hr for a 4-minute screener, should be rated as $25/hr. "They are actually paying very well" is not accurate at all. 100 people just worked for free. I think $5/hr is more accurate here than $25/hr.
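To put rough numbers on it (illustrative only, assuming the one qualified worker puts in about an hour at $25): the 100 screened-out workers contribute 100 × 4 = 400 unpaid minutes, so the whole pool earns $25 for roughly 460 minutes of work, which is about $3.26/hr; nowhere near $25/hr.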
Should https://turkopticon.ucsd.edu/reports?id=A39YZ5CCWDI0H7 be rated at $25/hr? I would wager a lot of people would say no.
Edit: The idea that a requester who abuses unpaid screeners is "actually paying very well" is contrary to the Guidelines for Academic Requesters that TO endorses on the civility guidelines page. http://wiki.wearedynamo.org/index.php/Guidelines_for_Academic_Requesters#Compensate_for_qualifier.2Fscreener_surveys
How do you think everyone using legacy TO would react if, all of a sudden, you couldn't rate the pay field for unpaid screeners, broken HITs, and rejections, and moderators now flagged and hid any reviews that did? This is essentially what we're talking about here.
Here's an idea that might please everyone. Add a single checkbox somewhere for forced returns, and allow people to enter a time for everything.
You could then use that to produce a "$/hr" estimate and a "Real $/hr" estimate that takes forced returns (because the HIT is broken, has an unpaid screener, etc.) and rejections into account.
The "Real $/hr" estimate would then be the "$/hr" estimate with the forced-return/rejection times and their $0 rewards mixed in.
Edit: You wouldn't even have to change the database or add a checkbox. The "I didn't finish the HIT" checkbox could be used for this purpose.
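To make the idea concrete, here's a rough sketch (hypothetical Python, not actual Turkopticon code; the field names and review structure are made up for illustration) of how the two estimates could be computed if each review carried a reward, a time, and a didn't-finish/forced-return flag:

```python
# Hypothetical sketch only: field names and structure are assumptions.

def hourly_estimates(reviews):
    """Return (standard $/hr, 'real' $/hr) for a list of review dicts."""

    def rate(subset):
        # Forced returns / rejections / unpaid screen-outs count as a $0.00 reward.
        total_reward = sum(0.0 if r["forced_return"] else r["reward"] for r in subset)
        total_hours = sum(r["seconds"] for r in subset) / 3600.0
        return total_reward / total_hours if total_hours else 0.0

    # Standard estimate: only reviews where the worker finished and was paid.
    paid_only = [r for r in reviews if not r["forced_return"]]
    # 'Real' estimate: every review counts, so unpaid time drags the rate down.
    return rate(paid_only), rate(reviews)


# Example: one worker finishes a $1.50 HIT in 20 minutes; two workers spend
# 4 minutes each on an unpaid screener and get screened out.
reviews = [
    {"reward": 1.50, "seconds": 1200, "forced_return": False},
    {"reward": 0.00, "seconds": 240, "forced_return": True},
    {"reward": 0.00, "seconds": 240, "forced_return": True},
]
standard, real = hourly_estimates(reviews)
print(f"$/hr: {standard:.2f}, Real $/hr: {real:.2f}")  # $/hr: 4.50, Real $/hr: 3.21
```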
I have had spats with people who disagree that an underpaid HIT should ever be "recommended". Though some HITs may be a little underpaid based on how quickly "I" can complete them, I sometimes "recommend" the HIT if it was very easy to complete, had a paid screener, has a quick auto-approval, has a responsive requester, has no technical issues, and so on ... What should the "recommendation" be based on, and what should it not be based on? Also ... would it be fair to not "recommend" a HIT if it contains an unpaid screener? That would be the same as the legacy-site rule allowing HITs that contained unpaid screeners to be rated in the red for "fairness". I ask because some people who completed the unpaid screener, qualified, and were then paid fairly would "recommend" the HIT, as opposed to those who did not qualify and weren't paid for the screener, who may "not recommend" it. That may lead to conflicting reviews of the HIT. I think a "Contains Unpaid Screener" field should be added to reviews to filter out this conflict.