assess ease performance on short essay responses

kern3020 commented 11 years ago

Hello,

I am taking “HarvardX: PH278x Human Health and Global Environmental Change”. Is this using discern/ease for the one short essay problem? If so, I am not impressed with the initial results. How can we assess discern/ease’s performance?

-jk

MILANKOVITCH CYCLE - SHORT ANSWER RESPONSE

Some have claimed that the carbon dioxide and temperature records presented in class refute the hypothesis that greenhouse gases, and in particular carbon dioxide, drive warming. Look closely at the time periods in which there is sudden temperature rise, such as those indicated by the arrows. In one sentence or two at most, please describe how the data presented in the graph below runs counter to the hypothesis that present day planetary warming is being driven by increased carbon dioxide concentrations?

answer: ‘This graph suggests that increase warming causes increases in CO2.’

system response: incorrect

0 points : The response does not explain what about the data presented in this graph runs counter to the hypothesis that present day planetary warming is being driven by increased carbon dioxide concentrations. To see more information about the correct answer, copy and paste into your address bar: https://studio.edx.org/c4x/HarvardX/PH278x/asset/Milankovitch_Answer.pdf

1 points : The response explains that periods of planetary warming in the past began prior to rises in carbon dioxide. To see more information about the correct answer, copy and paste into your address bar: https://studio.edx.org/c4x/HarvardX/PH278x/asset/Milankovitch_Answer.pdf

VikParuchuri commented 11 years ago

That question is using the EASE technology, wrapped by edx-ora, in order to score student responses. We are in the process of getting and evaluating data on how that is working. AI scoring will not work in all domains, and the construction of the rubric is very important in how students receive feedback.

In order to assess performance from a developer perspective, you can feed data into the test suite. We will be doing something similar with the data from this. Data is also available to course staff on how well the machine model is performing, and they have an option to grade the lowest-confidence responses to increase scoring accuracy.
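
As a rough illustration of what "assessing performance" could mean in practice, here is a minimal sketch that compares machine scores against human scores and picks out the least confident responses for hand grading. It is not the EASE or edx-ora API: the function names and the `confidence` key are assumptions made up for this example.

```python
from collections import Counter


def exact_agreement(human, machine):
    """Fraction of responses where the machine score equals the human score."""
    if len(human) != len(machine) or not human:
        raise ValueError("score lists must be non-empty and the same length")
    return sum(h == m for h, m in zip(human, machine)) / len(human)


def cohen_kappa(human, machine):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(human)
    observed = exact_agreement(human, machine)
    h_counts = Counter(human)
    m_counts = Counter(machine)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(h_counts[k] * m_counts[k] for k in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def lowest_confidence(responses, k):
    """Return the k responses with the smallest 'confidence' value (hypothetical key)."""
    return sorted(responses, key=lambda r: r["confidence"])[:k]
```

With human scores `[0, 1, 1, 0]` and machine scores `[0, 1, 0, 0]`, exact agreement is 0.75 and kappa is 0.5, which shows why raw agreement alone can look better than it is on a two-point rubric.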

kern3020 commented 11 years ago

To fix software, the developer needs to reproduce the problem. How can a test case be created for this behavior?

This presents another fundamental issue. What is the policy on sharing data sets?

On one hand, to understand and improve EASE, one needs the rubric, the data set used to train the model, and one or more test essays that were not scored as expected.

On the other hand, sharing all student data would be a violation of privacy.

VikParuchuri commented 11 years ago

Yup, you hit the nail on the head. We can run these test cases internally with student data, but we have yet to figure out how, if at all, that data can be shared externally.

kern3020 commented 11 years ago

Hello @VikParuchuri,

Suboptimal.

The test suite is based on a single dataset (Bo Pang and Lillian Lee’s 2004 dataset on Sentiment Analysis). What about staging an experimental website to expand on this? Essay questions would be posed and we would ask the public to respond. Consider a techie question. We could promote it on Slashdot or Stack Overflow. Since there are websites/forums for many topics, we could get responses on a wide variety of topics.

This would differ from production use in two ways.

1) It would be clear to the participants that we were going to use the results to improve EASE. Since developers need the data, it would be publicly available as part of the test suite.

2) The developers (not the instructors) would drive the topics. In production, this is of course absurd. However, developers, directly and via bug reports, have a good idea of EASE's weaknesses. This would allow us to focus on those weaknesses and fix them.

What do you think?

-jk

VikParuchuri commented 11 years ago

I like this idea. However, here is what I think are the ways to get data, in order of least to most effort:

1. Find other data online that can be used for validation.
2. Sufficiently anonymize edX data so that it can be used for validation (have to figure out if we can/how internally).
3. Automatically label unlabeled data by focusing on the source (maybe an article from the New York Times gets a 10/10, and an article from a forum gets a 5/10, etc).
4. Label some existing essay data, perhaps using Mechanical Turk.
5. Make an experimental website to gather data, socialize it, and gather sufficient quality responses to analyze.

I don't think that there will be much bandwidth to work on 5 right now.

Do you know of any other reasonable labeled or unlabeled data sets? I've used Mechanical Turk for labeling before, and it isn't perfect, but it is pretty good.

kern3020 commented 11 years ago

Hello @VikParuchuri,

Option 3 is intriguing. Let me elaborate to make sure I understand. A program to experiment with EASE might look like this:

1. Pose an essay question.
   - Select a score for “elite” essays.
   - Select a score for “independent blog” essays.
   - Define a rubric.
2. Select a collection of URLs for websites which have earned a good reputation (“elite”). Use them to train the model. The program will scrape the websites and score them.
3. Select a collection of URLs for independent blogs. Use them to train the model. The program will scrape the websites and score them.
4. Select a collection of URLs to test. EASE will be used to score them.
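
The labeling step in this outline could be sketched roughly as follows. This is only an illustration under stated assumptions: the tier names, the scores, and the host list are hypothetical, scraping is left out entirely, and nothing here calls EASE itself.

```python
from urllib.parse import urlparse

# Hypothetical scores per tier; the thread only suggests that "elite"
# sources score higher than independent ones.
TIER_SCORES = {"elite": 10, "independent": 5}

# Hypothetical example hosts, not a vetted list.
TIER_BY_HOST = {
    "www.nytimes.com": "elite",
    "example-blog.org": "independent",
}


def label_by_source(url, tier_by_host=TIER_BY_HOST, tier_scores=TIER_SCORES):
    """Return (tier, score) for a URL, or None if its host is not listed."""
    host = urlparse(url).netloc.lower()
    tier = tier_by_host.get(host)
    if tier is None:
        return None
    return tier, tier_scores[tier]


def build_training_set(documents):
    """Turn (url, text) pairs into parallel lists of texts and scores.

    Documents from unlisted hosts are skipped rather than guessed at.
    """
    texts, scores = [], []
    for url, text in documents:
        label = label_by_source(url)
        if label is not None:
            texts.append(text)
            scores.append(label[1])
    return texts, scores
```

The resulting texts and scores would then be the training input, with the test URLs held back and scored separately.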

Using a spreadsheet would allow instructors to define these inputs. The program could then parse and read the inputs from it. I have found this is a very useful way to get input from experts without programming experience.
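
A minimal sketch of reading such inputs, assuming the spreadsheet is exported as CSV with the columns `url`, `tier`, and `score` (the column names and the sample rows are hypothetical):

```python
import csv
import io


def read_experiment_rows(csv_text):
    """Parse CSV text with the assumed columns url, tier, score into dicts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        rows.append({
            "url": row["url"].strip(),
            "tier": row["tier"].strip(),
            "score": int(row["score"]),  # fails loudly on a non-numeric score
        })
    return rows


# Made-up sample of what an instructor's export might contain.
SAMPLE = """url,tier,score
https://www.nytimes.com/example,elite,10
https://example-blog.org/post,independent,5
"""

rows = read_experiment_rows(SAMPLE)
```

Keeping the expert-facing format to a flat table means the instructors never have to touch the code.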

-jk

PS Socializing data is a new concept for me. Could you provide a reference so that I can read up on it?

openedx-unsupported / ease

assess ease performance on short essay responses #44