nottombrown / rl-teacher

Code for Deep RL from Human Preferences [Christiano et al]. Plus a webapp for collecting human feedback
MIT License

only pretraining comparisons appear in the labeling interface #36

Open mixuala opened 6 years ago

mixuala commented 6 years ago

I got to this point following the RL-teacher Usage docs:

"Once you have finished labeling the 175 pre-training comparisons, we train the predictor to convergence on the initial comparisons. After that, it will request additional comparisons every few seconds."

I was able to use the human-feedback-api webapp to provide feedback on the 175 pre-training labels. After that, the agent began to learn from the pre-training feedback:

8900/10000 predictor pretraining iters... 
9000/10000 predictor pretraining iters... 
9100/10000 predictor pretraining iters... 
9200/10000 predictor pretraining iters... 
9300/10000 predictor pretraining iters... 
9400/10000 predictor pretraining iters... 
9500/10000 predictor pretraining iters... 
9600/10000 predictor pretraining iters... 
9700/10000 predictor pretraining iters... 
9800/10000 predictor pretraining iters... 
9900/10000 predictor pretraining iters... 
Starting joint training of predictor and agent

But joint training failed. The human-feedback-api webapp displayed only blank screens. When I checked the video URLs in a separate tab, I got an XML error message saying "The specified key does not exist".

At the same time, the teacher.py script continued to generate video samples and upload them to Google Cloud Storage:

Operation completed over 1 objects/14.4 KiB.                                     
Copying media to gs://rl-teacher-snappi/abb3e1ed-f78e-459d-bed8-a1865ed541b1-right.mp4 in a background process
Copying media to gs://rl-teacher-snappi/c21384b2-7395-49b5-b263-5200221a3a36-right.mp4 in a background process
Copying media to gs://rl-teacher-snappi/c21384b2-7395-49b5-b263-5200221a3a36-left.mp4 in a background process
Copying file:///tmp/rl_teacher_media/c21384b2-7395-49b5-b263-5200221a3a36-left.mp4 [Content-Type=video/mp4]...
Copying file:///tmp/rl_teacher_media/c21384b2-7395-49b5-b263-5200221a3a36-right.mp4 [Content-Type=video/mp4]...
Copying file:///tmp/rl_teacher_media/abb3e1ed-f78e-459d-bed8-a1865ed541b1-right.mp4 [Content-Type=video/mp4]...
\ [1 files][ 14.8 KiB/ 14.8 KiB]                                                
Operation completed over 1 objects/14.8 KiB.                                     
\ [1 files][ 15.8 KiB/ 15.8 KiB]                                                
Operation completed over 1 objects/16.1 KiB.                                     

Operation completed over 1 objects/15.8 KiB.                             

I can manually confirm that the media files exist in Google Cloud Storage.

I waited many minutes, refreshed the webapp, and even clicked "Can't tell" a few times, but the videos never reappeared after the (successful) pre-training.
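
A quick way to reproduce the error outside the webapp is to HEAD one of the uploaded clips directly. This is only a sketch: it assumes the webapp serves media from public URLs of the form https://storage.googleapis.com/<bucket>/<name>.mp4, which may not match your setup (the object name below is taken from the log above).

# Sketch: check whether an uploaded clip is reachable at its assumed public URL.
import urllib.error
import urllib.request

url = ("https://storage.googleapis.com/rl-teacher-snappi/"
       "c21384b2-7395-49b5-b263-5200221a3a36-left.mp4")
try:
    with urllib.request.urlopen(urllib.request.Request(url, method="HEAD")) as resp:
        print(resp.status)  # 200: the object exists and is publicly readable
except urllib.error.HTTPError as err:
    print(err.code)  # 404 corresponds to the "specified key does not exist" XML error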

nottombrown commented 6 years ago

What is the URL for the key that does not exist? Perhaps your human-feedback-api webapp doesn't know which bucket to look at.
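
One thing worth comparing is the bucket each process thinks it should use. A minimal sketch, assuming the bucket name is passed through the RL_TEACHER_GCS_BUCKET environment variable (adjust if your setup configures it differently):

# Sketch: run this in both the shell that launches teacher.py and the shell
# that runs the webapp; the values should match the bucket in the upload logs
# (gs://rl-teacher-snappi in your case).
import os

print("bucket:", os.environ.get("RL_TEACHER_GCS_BUCKET"))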

mixuala commented 6 years ago

I'm not an expert in Django yet, so I've just been hacking away. But it seems the problem is a mismatch in sort order between the process that records video segments (oldest first, i.e. order_by('+created_at')) and the way the human-feedback-api webapp displays segments (newest first, order_by('-created_at')).

I added the following hack and it seems to fix the problem. But I think...

# ./rl-teacher/human-feedback-api/human_feedback_api/views.py
# Imports this snippet relies on (the real views.py already has its own):
from datetime import timedelta

from django.db.models import Q
from django.utils import timezone

from human_feedback_api.models import Comparison

def _all_comparisons(experiment_name, comparison_id=None, use_locking=True):
    not_responded = Q(responded_at__isnull=True)

    cutoff_time = timezone.now() - timedelta(minutes=5)
    not_in_progress = Q(shown_to_tasker_at__isnull=True) | Q(shown_to_tasker_at__lte=cutoff_time)
    finished_uploading_media = Q(created_at__lte=timezone.now() - timedelta(seconds=25))  # Give time for upload
    ready = not_responded & not_in_progress & finished_uploading_media

    ## Hack: order by created_at ASC (same order as id), so the oldest comparisons,
    ## whose media has had time to finish uploading, are shown first.
    ascending = True
    if ascending:
        ## Sort by priority, then put OLDEST labels first.
        ## Note: this branch also drops the not_in_progress condition.
        ready = not_responded & finished_uploading_media
        return Comparison.objects.filter(ready, experiment_name=experiment_name).order_by('-priority', 'id')
    else:
        ## Original behavior: sort by priority, then put NEWEST labels first.
        return Comparison.objects.filter(ready, experiment_name=experiment_name).order_by('-priority', '-created_at')
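
To sanity-check the new ordering, something like this in the Django shell should list comparisons in the order the webapp will now serve them. It's only a sketch; 'my-experiment' is a placeholder for whatever experiment name your run uses.

# python manage.py shell
from human_feedback_api.models import Comparison

for c in (Comparison.objects.filter(experiment_name='my-experiment')  # substitute your experiment name
          .order_by('-priority', 'id')[:10]):
    print(c.id, c.created_at, c.responded_at)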

But I'm not exactly clear on how RL from human feedback is supposed to work. I'm running the experiments on an old MacBook Pro, so the available recorded video always lags behind the latest comparison, as shown by what's uploading in the logfile. I give feedback on 3-5 comparisons, then come back 10-20 minutes later for the next batch.

But it seems to me that the most recent comparison/video segments have the benefit of more Q-learning, so rating those comparisons would have a greater learning benefit. If I can only provide feedback on a few comparisons every 20 minutes, would I get better results by giving feedback on the most recent ones? Does the learning algorithm still work if I offer sparse feedback, or do I need to provide feedback for every comparison?

If yes, then I suppose it would be better to record and provide feedback on video segments from the most recent experiments first, right?