According to the WizardCoder paper and your excellent work, the "output"s (or "original answer"s) of seed_tasks are not used in Evol-Instruct. I don't think that is reasonable.
I also have a slight suspicion that the SFT dataset of WizardCoder contains HumanEval examples XD