google-research-datasets / great

The dataset for the variable-misuse task, used in the ICLR 2020 paper 'Global Relational Models of Source Code' [https://openreview.net/forum?id=B1lnbRNtwr]

Empty or non-candidate repair targets #1

Closed mallamanis closed 3 years ago

mallamanis commented 3 years ago

Some buggy examples (has_bug=True) have either an empty repair_targets field or repair_targets that do not appear in the repair_candidates field (see below for a list of such examples from train__VARIABLE_MISUSE__SStuB.txt-00000-of-00300).

(format: provenances json, repair_candidates, repair_targets)

[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "dahlia/libsass-python/sass.py", "license": "mit", "note": "license: bigquery_api"}}] repair_candidates=`[2, 12, 23, 41]`, repair_targets=`[15]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "kuri65536/python-for-android/python-modules/twisted/twisted/conch/client/knownhosts.py", "license": "apache-2.0", "note": "license: bigquery_api"}}] repair_candidates=`[2, 28, 124, 4, 38, 11, 120]`, repair_targets=`[20]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "saltstack/salt/salt/renderers/py.py", "license": "apache-2.0", "note": "license: manual_eval"}}] repair_candidates=`[54, 123, 138, 150, 12, 25, 2, 23, 36, 49, 64, 4, 102, 104, 106, 8, 110, 112, 114, 117]`, repair_targets=`[]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "openstack/pylockfile/lockfile/__init__.py", "license": "mit", "note": "license: bigquery_api"}}] repair_candidates=`[2, 40, 4, 42, 44, 85]`, repair_targets=`[13]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "ithinksw/philo/philo/models/fields/entities.py", "license": "isc", "note": "license: bigquery_api"}}] repair_candidates=`[2, 22, 30, 51, 15, 39, 61, 4, 43, 56, 58]`, repair_targets=`[]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "CiscoSystems/avos/openstack_dashboard/dashboards/project/vpn/views.py", "license": "apache-2.0", "note": "license: bigquery_api"}}] repair_candidates=`[14, 29, 31, 7, 2, 16, 23, 4, 27, 34]`, repair_targets=`[]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "ioflo/ioflo/ioflo/base/fiating.py", "license": "apache-2.0", "note": "license: manual_eval"}}] repair_candidates=`[2, 15, 21]`, repair_targets=`[]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "jacobian-archive/openstack.compute/tests/fakeserver.py", "license": "bsd-3-clause", "note": "license: bigquery_api"}}] repair_candidates=`[2, 13, 19, 22]`, repair_targets=`[]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "epmoyer/ipy_table/ipy_table.py", "license": "bsd-3-clause", "note": "license: manual_eval"}}] repair_candidates=`[2, 18, 12, 14, 21]`, repair_targets=`[]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "ClusterHQ/eliot/eliot/_validation.py", "license": "apache-2.0", "note": "license: bigquery_api"}}] repair_candidates=`[5, 52, 65, 9, 28, 44, 60, 11, 63, 7, 42, 54]`, repair_targets=`[18]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "feincms/feincms/tests/testapp/tests/test_extensions.py", "license": "bsd-3-clause", "note": "license: manual_eval"}}] repair_candidates=`[2, 10, 18, 23]`, repair_targets=`[]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "nii-cloud/dodai-compute/nova/virt/libvirt/firewall.py", "license": "apache-2.0", "note": "license: bigquery_api"}}] repair_candidates=`[4, 2, 18, 26, 37, 48, 54, 60, 66, 80, 96, 110]`, repair_targets=`[]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "amrdraz/kodr/app/brython/www/src/Lib/test/test_sys_settrace.py", "license": "mit", "note": "license: bigquery_api"}}] repair_candidates=`[30, 105, 2, 26, 101]`, repair_targets=`[7]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "trademob/anna-molly/test/test_sink.py", "license": "mit", "note": "license: bigquery_api"}}] repair_candidates=`[2, 32, 140, 153, 112, 122, 131, 151, 166, 16, 47, 148, 43, 45, 51, 67, 7, 70, 99, 114, 136, 178, 185, 62, 72, 90]`, repair_targets=`[106]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "marchon/Flask-API-Server/apiserver/authentication.py", "license": "mit", "note": "license: bigquery_api"}}] repair_candidates=`[4, 12, 46, 68, 2, 34, 38]`, repair_targets=`[15]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "marchon/Flask-API-Server/apiserver/authentication.py", "license": "mit", "note": "license: bigquery_api"}}] repair_candidates=`[4, 12, 46, 2, 34, 38, 68]`, repair_targets=`[15]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "nii-cloud/dodai-compute/nova/tests/api/openstack/test_images.py", "license": "apache-2.0", "note": "license: bigquery_api"}}] repair_candidates=`[2, 16, 28, 34, 5, 22, 25]`, repair_targets=`[]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "datastax/python-driver/cassandra/cqlengine/query.py", "license": "apache-2.0", "note": "license: bigquery_api"}}] repair_candidates=`[5, 18, 2, 13, 21]`, repair_targets=`[]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "OrbitzWorldwide/droned/droned/lib/droned/models/app.py", "license": "apache-2.0", "note": "license: bigquery_api"}}] repair_candidates=`[5, 19, 22, 2, 12, 25, 31]`, repair_targets=`[]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "cpbotha/nvpy/nvpy/view.py", "license": "bsd-3-clause", "note": "license: manual_eval"}}] repair_candidates=`[2, 19, 27, 33, 41, 47, 4, 21, 24, 6, 29, 31]`, repair_targets=`[]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "natestedman/Observatory/observatory/dashboard/models/Event.py", "license": "isc", "note": "license: bigquery_api"}}] repair_candidates=`[5, 138, 141, 2, 16, 26, 31, 47, 57, 68, 73, 89, 99, 110, 115, 132]`, repair_targets=`[]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "walkr/hn/hn/workers.py", "license": "mit", "note": "license: bigquery_api"}}] repair_candidates=`[2, 15, 23, 26, 37, 48, 59, 70, 81, 92, 4, 20, 28, 39, 50, 61, 72, 83, 94]`, repair_targets=`[]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "torchbox/wagtail/wagtail/wagtailadmin/utils.py", "license": "bsd-3-clause", "note": "license: bigquery_api"}}] repair_candidates=`[3, 19, 45]`, repair_targets=`[10]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "ipfs/py-ipfs-api/ipfsApi/client.py", "license": "mit", "note": "license: bigquery_api"}}] repair_candidates=`[2, 14, 20, 27, 4, 24]`, repair_targets=`[]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "cloudify-cosmo/cloudify-manager/tests/mock_plugins/testmockoperations/tasks.py", "license": "apache-2.0", "note": "license: bigquery_api"}}] repair_candidates=`[5, 15, 27, 34, 12, 46]`, repair_targets=`[]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "ChrisBeaumont/brut/bubbly/model.py", "license": "mit", "note": "license: bigquery_api"}}] repair_candidates=`[2, 15, 32, 34]`, repair_targets=`[7]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "toastdriven/alligator/alligator/gator.py", "license": "bsd-3-clause", "note": "license: bigquery_api"}}] repair_candidates=`[4, 39, 2, 18, 25, 31, 16, 37, 7, 42, 45]`, repair_targets=`[]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "vbmendes/django-meio-easytags/src/easytags/tests/test_library.py", "license": "bsd-3-clause", "note": "license: bigquery_api"}}] repair_candidates=`[2, 32, 17, 23, 29, 39]`, repair_targets=`[7]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "practo/r5d4/r5d4/mapping_functions.py", "license": "mit", "note": "license: bigquery_api"}}] repair_candidates=`[18, 75, 159, 11, 50, 88, 101, 134, 172, 185, 99, 2, 95, 179, 13, 20, 4, 27, 111, 183, 198]`, repair_targets=`[122]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "BU-NU-CLOUD-SP16/Trusted-Platform-Module-nova/nova/compute/cells_api.py", "license": "apache-2.0", "note": "license: github_api"}}] repair_candidates=`[11, 65, 18, 43, 45, 62, 8, 60, 2, 50, 24, 28, 33, 58, 68, 6, 20, 26, 4, 56]`, repair_targets=`[]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "sassoftware/conary/conary/local/capsules.py", "license": "apache-2.0", "note": "license: bigquery_api"}}] repair_candidates=`[2, 16, 25, 32, 39, 5, 19, 22]`, repair_targets=`[]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "mozilla/inventory/vendor-local/src/django-tastytools/tastytools/fields.py", "license": "bsd-3-clause", "note": "license: bigquery_api"}}] repair_candidates=`[2, 16, 5, 23, 26]`, repair_targets=`[]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "scrapinghub/python-hubstorage/tests/test_retry.py", "license": "bsd-3-clause", "note": "license: bigquery_api"}}] repair_candidates=`[2, 4, 43, 97, 8, 50, 6, 71, 17, 29, 38, 99]`, repair_targets=`[23]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "openstack-infra/shade/shade/tests/unit/base.py", "license": "apache-2.0", "note": "license: bigquery_api"}}] repair_candidates=`[57, 100, 111, 118, 143, 61, 72, 94, 126, 135, 155, 2, 13, 43, 47, 82, 124, 149, 153, 164, 174, 20, 33]`, repair_targets=`[26]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "debrouwere/facebook-insights/facebookinsights/utils/api.py", "license": "isc", "note": "license: bigquery_api"}}] repair_candidates=`[2, 12, 23, 32, 5, 29]`, repair_targets=`[]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "viewfinderco/viewfinder/backend/www/admin/service.py", "license": "apache-2.0", "note": "license: bigquery_api"}}] repair_candidates=`[6, 24, 4, 22, 2, 17, 27]`, repair_targets=`[]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "ofermend/medicare-demo/socialite/jython/Lib/site-packages/pysvg/animate.py", "license": "apache-2.0", "note": "license: bigquery_api"}}] repair_candidates=`[2, 13, 18, 23]`, repair_targets=`[]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "openstack/ironic/ironic/tests/unit/db/utils.py", "license": "apache-2.0", "note": "license: bigquery_api"}}] repair_candidates=`[9, 14, 30, 35, 17, 26]`, repair_targets=`[]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "kurtiss/monque/monque/base.py", "license": "bsd-3-clause", "note": "license: bigquery_api"}}] repair_candidates=`[4, 11, 15, 18, 33, 47, 56, 67, 73, 78, 2, 22, 27, 80]`, repair_targets=`[]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "darcyliu/storyboard/boto/__init__.py", "license": "mit", "note": "license: bigquery_api"}}] repair_candidates=`[2, 25, 30, 6, 27]`, repair_targets=`[]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "sympy/sympy/sympy/functions/elementary/trigonometric.py", "license": "bsd-3-clause", "note": "license: manual_eval"}}] repair_candidates=`[7, 22, 37, 4, 33, 40, 14, 31, 2, 16, 20]`, repair_targets=`[]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "openstack/rally/rally/plugins/openstack/context/fuel.py", "license": "apache-2.0", "note": "license: bigquery_api"}}] repair_candidates=`[7, 64, 62, 2, 9, 26, 46, 60, 67]`, repair_targets=`[16]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "nltk/nltk/nltk/tgrep.py", "license": "apache-2.0", "note": "license: manual_eval"}}] repair_candidates=`[100, 110, 60, 62, 67, 178, 2, 6, 17, 38, 57, 64, 89, 121, 4, 114, 172, 85, 87, 93, 117, 119, 127, 73, 106, 139, 82, 102]`, repair_targets=`[134]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "openstack/compass-core/compass/db/api/metadata_holder.py", "license": "apache-2.0", "note": "license: bigquery_api"}}] repair_candidates=`[39, 54, 63, 207, 41, 71, 79, 91, 103, 115, 127, 139, 151, 163, 175, 187, 212, 28, 61, 205, 215, 221, 2, 16, 25, 44]`, repair_targets=`[]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "StackStorm/st2/st2client/tests/unit/test_shell.py", "license": "apache-2.0", "note": "license: bigquery_api"}}] repair_candidates=`[5, 22, 25, 2, 16, 28]`, repair_targets=`[]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "django/django/tests/model_fields/models.py", "license": "bsd-3-clause", "note": "license: manual_eval"}}] repair_candidates=`[2, 12, 22, 5, 28, 31]`, repair_targets=`[]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "thumbor/thumbor/vows/gif_engine_vows.py", "license": "mit", "note": "license: bigquery_api"}}] repair_candidates=`[2, 13, 19, 22, 28]`, repair_targets=`[]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "ralphm/wokkel/wokkel/test/test_muc.py", "license": "mit", "note": "license: bigquery_api"}}] repair_candidates=`[39, 87, 134, 81, 2, 26, 35, 55, 68, 100, 106, 110, 119, 128, 138, 156, 160, 166, 184, 192, 23, 65, 94, 132, 188, 53, 72, 77, 98, 115, 205, 122, 146, 150, 174, 178, 200]`, repair_targets=`[47]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "JoelBender/bacpypes/py27/bacpypes/bvll.py", "license": "mit", "note": "license: bigquery_api"}}] repair_candidates=`[2, 20, 32, 40, 46, 9, 26, 4, 29, 50]`, repair_targets=`[]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "openstack/python-sticksclient/sticksclient/common/http.py", "license": "apache-2.0", "note": "license: bigquery_api"}}] repair_candidates=`[2, 12, 21, 4, 18]`, repair_targets=`[]`
[{"datasetProvenance": {"datasetName": "ETHPy150Open", "filepath": "kivy/python-for-android/pythonforandroid/recipe.py", "license": "mit", "note": "license: bigquery_api"}}] repair_candidates=`[2, 12, 22, 31, 5, 28]`, repair_targets=`[]`
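
For anyone who wants to reproduce a list like the one above, here is a minimal sketch of the check. It assumes each line of the shard is one JSON object with the fields named in this issue (has_bug, repair_candidates, repair_targets, provenances); the function names are hypothetical and not part of the dataset's tooling.

```python
import json


def is_unrepairable(sample):
    """Return True for a buggy sample with no repair target among its candidates.

    Covers both cases from this issue: an empty repair_targets list, and
    repair_targets whose entries are all missing from repair_candidates.
    """
    if not sample.get("has_bug"):
        return False
    candidates = set(sample.get("repair_candidates", []))
    targets = sample.get("repair_targets", [])
    return not any(t in candidates for t in targets)


def find_unrepairable(lines):
    """Yield the parsed samples that fail the check (assumes one JSON object per line)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        sample = json.loads(line)
        if is_unrepairable(sample):
            yield sample


# Possible usage (the file layout is an assumption, see above):
# with open("train__VARIABLE_MISUSE__SStuB.txt-00000-of-00300") as f:
#     for s in find_unrepairable(f):
#         print(s.get("provenances"), s["repair_candidates"], s["repair_targets"])
```
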
VHellendoorn commented 3 years ago

Thanks for reaching out! This is indeed a known bug in this version of the dataset; we are planning to roll out a revision in the near future (though likely not before the NeurIPS deadline). In short, it stems from a tokenization issue where e.g. function headers were converted into a single token (e.g. def foo()), so that when foo was used as the buggy variable, the target could not be found.

This issue affects slightly under 1% of samples in this dataset. While it should not affect localization accuracy, it does make achieving perfect repair accuracy (anything over ~99%) impossible. In practice, current models struggle to exceed 80% (joint) accuracy, so it should not prevent significant innovations. This applies to the results in both the GREAT paper and the public replication package, so comparison with those numbers when keeping these examples as-is should be sound.
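
To make the ceiling concrete, here is a rough sketch of the arithmetic. It assumes repair accuracy is measured over buggy samples only and that an affected sample can never be counted as correctly repaired; the function name and the counts in the comment are made up for illustration and are not dataset statistics.

```python
def repair_accuracy_ceiling(num_buggy, num_unrepairable):
    """Best achievable repair accuracy if unrepairable samples always count as wrong.

    Assumption: the denominator is the number of buggy samples being evaluated.
    """
    if num_buggy <= 0:
        raise ValueError("num_buggy must be positive")
    if not 0 <= num_unrepairable <= num_buggy:
        raise ValueError("num_unrepairable must be between 0 and num_buggy")
    return 1.0 - num_unrepairable / num_buggy


# Illustrative only: with 1000 buggy samples of which 9 are affected,
# the ceiling comes out at roughly 0.991.
ceiling = repair_accuracy_ceiling(1000, 9)
```
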

Our apologies for the inconvenience; hope this helps. -Vincent

mallamanis commented 3 years ago

Thanks, this makes sense :)