Closed: manishshettym closed this 8 months ago
Better solution for performance and quality: pick K data points that are "representative" seed examples.
Performance: instead of picking a large number (max_init_beams) of seeds to compensate for potentially poor quality, we directly pick a small number (K) of seeds of representative quality.
https://github.com/tart-proj/codescholar/blob/4df46919b13be4ca36f8f09b3f0fda087491396d/codescholar/search/init_search.py#L65C1-L66C21
In the lines above, search initialization just picks the "first" max_init_beams examples that contain the seed. This can affect the quality of the idioms we get. Can we do better?
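To make the contrast concrete, here is a minimal sketch of the two strategies, not the actual codescholar code. It assumes each candidate example has already been turned into a numeric feature vector (the issue does not say how), and it reads "representative" as "covers the candidate set well", using greedy k-center selection as a stand-in. The names `first_n_seeds`, `representative_seeds`, and `features` are hypothetical.

```python
from math import dist


def first_n_seeds(candidates, max_init_beams):
    """Baseline from the issue: keep the first matches in iteration order."""
    return list(range(min(max_init_beams, len(candidates))))


def representative_seeds(features, k):
    """Greedy k-center selection: indices of up to k candidates that cover the set.

    `features` is a list of equal-length numeric vectors, one per candidate.
    This featurization is an assumption; the issue does not specify one.
    """
    if not features or k <= 0:
        return []
    n = len(features)
    dim = len(features[0])
    centroid = [sum(f[d] for f in features) / n for d in range(dim)]
    # Start from the candidate closest to the centroid (the most "typical" one).
    first = min(range(n), key=lambda i: dist(features[i], centroid))
    chosen = [first]
    # Distance from every candidate to its nearest chosen seed.
    nearest = [dist(f, features[first]) for f in features]
    while len(chosen) < min(k, n):
        # Add the candidate that the current seeds cover worst.
        nxt = max(range(n), key=lambda i: nearest[i])
        if nearest[nxt] == 0:
            break  # every remaining candidate duplicates a chosen seed
        chosen.append(nxt)
        nearest = [min(nearest[i], dist(features[i], features[nxt])) for i in range(n)]
    return chosen


# Toy data: two tight groups of near-duplicates plus one point in between.
features = [[0, 0], [0.1, 0], [0, 0.1], [10, 10], [10.1, 10], [5, 5]]
baseline = first_n_seeds(features, 3)
picked = representative_seeds(features, 3)
```

On this toy data the baseline returns three near-duplicates from the first group, while the k-center pick takes one candidate from each region. Whether this notion of "representative" fits idiom search depends on the featurization, which is left open here.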