manishshettym / codescholar

codescholar: growing programs graphs idiomatically for API usage examples

Init search does not care about the quality of the initial seeds #21

Closed manishshettym closed 8 months ago

manishshettym commented 8 months ago

https://github.com/tart-proj/codescholar/blob/4df46919b13be4ca36f8f09b3f0fda087491396d/codescholar/search/init_search.py#L65C1-L66C21

In the lines above, search initialization just picks the "first" max_init_beams examples that contain the seed, regardless of how good those examples are. This can hurt the quality of the idioms we get. Can we do better?
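To make the concern concrete, here is a minimal sketch of first-N selection. It is not the code in init_search.py: the function name `first_n_seed_examples`, the substring match, and the toy corpus are all hypothetical and only illustrate the behaviour described above.

```python
from itertools import islice


def first_n_seed_examples(examples, seed, max_init_beams):
    """Hypothetical stand-in: keep the first matches in corpus order."""
    matches = (ex for ex in examples if seed in ex)
    return list(islice(matches, max_init_beams))


# Toy corpus (made up): the first matches are near-duplicates, while the
# more varied usages appear later and are never selected.
corpus = [
    "np.mean(x)",
    "np.mean(y)",
    "np.mean(z)",
    "df.groupby('k').agg(np.mean)",
    "np.mean(a, axis=0)",
]

picked = first_n_seed_examples(corpus, "np.mean", 3)
print(picked)
```

With a corpus ordered like this, the selected examples depend only on ordering, so the starting beams can end up redundant.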

manishshettym commented 8 months ago

Better solution for performance and quality:

  1. Choose K data points that are "representative" seed examples.
  2. Representative = cluster the examples and choose the centroids
  3. K = number of clusters

Performance: instead of picking a large number (max_init_beams) of seeds to compensate for potentially poor quality, we directly pick a small number (K) of representative seeds.
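A rough sketch of the idea above, under several assumptions that are not in the issue: each candidate example already has a fixed-length embedding vector, clustering is plain k-means, and initialization uses a deterministic farthest-point heuristic so the result is reproducible. Because a k-means centroid is a mean vector rather than an actual example, the sketch picks the example closest to each centroid. The names `farthest_point_init`, `kmeans`, and `pick_representative_seeds` are hypothetical.

```python
import math


def farthest_point_init(points, k):
    """Deterministic init: start at points[0], then repeatedly add the
    point farthest from everything chosen so far."""
    chosen = [list(points[0])]
    while len(chosen) < k:
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in chosen))
        chosen.append(list(nxt))
    return chosen


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(d) / len(members) for d in zip(*members)]
    return centroids, labels


def pick_representative_seeds(embeddings, k):
    """Return indices of the examples closest to each of the K centroids."""
    centroids, labels = kmeans(embeddings, k)
    picked = []
    for c, centroid in enumerate(centroids):
        members = [i for i, l in enumerate(labels) if l == c]
        if members:
            picked.append(
                min(members, key=lambda i: math.dist(embeddings[i], centroid))
            )
    return sorted(picked)


# Toy 2-D embeddings (made up): three tight groups of three examples each.
embeddings = [
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
    [10.0, 0.0], [10.1, 0.0], [10.0, 0.1],
]

picked = pick_representative_seeds(embeddings, k=3)
print(picked)  # one index per group
```

Here K seeds cover all three groups, whereas taking the first three examples in order would cover only one. How to embed examples and how to choose K are left open.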