huggingface / cosmopedia

Apache License 2.0

Fantastic work! Is code data considered in Cosmopedia? #11

Open UniverseFly opened 8 months ago

UniverseFly commented 8 months ago

Wow, this is super cool work, and thanks for open-sourcing everything! Does Cosmopedia incorporate code data as seeds, rephrasing it into high-quality data? We explored this in Magicoder for instruction tuning, but in our case the "rephrasing" required very delicate prompt design, so I am quite excited about this development and would love to hear any insights on rephrasing code instructions.

loubnabnl commented 8 months ago

Thank you! Yes, we're planning to try generating code data. We could use the Magicoder instructions to generate coding tutorials, similar to how we used UltraChat and OpenHermes. It might take a few iterations, though, since the results depend heavily on the coding performance of the LLM we use, much like the issues we've seen with math reasoning.