Duplications in evol_instruct_70k dataset

chintu619 commented 1 year ago

Firstly, thanks for publicly releasing the instruction dataset.

While looking through the dataset, I've noticed several examples where the output is already present in the instruction with barely any changes. Example (index 47821 from the dataset):

Instruction:

Can you provide a table of 6 different tools used for data science along with their descriptions and features?
| Tool | Description | Features |
|------|-------------|----------|
| Python | A high-level programming language used for data analysis and machine learning | Large standard library, easy to learn, supports multiple paradigms |
| R | A programming language and software environment for statistical computing and graphics | Large community, extensive library of statistical and graphical methods, open-source |
| SQL | A domain-specific language used in programming and designed for managing data held in a relational database management system | Easy to learn, widely used, powerful for managing large amounts of structured data |
| Tableau | A data visualization and business intelligence tool used to create interactive dashboards and reports | User-friendly interface, drag-and-drop functionality, ability to connect to multiple data sources |
| Apache Spark | An open-source distributed computing system used for big data processing and machine learning | Fast processing speed, ability to handle large datasets, supports multiple languages |
| TensorFlow | An open-source software library for dataflow and differentiable programming across a range of tasks | Popular for machine learning and deep learning, supports distributed computing, flexible architecture |

Output:

Sure, here's a table of 6 different tools used for data science along with their descriptions and features:
| Tool | Description | Features |
|------|-------------|----------|
| Python | A high-level programming language used for data analysis and machine learning | Large standard library, easy to learn, supports multiple paradigms |
| R | A programming language and software environment for statistical computing and graphics | Large community, extensive library of statistical and graphical methods, open-source |
| SQL | A domain-specific language used in programming and designed for managing data held in a relational database management system | Easy to learn, widely used, powerful for managing large amounts of structured data |
| Tableau | A data visualization and business intelligence tool used to create interactive dashboards and reports | User-friendly interface, drag-and-drop functionality, ability to connect to multiple data sources |
| Apache Spark | An open-source distributed computing system used for big data processing and machine learning | Fast processing speed, ability to handle large datasets, supports multiple languages |
| TensorFlow | An open-source software library for dataflow and differentiable programming across a range of tasks | Popular for machine learning and deep learning, supports distributed computing, flexible architecture |

Indices of some similar samples: 17421, 17926, 24717, 29166, 41705, 61826, 62587
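For anyone who wants to check for more of these, here is a rough sketch of how such samples could be flagged. It assumes each record is a dict with `instruction` and `output` string fields, and the function names and the 0.8 threshold are illustrative choices, not something taken from the WizardLM code. It measures what fraction of the non-empty output lines appear verbatim in the instruction:

```python
def echoed_line_fraction(instruction: str, output: str) -> float:
    """Return the fraction of non-empty output lines that also appear
    verbatim (after stripping whitespace) as a line of the instruction."""
    instruction_lines = {
        line.strip() for line in instruction.splitlines() if line.strip()
    }
    output_lines = [line.strip() for line in output.splitlines() if line.strip()]
    if not output_lines:
        return 0.0
    echoed = sum(1 for line in output_lines if line in instruction_lines)
    return echoed / len(output_lines)


def find_echoed_samples(records, threshold=0.8):
    """Return indices of records whose output mostly repeats the instruction.

    `records` is assumed to be a sequence of dicts with "instruction" and
    "output" keys; `threshold` is an arbitrary cutoff to tune by inspection.
    """
    return [
        index
        for index, record in enumerate(records)
        if echoed_line_fraction(record["instruction"], record["output"]) >= threshold
    ]
```

In the example above, 8 of the 9 output lines (the whole table) are copied from the instruction, so the fraction is about 0.89 and the sample would be flagged at the 0.8 cutoff. Exact line matching will miss outputs that paraphrase the instruction, so this is only a first-pass filter.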

Is this due to a bug during the creation of the dataset? Was the WizardLM model trained with such examples?

victorsungo commented 1 year ago

@chintu619 Thanks for your kind feedback.

Yes, we also found this issue in this version of the training data. Evol-Instruct sometimes outputs repetitive content that has already appeared in the instruction, and in a few cases WizardLM acts like a chatterbox who asks and answers his own questions...

We are focusing on improving Evol-Instruct and hope to solve the issue in the next version of WizardLM. After that, we will also release the Evol-Instruct code, and we welcome your suggestions about the failed evolutions.

teknium1 commented 1 year ago

> @chintu619 Thanks for your kind feedback.
>
> Yes, we also found this issue in this version of the training data. Evol-Instruct sometimes outputs repetitive content that has already appeared in the instruction, and in a few cases WizardLM acts like a chatterbox who asks and answers his own questions...
>
> We are focusing on improving Evol-Instruct and hope to solve the issue in the next version of WizardLM. After that, we will also release the Evol-Instruct code, and we welcome your suggestions about the failed evolutions.

Does the 300,000-sample dataset already exist? Can it be released?

nlpxucan / WizardLM

Duplications in evol_instruct_70k dataset #8