scrosseye / ELLIPSE-Corpus

the English Language Learner Insight, Proficiency and Skills Evaluation (ELLIPSE) Corpus

On the relationship between the ELLIPSE corpus and the ELL competition dataset on Kaggle #1

Open YpLarryWang opened 1 year ago

YpLarryWang commented 1 year ago

Hello! Thank you very much for open sourcing your data, which is of great significance to the research in the field of intelligent writing assessment!

According to the "Dataset Description" in the "data" section of the ELL competition on Kaggle, the competition data on Kaggle is directly sourced from the ELLIPSE corpus. However, I currently have two questions:

  1. The text_id formats differ between the ELLIPSE corpus and the ELL competition dataset on Kaggle. In the ELLIPSE corpus, every text_id is a 10-digit number, while in the Kaggle ELL competition dataset every text_id is a 12-character combination of letters and digits. As a result, users cannot easily determine which samples from the ELLIPSE corpus appear in the Kaggle ELL competition data and which do not.
  2. The number of entries in this corpus does not match the description on Kaggle. The ELLIPSE_Final_github.csv file in the ELLIPSE corpus has a total of 6482 samples, while the ELL competition dataset on Kaggle has a total of 3914 samples (3911 in the training set and 3 in the test set). At the same time, according to the "Dataset Description" in the "data" section of the ELL competition on Kaggle, "The full test set comprises about 2700 essays." However, 6482 - 3914 = 2568, which falls 132 essays short of 2700. Could you explain how this difference came about?
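To make the two questions concrete, here is a minimal sketch that reproduces the count arithmetic and classifies an ID by the two formats described above. The counts come from this issue; the function name `id_format`, the variable names, and the sample IDs are made up for illustration and are not taken from either dataset.

```python
import re

# Counts quoted in this issue.
corpus_total = 6482        # rows in ELLIPSE_Final_github.csv
kaggle_train = 3911        # Kaggle training set
kaggle_test_public = 3     # Kaggle public test set
kaggle_test_stated = 2700  # "about 2700" per the Kaggle description

kaggle_public = kaggle_train + kaggle_test_public
remaining = corpus_total - kaggle_public   # essays not in the public Kaggle files
gap = kaggle_test_stated - remaining       # shortfall against the stated test size


def id_format(text_id: str) -> str:
    """Classify an ID by the two formats described in the issue (hypothetical helper)."""
    if re.fullmatch(r"\d{10}", text_id):
        return "ellipse"   # 10-digit number
    if re.fullmatch(r"[0-9A-Za-z]{12}", text_id):
        return "kaggle"    # 12 letters or digits
    return "unknown"


print(kaggle_public, remaining, gap)
print(id_format("1234567890"), id_format("A1B2C3D4E5F6"))  # invented sample IDs
```

Because the two formats never overlap, a format check alone cannot map one ID scheme to the other; a shared column or an explicit mapping file would be needed for that.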

Thank you so much! Best regards!

scrosseye commented 1 year ago

Thanks for bringing up this issue. I have replaced the file IDs with those used in the Kaggle competition. The full dataset used in the Kaggle competition was 6,482 essays. There may have been some misquoted numbers on the Kaggle website.

Scott

YpLarryWang commented 1 year ago

Thanks, Scott! I have checked the latest version of the corpus. It is much easier to match samples using the new CSV file with text_id_kaggle, but I'm afraid there is still a small problem with it.

In the latest ELLIPSE_Final_github.csv file, 27 samples have a text_id_kaggle written in scientific notation, such as 8.33E+11 (the essay beginning "Former British Minister Winston Churchill once...").

These values are stored as strings in the CSV, so the original IDs cannot be restored from them (e.g. 8.33E+11 converts to 833000000000, which should not be a real text ID).
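A minimal sketch of how such rows could be flagged, using only the Python standard library. The inline CSV is a made-up stand-in for ELLIPSE_Final_github.csv: only the column name text_id_kaggle and the value 8.33E+11 come from this thread, and the other IDs, the full_text column name, and the essay snippets are invented. The pattern requires an explicit sign in the exponent, so an ordinary ID made only of letters and digits is never flagged.

```python
import csv
import io
import re

# Matches values such as "8.33E+11". The mandatory +/- in the exponent
# cannot occur in an ID made only of letters and digits.
SCI_NOTATION = re.compile(r"^\d+(?:\.\d+)?[eE][+-]\d+$")

# Hypothetical stand-in for ELLIPSE_Final_github.csv.
sample_csv = io.StringIO(
    "text_id_kaggle,full_text\n"
    "A1B2C3D4E5F6,First essay\n"
    "8.33E+11,Second essay\n"
    "0123456789AB,Third essay\n"
)

rows = list(csv.DictReader(sample_csv))
damaged = [
    row["text_id_kaggle"]
    for row in rows
    if SCI_NOTATION.match(row["text_id_kaggle"])
]

# Converting back only recovers the rounded value, not the original ID.
restored = int(float(damaged[0]))

print(damaged)
print(restored)
```

The converted value has 12 digits, but every digit after the third is zero, which illustrates why the original ID cannot be recovered from the string alone.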

Could you or your team please confirm the problem with the data type of text_id_kaggle?

Best regards, Yupei