Open YpLarryWang opened 1 year ago
Thanks for bringing up this issue. I have replaced the file IDs with those used in the Kaggle competition. The full dataset used in the Kaggle competition contained 6,482 essays. There may have been some misquoted numbers on the Kaggle website.
Scott
Thanks, Scott! I have checked the latest version of the corpus. It is much easier to match samples using the new CSV file with text_id_kaggle, but I'm afraid there is still a small problem with it.
There are 27 samples in the latest ELLIPSE_Final_github.csv file for which text_id_kaggle is represented in scientific notation, such as 8.33E+11 (Former British Minister Winston Churchill once...).
These values are stored as strings in the CSV, so they cannot be restored to their original form (e.g. 8.33E+11 would be converted to 833000000000, which is unlikely to be a real text ID).
Could you or your team please confirm the problem with the data type of text_id_kaggle?
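For reference, here is a minimal sketch of how the affected rows can be flagged. It assumes only that the column is named text_id_kaggle, as in the file above. The helper name find_sci_notation_ids, the full_text column, and the sample rows are hypothetical; only the value 8.33E+11 comes from the actual file.

```python
import csv
import io
import re

# Matches strings such as "8.33E+11": digits, optional fraction, then an exponent.
SCI_NOTATION = re.compile(r"^[+-]?\d+(?:\.\d+)?[eE][+-]?\d+$")


def find_sci_notation_ids(csv_file, column="text_id_kaggle"):
    """Return (line_number, value) pairs whose ID looks like scientific notation."""
    reader = csv.DictReader(csv_file)
    return [
        (line_number, row[column])
        for line_number, row in enumerate(reader, start=2)  # line 1 is the header
        if SCI_NOTATION.match(row[column].strip())
    ]


# Hypothetical sample data; "AAAA1111BBBB" is a made-up, well-formed ID.
sample = io.StringIO(
    "text_id_kaggle,full_text\n"
    "AAAA1111BBBB,Example essay one\n"
    "8.33E+11,Example essay two\n"
)

flagged = find_sci_notation_ids(sample)
print(flagged)

# Converting back only yields the rounded number, not the original 12-character ID.
print(int(float("8.33E+11")))
```

To run this against the real file, the StringIO object can be replaced with open("ELLIPSE_Final_github.csv", newline="", encoding="utf-8"), assuming the file is UTF-8 encoded.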
Best regards, Yupei
Hello! Thank you very much for open-sourcing your data, which is of great significance to research in the field of intelligent writing assessment!
According to the "Dataset Description" in the "data" section of the ELL competition on Kaggle, the competition data on Kaggle is directly sourced from the ELLIPSE corpus. However, I currently have two questions:
1. The text_id in the ELLIPSE corpus and the text_id in the ELL competition dataset on Kaggle are different. The text_id values in the ELLIPSE corpus are all 10-digit numbers, while the text_id values in the ELL competition dataset on Kaggle are 12-character combinations of letters and digits. As a result, users cannot easily determine which samples from the ELLIPSE corpus are present in the Kaggle ELL competition data and which are not.
2. The ELLIPSE_Final_github.csv file in the ELLIPSE corpus has a total of 6482 samples, while the ELL competition dataset on Kaggle has a total of 3914 samples (3911 in the training set and 3 in the test set). At the same time, according to the "Dataset Description" in the "data" section of the ELL competition on Kaggle, "The full test set comprises about 2700 essays." However, 6482 - 3914 = 2568, which is quite different from 2700. Could you explain how this difference came about?
Thank you so much! Best regards!
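The count comparison in the second question can be reproduced with a few lines. This is only a restatement of the numbers quoted above; the variable names are illustrative, and treating the remainder as a hidden test set is an assumption of the question, not a confirmed fact.

```python
# Numbers quoted in the question above.
ellipse_total = 6482      # rows in ELLIPSE_Final_github.csv
kaggle_train = 3911       # Kaggle training set
kaggle_test_public = 3    # Kaggle public test set
kaggle_stated_test = 2700 # "about 2700" per the Kaggle dataset description

kaggle_public_total = kaggle_train + kaggle_test_public

# Assumption: every ELLIPSE essay not published on Kaggle belongs to the hidden test set.
implied_hidden_test = ellipse_total - kaggle_public_total

# Gap between Kaggle's stated figure and the implied figure.
gap = kaggle_stated_test - implied_hidden_test

print(kaggle_public_total, implied_hidden_test, gap)
```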