mohaddad / COVID-FAKES

Bilingual (Arabic/English) COVID-19 Twitter dataset for misleading information detection

Please map the column names to their position in the Ar CSV files and provide the original assigned labels #4

Closed Sue-Fwl closed 3 years ago

Sue-Fwl commented 3 years ago

Greetings, Thank you for keeping the access open to other researchers.

Please map the column names to their positions in the Ar CSV files and provide the original assigned labels, because the files have no headers and the mapping isn't mentioned in either GitHub or the paper.

Sue-Fwl commented 3 years ago

TweetID;C1_1;C1_2;C1_3;c1_4;c1_5;c1_6;c1_7;c2_1;c2_2;c2_3;C2_4;C2_5;C2_6;C2_7;C3_1;C3_2;C3_3;C3_4;C3_5;C3_6;C3_7;C4_1;C4_2;C4_3;C4_4;C4_5;C4_6;C4_7;C5_1;C5_2;C5_3;C5_4;C5_5;C5_6;C5_7;C6_1;C6_2;C6_3;C6_4;C6_5;C6_6;C6_7;C7_1;C7_2;C7_3;C7_4;C7_5;C7_6;C7_7;C8_1;C8_2;C8_3;C8_4;C8_5;C8_6;C8_7;C9_1;C9_2;C9_3;C9_4;C9_5;C9_6;C9_7;C10_1;C10_2;C10_3;C10_4;C10_5;C10_6;C10_7;C11_1;C11_2;C11_3;C11_4;C11_5;C11_6;C11_7;C12_1;C12_2;C12_3;C12_4;C12_5;C12_6;C12_7;C13_1;C13_2;C13_3;C13_4;C13_5;C13_6;C13_7;C_ENS

These were the headers of the EN data files, but the Arabic ones had none. Also, the label column (Label: Real=1, Misleading=0) isn't found in either dataset.
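For reference, a minimal pandas sketch that applies these header names to a headerless file (the file name is hypothetical, and this assumes the Arabic files use the same semicolon delimiter as the EN ones):

```python
import pandas as pd

# Build the 93 column names: TweetID, C1_1..C13_7 (13 classifiers x 7 features), C_ENS.
columns = (["TweetID"]
           + [f"C{c}_{f}" for c in range(1, 14) for f in range(1, 8)]
           + ["C_ENS"])

# "ClaimsAR_01.csv" is a hypothetical file name; the Arabic CSVs ship without a header row.
df = pd.read_csv("ClaimsAR_01.csv", sep=";", header=None, names=columns)
```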

mohaddad commented 3 years ago

Dear Sue-Fwl,

Labels: (Real=1, Misleading=0).

-To annotate the dataset, we trained a binary classification model for each of 13 different machine learning algorithms, using 7 different feature extraction (term weighting) techniques with each algorithm.

-For generalization, we report the class obtained for each Tweet under each of the 7 feature extraction techniques (Term Frequency (TF); Term Frequency Inverse Document Frequency (TF-IDF) at the unigram, bigram, trigram, N-gram, and character levels; and Word Embedding), for each of the 13 machine learning algorithms (Decision Tree (DT), k-Nearest Neighbor (kNN), Logistic Regression (LR), Linear Support Vector Machines (LSVM), Multinomial Naive Bayes (MNB), Bernoulli Naive Bayes (BNB), Perceptron, Neural Network (NN), Ensemble Random Forest (ERF), Extreme Gradient Boosting (XGBoost), Bagging Meta-Estimator (BME), AdaBoost, and Gradient Boosting (GB)).

-The English part of the dataset consists of 31 ".CSV" files; each file contains 100,000 TweetIDs and their corresponding classes.

-The Arabic part of the dataset consists of 22 ".CSV" files; each file contains 10,000 TweetIDs and their corresponding classes.

-The classifiers are as follows:
       C1: Decision Trees.
       C2: Multinomial Naive Bayes.
       C3: Bernoulli Naive Bayes.
       C4: Logistic Regression.
       C5: k-Nearest Neighbors.
       C6: Perceptron.
       C7: Multilayer Perceptron.
       C8: Linear SVM.
       C9: Random Forest.
       C10: Bagging meta-estimator.
       C11: XGBoost.
       C12: AdaBoost.
       C13: Gradient Boosting (GBM).

-The used feature extraction techniques are as follows (a small code sketch of this mapping follows the list):
       1: Term Frequency (TF).
       2: Term Frequency Inverse Document Frequency (TF-IDF) - unigram.
       3: Term Frequency Inverse Document Frequency (TF-IDF) - bigram.
       4: Term Frequency Inverse Document Frequency (TF-IDF) - trigram.
       5: Term Frequency Inverse Document Frequency (TF-IDF) - N-gram.
       6: Term Frequency Inverse Document Frequency (TF-IDF) - character level.
       7: Word Embedding.

-The column title indicates the classifier and feature extraction technique used for assigning the corresponding class label (e.g., "C1_1 ----> Decision Trees using Term Frequency (TF)", "C4_7 ----> Logistic Regression using Word Embedding").
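A minimal sketch of that column-code mapping in Python (the dictionaries and the `describe` helper are mine, derived from the lists above, not part of the dataset):

```python
CLASSIFIERS = {
    1: "Decision Trees", 2: "Multinomial Naive Bayes", 3: "Bernoulli Naive Bayes",
    4: "Logistic Regression", 5: "k-Nearest Neighbors", 6: "Perceptron",
    7: "Multilayer Perceptron", 8: "Linear SVM", 9: "Random Forest",
    10: "Bagging meta-estimator", 11: "XGBoost", 12: "AdaBoost",
    13: "Gradient Boosting (GBM)",
}
FEATURES = {
    1: "Term Frequency (TF)", 2: "TF-IDF unigram", 3: "TF-IDF bigram",
    4: "TF-IDF trigram", 5: "TF-IDF N-gram", 6: "TF-IDF character level",
    7: "Word Embedding",
}

def describe(column: str) -> str:
    """Expand a column code such as 'C4_7' into a human-readable description."""
    clf, feat = column.lstrip("C").split("_")
    return f"{CLASSIFIERS[int(clf)]} using {FEATURES[int(feat)]}"

print(describe("C1_1"))  # Decision Trees using Term Frequency (TF)
print(describe("C4_7"))  # Logistic Regression using Word Embedding
```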

I also recommend taking a look at this paper. Moreover, the headers for the Arabic dataset are the same as those of the English dataset.

Warm regards,

Mohamed Elhaddad


Sue-Fwl commented 3 years ago

Appreciate your time and prompt reply,

I apologize if my question wasn't clear enough,

When I mentioned the "labels (misleading/Real) / original assigned labels", I meant the "Target labels" that the training was carried out on, not the predicted labels, since to compare my work against the COVID-FAKES predicted classes I need the Target labels. I guess trying to find the Target Labels column was the cause of my misunderstanding of the column order.

Finally, just for reassurance, does the column "C_ENS" stand for end of string?

mohaddad commented 3 years ago

Hi Sue-Fwl,

Our work was originally divided into two phases:

1. The model-building phase, which aimed to train and evaluate a binary classification model for 13 different ML algorithms, combined with seven different feature extraction techniques, to predict whether a given piece of information is Real or Misleading, and
2. The annotation phase, where the trained models were used to assign ground-truth labels (Real information and Misleading information) to a set of 3,047,110 tweets related to COVID-19.

In both phases, the data was subject to the same preprocessing and feature engineering steps (see the IEEE Access paper for more details).

For the model-building phase (training and evaluation), we collected a set of ground-truth information related to COVID-19 by scraping the official websites of the WHO, UNICEF, and UN as sources of reliable information. Besides, we enriched the collected ground truth with pre-checked COVID-19 facts from various fact-checking websites, including snopes.com, washingtonpost.com/news/fact-checker, and politifact.com (see the IEEE Access paper for full details).

In total, we collected 7,486 ground-truth samples (1,602, or 21.4%, misleading information and 5,884, or 78.6%, real information), with about 75% of the samples having up to 200 characters and up to 30 words. The samples were then split into 80% training and 20% test sets, and the ML models were trained and evaluated using 5-fold cross-validation with 12 performance metrics (see the IEEE Access paper for full details).
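As a rough illustration (not the authors' actual code), the described setup maps onto scikit-learn roughly like this; `load_ground_truth` is a hypothetical placeholder for the scraped WHO/UNICEF/UN samples:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical loader for the ground-truth samples; labels use Real=1, Misleading=0.
texts, labels = load_ground_truth()

# 80% training / 20% test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

# One of the 13 x 7 combinations: Logistic Regression with TF-IDF unigrams (i.e., C4_2).
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 1)),
                      LogisticRegression(max_iter=1000))

# 5-fold cross-validation on the training split; F1 is one of the 12 reported metrics.
f1_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="f1")
```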

The reported results are highly promising and suggest the validity of all tested models (F-scores greater than 95%), with the Neural Network, Decision Tree, and Logistic Regression classifiers reaching superior overall performance.

Lastly, we deployed a voting ensemble method to generate the final ground-truth labels based on the output of all tested classifiers.

In the annotation phase, the trained models were then used to assign ground-truth labels to a set of tweets in both the Arabic and English languages (see the IEEE Access paper for more details).

These tweets were continuously collected from February 04, 2020 (four days after the outbreak was declared a Public Health Emergency of International Concern by the World Health Organization) to March 10, 2020 (one day before the outbreak was declared a Pandemic).

For instance, the English dataset includes 3,047,110 tweet IDs along with the respective ground truth labels (real/misleading) generated by each of the 13 classifiers tested via the seven feature extraction techniques, as well as the voting ensemble method.

I provided all of this so that researchers can run analyses and test their techniques. The last column represents the result of voting across all the preceding columns. I suggest choosing only the three classifiers with the best performance (from your point of view), instead of voting across all the classifiers, as in the sketch below.
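A sketch of that suggestion, assuming a DataFrame `df` loaded as in the earlier snippet; the three columns chosen here are purely illustrative, not a recommendation:

```python
# Majority vote over three hand-picked classifier/feature columns (illustrative choice).
chosen = ["C1_1", "C4_7", "C7_2"]

# A tweet is labeled Real (1) when at least two of the three columns agree on 1.
df["my_label"] = (df[chosen].sum(axis=1) >= 2).astype(int)
```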

Warm regards,

Mohamed K. Elhadad


Sue-Fwl commented 3 years ago

Grateful for your thorough explanation and prompt replies,

I will make sure to do as you suggested.

Then, does it mean that you don't have the Target labels of the 220,000 tweets that were annotated through the model? Is the model following an unsupervised learning method? Because if not, the predicted data needs target labels; otherwise, how did you calculate the error rate? Quoting,

where the trained model was used to assign ground-truth labels (Real information and Misleading information) to a set of 3,047,110 tweets related to COVID-19.

The classifiers used were mostly supervised ML classifiers that need "target labels" to train on before assigning ground-truth labels to data, and those labels are what I need when comparing any model to yours.

The reason for asking specifically for the "Target labels of training/testing" is that training on your model's predicted outcomes (the labels provided in the CSVs) will increase the error rate in my model by the percentage of error in your model. Furthermore, as I reread your paper, I noticed no mention of the Accuracy, Precision, Recall, or F-score; would you please provide them? (to test my techniques, as you mentioned)

Also, if available, please provide me with the mentioned training Target labels.

Really appreciate your patience and consideration,

mohaddad commented 3 years ago

Dear Sue-Fwl,

Please consider reading this paper,

https://ieeexplore.ieee.org/document/9189767

I think you may be missing something about my proposed technique!

I used what I have named ground-truth data, collected from the WHO, UNICEF, UN, and different fact-checking websites, to train models for different ML algorithms using different feature extraction techniques. Next, I used the obtained models to classify (label) the collected Tweets. These Tweets originally were not labeled (they had no original labels). Each column in the provided data represents the label obtained for the Tweets when using a particular feature extraction technique with a certain classification algorithm.

For example, the first column represents the Tweets' IDs, while the second column represents the labels obtained when using Term Frequency (TF) as the feature extraction technique and the Decision Tree (DT) as the classification algorithm.

Warm regards, Mohamed Elhaddad


Sue-Fwl commented 3 years ago

Many thanks. What led me to the data was reading "COVID-19-FAKES: A Twitter (Arabic/English) Dataset for Detecting Misleading Information on COVID-19". I didn't come across the one you linked, my bad; skimming through it, I can see the evaluation results.

I think you may be missing something about my proposed technique!

Indeed, I didn't understand why there were no target labels. I'll read through and comment back.

Best Regards,