Preprocess data for stance classification

AndreeaCoaja commented 2 years ago

Hi everyone!

Before I start implementing stance classification, I need to preprocess the dataset we were given. In this dataset the result is encoded as 0: No stance, 1: Neutral, 2: Pro first object, 3: Pro second object. Object 1 and object 2 are also available in the dataset. I thought I could take the objects as labels to predict, but I think this would lead to an unbalanced dataset. Do you have any ideas on how I could process the dataset better?

ashishrana160796 commented 2 years ago

I would say that taking two sequences, [s1, s2] or [query, doc], as inputs and using AutoModelForSequenceClassification with RoBERTa representations to produce the stance outputs ('0: No stance, 1: Neutral, 2: Pro') would be a good starting point. I am not sure how the objects would be used in stance classification, so if you find anything that helps with stance classification, do try that approach as well. And if we run into an unbalanced-class problem, please refer to the two-step classification used in this paper: https://arxiv.org/abs/2104.11572

Finally, if you face any issues with the pre-processing part, we can discuss that too.
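
A minimal sketch of the [query, doc] pairing suggested above. The record keys `question`, `answer`, and `stance` are hypothetical, and collapsing the two "Pro" labels into a single class is only one reading of the three-way output listed above, not something the dataset documentation confirms.

```python
# Label encoding from the dataset description; the names are illustrative.
STANCE_NAMES = {0: "NO", 1: "NEUTRAL", 2: "PRO_FIRST", 3: "PRO_SECOND"}


def to_pair_example(record, collapse_pro=True):
    """Turn one record into a (query, doc, label) triple for a sequence-pair classifier."""
    label = record["stance"]
    if label not in STANCE_NAMES:
        raise ValueError(f"unknown stance label: {label!r}")
    if collapse_pro and label == 3:
        label = 2  # treat both 'Pro' labels as a single 'Pro' class
    return record["question"].strip(), record["answer"].strip(), label


# Made-up records, only to show the shape of the data.
records = [
    {"question": "Is Python or R better for data analysis?",
     "answer": "Python has the richer ecosystem.", "stance": 2},
    {"question": "Is Python or R better for data analysis?",
     "answer": "Both are fine.", "stance": 1},
    {"question": "Is Python or R better for data analysis?",
     "answer": "I prefer R.", "stance": 3},
]
pairs = [to_pair_example(r) for r in records]
```

The first two elements of each triple can then be passed as the two text arguments of a Hugging Face tokenizer, which encodes them as one sequence pair.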

ashishrana160796 commented 2 years ago

Hi @AndreeaCoaja,

Q1. Is the merged dataset available to us after the blue screen error? If yes, can you share it?
Q2. If not, did you try merging the dataset on Google Colab? Is it possible to merge the dataset on Google Colab if we upload only the zipped files and unzip them into the session storage rather than onto Google Drive?

Thanks!

AndreeaCoaja commented 2 years ago

Hi @ashishrana160796!

I have the merged dataset from Yahoo, which is about 12GB. There is also the other dataset from Touché, with which the merge has to be done. When I tried doing this, my laptop crashed. I didn't use Google Colab because I didn't think about zipping the dataset. I'm now trying to zip and upload it, and afterwards I'll try to merge them. But how is this going to help with training if the labels are 0, 1, 2, 3?

ashishrana160796 commented 2 years ago

Hello @AndreeaCoaja!

I wasn't convinced by the modeling strategy you chose for training the answer stance prediction model. I have usually seen such problems formulated as entailment tasks, as in the papers mentioned below, especially the one by the QMUL-SDS team. I have also just added some sample stance prediction code to help you.

I would say that taking two sequences, [s1, s2] or [query, doc], as inputs and using AutoModelForSequenceClassification with RoBERTa representations to produce the stance outputs ('0: No stance, 1: Neutral, 2: Pro') would be a good starting point. I am not sure how the objects would be used in stance classification, so if you find anything that helps with stance classification, do try that approach as well. And if we run into an unbalanced-class problem, please refer to the two-step classification used in this paper: https://arxiv.org/abs/2104.11572. Finally, if you face any issues with the pre-processing part, we can discuss that too.

I do understand that getting no results is frustrating, but I think this design might just work out! :bulb:

Second, suppose we follow the above modelling design of the QMUL-SDS team, which looks like the most promising and the easiest one to implement. For this, we definitely need the merged Yahoo dataset to get more training data for the stance prediction model, as the data is currently limited to only 455 entries, with few 'NEUTRAL' and 'NO' examples.

The design I have in mind, based on QMUL-SDS's implementation, is to train 3 stance prediction models.

1. First model (the one I have committed to my repo, notebook link), an object detector: it detects whether an object is present in the answer document for the given query or not. That's it! This is the first step, to which the two other stance prediction models connect.
2. Second model, an object classifier: it takes the sentences labelled 'OBJECT' by the first model and predicts object 1 or object 2 in binary form. Its training set is the subset of the original dataset that contains only the 'OBJECT_ONE' and 'OBJECT_TWO' labels.
3. Third model, a neutral classifier: like the ones above, it is a binary model, and it differentiates between neutral and no stance. It takes the inputs labelled 'NO_OBJECT' and predicts 'NEUTRAL' or 'NO' stance. Its training set is the subset of the original dataset that contains only the neutral and no-stance documents for the given query.

With this design, I am quite sure that we will be able to solve the 4-label problem and still use pre-trained models, since pre-trained models don't actually work with 4 labels and will throw an error. Additionally, as you will have seen, we need quite a lot of data to make these models work, which is why merging the Yahoo dataset is very important.
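
A rough sketch of how the three-model design above could be wired together, under stated assumptions. The function names are made up for illustration, the stage labels follow the ones used in this comment, and the three `toy_*` functions are keyword placeholders standing in for trained models, not the notebook's implementation.

```python
# Label encoding taken from the dataset description in this thread.
NO, NEUTRAL, OBJECT_ONE, OBJECT_TWO = 0, 1, 2, 3


def build_training_sets(examples):
    """Split (text, label) pairs into the training sets of the three models."""
    detector, object_clf, neutral_clf = [], [], []
    for text, label in examples:
        if label in (OBJECT_ONE, OBJECT_TWO):
            detector.append((text, "OBJECT"))
            name = "OBJECT_ONE" if label == OBJECT_ONE else "OBJECT_TWO"
            object_clf.append((text, name))
        elif label in (NO, NEUTRAL):
            detector.append((text, "NO_OBJECT"))
            name = "NEUTRAL" if label == NEUTRAL else "NO"
            neutral_clf.append((text, name))
        else:
            raise ValueError(f"unexpected label: {label!r}")
    return detector, object_clf, neutral_clf


def predict_stance(text, detect, classify_object, classify_neutral):
    """Run the detector first, then only the second-stage model it selects."""
    if detect(text) == "OBJECT":
        return classify_object(text)
    return classify_neutral(text)


# Toy keyword rules standing in for the trained models (placeholders only).
def toy_detect(text):
    return "OBJECT" if "better" in text else "NO_OBJECT"


def toy_object(text):
    return "OBJECT_ONE" if text.startswith("Python") else "OBJECT_TWO"


def toy_neutral(text):
    return "NEUTRAL" if "both" in text.lower() else "NO"


print(predict_stance("Python is better", toy_detect, toy_object, toy_neutral))
```

Each of the three models only ever sees a binary problem, which is the point of the cascade. The price is that a wrong decision by the detector cannot be corrected by the second stage.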

I came up with this solution because I guess it is the easier one to implement. If you have anything else in mind, please feel free to implement it. The performance is currently quite average, but it will improve with the Yahoo dataset. There is also the option of using 2 stance prediction models: a first one for ['NO', 'NO_OBJECT', 'OBJECT'], followed by the ['OBJECT_ONE', 'OBJECT_TWO'] object classification. I guess we have to compare and see which one works better for us.

I have added the code notebook to the repository as well. I hope this information helps and that you can complete this task by Tuesday. Also, please feel free to involve Ahmed in this! cool!

ashishrana160796 commented 2 years ago

Some additional reference implementations that should be helpful too, I guess. You will have to prepare the prediction output files separately, though.

* SCIVER: https://github.com/allenai/scifact/blob/master/verisci/inference/label_prediction/transformer.py
* QMUL: https://github.com/XiaZeng0223/sciverbinary/tree/main/label/training
* RERRFACT: https://github.com/ashishrana160796/RerrFact/blob/main/training/Label-prediction.ipynb

AndreeaCoaja commented 2 years ago

Morning! I have just started merging the datasets, but I get the same error here... Here is the link to the file from Colab: https://drive.google.com/file/d/1NEHEC8dVnwDEmyLeem1FyRj2JnZyUWfA/view?usp=sharing

I have read your code, and this approach makes more sense to me too! Really nice idea.

ashishrana160796 commented 2 years ago

Hello @AndreeaCoaja,

I think it is not a good idea to merge the dataset, even though the instructions in the script say to do so. Better to use the code below, from the process_stance_dataset Jupyter notebook, to process the dataset in its split form.

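import os
import xml.etree.ElementTree as ET

from tqdm import tqdm

# Assumed to be defined earlier in the notebook: `df` (the stance DataFrame with an
# `id` column), `i` (the collection of ids to look up) and `clean` (a text-cleaning helper).

# Variables initialised in the notebook but not used in this excerpt.
idx=""
question=[]
ids=[]
answers = []
ans_niklas=[]
ans_found=[]
uris=[]
found=""

data_dir = "full"  # path to the directory with the split dataset (e.g., the folder is called full)
for file in tqdm(os.listdir(data_dir)):
    print(file)
    if '-' in file:  # skip files without '-' in their name
        tree = ET.parse(os.path.join(data_dir, file))
        for r in tree.iter(tag='vespaadd'):
            r = r[0]  # the first child of <vespaadd> holds the record fields
            uri = r.findtext('uri')
            if int(uri) in i:
                # Copy the cleaned question and best answer into every matching row.
                for idx, row in df.iterrows():
                    if int(uri) == row.id:
                        df.at[idx, 'question'] = clean(r.findtext('subject'))
                        df.at[idx, 'answer'] = clean(r.findtext('bestanswer'))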

This snippet, I guess, processes the data in its split form, and you can split it further into 4 files if that crashes the Colab environment. I think that working on the split files with the shared stance notebook (w/ stance dataset) code will work.

Can you try that approach and share what happens? Also, if the dataset gets built successfully, do upload it as well. Thanks!
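
One thing that might help with speed: the inner `df.iterrows()` loop in the snippet above rescans the whole table for every matching `uri`. A possible alternative, sketched here with plain dictionaries instead of pandas so that it stays self-contained, is to build an id-to-row index once. The names `build_id_index`, `fill_from_records`, `rows`, and the `(uri, subject, bestanswer)` tuples are all made up for illustration.

```python
from collections import defaultdict


def build_id_index(rows):
    """Map each id to the positions of the rows that carry it."""
    index = defaultdict(list)
    for pos, row in enumerate(rows):
        index[int(row["id"])].append(pos)
    return index


def fill_from_records(rows, records):
    """Copy subject and best answer into the matching rows; return how many rows were filled."""
    index = build_id_index(rows)
    filled = 0
    for uri, subject, best_answer in records:
        # .get() avoids creating empty entries for ids that are not in the table.
        for pos in index.get(int(uri), ()):
            rows[pos]["question"] = subject
            rows[pos]["answer"] = best_answer
            filled += 1
    return filled


# Made-up data: two rows share id 7, and the record with id 5 matches nothing.
rows = [{"id": 7}, {"id": 9}, {"id": 7}]
filled = fill_from_records(rows, [("7", "some question", "some answer"),
                                  ("5", "other question", "other answer")])
```

With the index in place, each record costs one dictionary lookup instead of one pass over the table.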

AndreeaCoaja commented 2 years ago

Hey @ashishrana160796! Before merging them, I tried to run this script on each part separately. The problem is that the last record in the XML file of the first part is missing a closing tag (">"); that tag is in the second part, so the problem disappears once the parts are joined.

ashishrana160796 commented 2 years ago

Hello @AndreeaCoaja, is it possible to append that closing tag at the end of the first file, add the opening tags to the second file, and get the thing working in parts? I am not sure about this either, just asking.
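
Another option that avoids editing the files at all is to feed the parts, in order, into one incremental parser, so that a record cut off at the end of one part is completed by the start of the next. The sketch below uses `xml.etree.ElementTree.XMLPullParser` from the standard library. It assumes the parts are byte-wise slices of one well-formed XML file, as the joined version working suggests. The element names come from the snippet earlier in this thread; the function name and chunk size are hypothetical.

```python
import xml.etree.ElementTree as ET


def iter_records(part_paths, tag="vespaadd", chunk_size=1 << 20):
    """Yield (uri, subject, bestanswer) from split XML parts read as one stream."""
    parser = ET.XMLPullParser(events=("start", "end"))
    root = None
    for path in part_paths:  # the parts must be given in their original order
        with open(path, "rb") as handle:
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:
                    break
                parser.feed(chunk)
                for event, elem in parser.read_events():
                    if event == "start":
                        if root is None:
                            root = elem  # remember the root so finished records can be dropped
                        continue
                    if elem.tag != tag or len(elem) == 0:
                        continue
                    doc = elem[0]  # the first child holds the record fields
                    yield (doc.findtext("uri"),
                           doc.findtext("subject"),
                           doc.findtext("bestanswer"))
                    root.clear()  # release processed records to keep memory flat
```

Because only one chunk and one record are held at a time, memory use should stay roughly constant regardless of the total size, which is what matters for a 12GB dataset on a laptop or a Colab session.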

AndreeaCoaja commented 2 years ago

I tried to do this in different ways. Obviously, the first try was to delete it manually, but I could only open the file for reading, not for writing. Then I tried doing it in Python, but whichever method I used to open it (reading the XML or using the ElementTree parser), it gave me an error at the end because the file doesn't respect the XML structure. Here's a screenshot:

Wouldn't it be an idea to use the server from uni? I don't know whether you managed to use it for the models, but I really don't know how to continue with this task. It shouldn't be hard, but I am not making any progress with it, and it is getting really frustrating...

Now I'm trying to do something in R to see if it's possible, and I'll let you know.

ashishrana160796 commented 2 years ago

Hello @AndreeaCoaja, I'd say please work on the scripts for the stance prediction model, i.e. the training and prediction scripts, which can be merged with the final outputs of our model. These won't be our final models, but the code structure will be ready and we will then only have to execute them. I don't think it will work with R either, because the file size calls for languages like Java or C/C++ for processing.

Hi @softgitron, can you download the Yahoo dataset and build a complete merged dataset from the Yahoo files? It will require more CPU resources and a language like Java or C++ for fast file handling. Would that be possible?

ashishrana160796 commented 2 years ago

Hello @AndreeaCoaja & @softgitron

I have processed the Yahoo dataset and I guess it looks OK! Andreea, you can look into it and check everything from your end as well. I am also sharing the working link for the notebook here. I have uploaded the dataset to my branch and attached it to this issue too.

We can now move forward with the model training part. Please check which of the above modeling approaches (3-step or 2-step) works better for stance prediction, and do try to write a prediction script as well. Let's try to get some valuable prediction and performance insights for the stance prediction models by Tuesday. cool, see ya!

processed-touche22-task2-stance-dataset.txt

AndreeaCoaja commented 2 years ago

This is perfect! I am starting to train 2 models for the 2-step approach, and while they are training I'll work on the prediction script. When I have the results from these 2 models I will share them with you, and we can see whether the 3-step approach is also necessary. This is the first model to be trained: the one you have written, @ashishrana160796 (stance-classification-two-step-touche-22). This is the second model: https://colab.research.google.com/drive/1O-fOrYFo-_3892tMevIurojFTS2hIdQJ?usp=sharing

Update: I oversampled the data for the No and Neutral classes and reran the training.
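
For reference, a minimal sketch of random oversampling. The thread does not say how the oversampling was actually done, so this is only one common way to do it, and the function name and toy data are made up. It duplicates randomly chosen minority-class examples until every class is as large as the biggest one.

```python
import random
from collections import Counter


def oversample(examples, seed=0):
    """Randomly duplicate minority-class examples until all classes are equally large."""
    if not examples:
        return []
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append((text, label))
    target = max(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(items)
        # Draw the missing examples with replacement from the same class.
        balanced.extend(rng.choices(items, k=target - len(items)))
    rng.shuffle(balanced)
    return balanced


# Made-up, deliberately unbalanced data.
data = [("a", "OBJECT")] * 6 + [("b", "NO")] * 2 + [("c", "NEUTRAL")]
balanced = oversample(data)
counts = Counter(label for _, label in balanced)
```

Oversampling is best applied to the training split only, after the validation and test splits have been set aside, so that copies of the same example do not end up on both sides of the evaluation.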

ashishrana160796 commented 2 years ago

cool @AndreeaCoaja, all the best with the results on the stance prediction tasks! & I hope this 2-step modeling approach works well. cool!

ashishrana160796 commented 2 years ago

Stance prediction is working correctly, closing the issue!

softgitron / LeviRank

Preprocess data for stance classification #2