jind11 / TextFooler

A Model for Natural Language Attack on Text Classification and Inference
MIT License

Questions about string cleaning #38

Open Opdoop opened 3 years ago

Opdoop commented 3 years ago

Thanks for this solid work. In clean_str, it seems that every dataset is lowercased except for TREC, but the example sentences in Table 6 are cased. This looks like a conflict to me. https://github.com/jind11/TextFooler/blob/6aeec20f9fd37f5865e580de669e1263a7cd49d3/dataloader.py#L10 Also, clean_str says "Tokenization/string cleaning for all datasets except for SST." Did you train the model on a cleaned, uncased dataset but test it on a cased, raw dataset? Yet the 1000-example split in 'data' is uncased, so I'm really confused. Is there something I have missed? I apologize for asking directly before going through your code in detail. A clarification would be very generous and helpful. Thanks in advance~

jind11 commented 3 years ago

Hi, I am so sorry for the late response. The attack in the experiments is actually conducted on uncased text; I converted the text to cased form in Table 6 only so that it looks better in the paper. Let me know if you have more questions.
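
For readers who land on this thread, the distinction in the answer above (uncased text for the model and the attack, cased text only for presentation) can be sketched roughly as follows. This is a minimal illustration, not the repository's clean_str: the names clean_text, keep_case, and recase_for_display are hypothetical, and the cleaning step here only collapses whitespace and lowercases.

```python
import re


def clean_text(text: str, keep_case: bool = False) -> str:
    """Hypothetical, simplified stand-in for a clean_str-style helper."""
    # Collapse runs of whitespace into single spaces and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    # Lowercase unless the caller explicitly asks to preserve case
    # (assumed here to mirror the "except for TREC" behaviour in the question).
    return text if keep_case else text.lower()


def recase_for_display(text: str) -> str:
    """Capitalize the first character for presentation only (e.g. a paper table)."""
    return text[:1].upper() + text[1:]


raw = "The  Characters are  GREAT"
model_input = clean_text(raw)              # what the model and the attack would see
display = recase_for_display(model_input)  # what a reader of the table would see
print(model_input)
print(display)
```

The point of keeping the two steps separate is that recasing happens after the attack and never feeds back into the model input, so cased examples in a table do not imply that the model was evaluated on cased text.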