Error when getting the baselines with sequence_labeling

MinionAttack commented 3 years ago

Hi, I'm trying to run get_baselines.sh inside baselines/sequence_labeling, but I get these errors while the script runs:

Namespace(ANNOTATION='targets', BATCH_SIZE=50, DATADIR='darmstadt_unis', DEVDATA=False, EMBEDDINGS='../graph_parser/embeddings/18.zip', HIDDEN_DIM=100, NUM_LAYERS=1, OUTDIR='saved_models', TRAIN_EMBEDDINGS=False, save_all=False)
Traceback (most recent call last):
  File "extraction_module.py", line 198, in <module>
    annotation=annotation
  File "/home/iago/Escritorio/SemEval-2022 Shared Task 10: Structured Sentiment Analysis/baselines/sequence_labeling/utils.py", line 215, in get_split
    return Split(self.open_split(filename, lower_case, annotation=annotation))
  File "/home/iago/Escritorio/SemEval-2022 Shared Task 10: Structured Sentiment Analysis/baselines/sequence_labeling/utils.py", line 201, in open_split
    torch.LongTensor(self.label2idx.labels2idxs(item.targets, annotation="targets"))) for item in data]
  File "/home/iago/Escritorio/SemEval-2022 Shared Task 10: Structured Sentiment Analysis/baselines/sequence_labeling/utils.py", line 201, in <listcomp>
    torch.LongTensor(self.label2idx.labels2idxs(item.targets, annotation="targets"))) for item in data]
  File "/home/iago/Escritorio/SemEval-2022 Shared Task 10: Structured Sentiment Analysis/baselines/sequence_labeling/utils.py", line 98, in labels2idxs
    return [self.label2idx[annotation][label] for label in labels]
  File "/home/iago/Escritorio/SemEval-2022 Shared Task 10: Structured Sentiment Analysis/baselines/sequence_labeling/utils.py", line 98, in <listcomp>
    return [self.label2idx[annotation][label] for label in labels]
KeyError: 'B-targ-positive'
Namespace(ANNOTATION='expressions', BATCH_SIZE=50, DATADIR='darmstadt_unis', DEVDATA=False, EMBEDDINGS='../graph_parser/embeddings/18.zip', HIDDEN_DIM=100, NUM_LAYERS=1, OUTDIR='saved_models', TRAIN_EMBEDDINGS=False, save_all=False)
Traceback (most recent call last):
  File "extraction_module.py", line 198, in <module>
    annotation=annotation
  File "/home/iago/Escritorio/SemEval-2022 Shared Task 10: Structured Sentiment Analysis/baselines/sequence_labeling/utils.py", line 215, in get_split
    return Split(self.open_split(filename, lower_case, annotation=annotation))
  File "/home/iago/Escritorio/SemEval-2022 Shared Task 10: Structured Sentiment Analysis/baselines/sequence_labeling/utils.py", line 205, in open_split
    torch.LongTensor(self.label2idx.labels2idxs(item.expressions, annotation="expressions"))) for item in data]
  File "/home/iago/Escritorio/SemEval-2022 Shared Task 10: Structured Sentiment Analysis/baselines/sequence_labeling/utils.py", line 205, in <listcomp>
    torch.LongTensor(self.label2idx.labels2idxs(item.expressions, annotation="expressions"))) for item in data]
  File "/home/iago/Escritorio/SemEval-2022 Shared Task 10: Structured Sentiment Analysis/baselines/sequence_labeling/utils.py", line 98, in labels2idxs
    return [self.label2idx[annotation][label] for label in labels]
  File "/home/iago/Escritorio/SemEval-2022 Shared Task 10: Structured Sentiment Analysis/baselines/sequence_labeling/utils.py", line 98, in <listcomp>
    return [self.label2idx[annotation][label] for label in labels]
KeyError: 'B-exp-negative'
Namespace(BATCH_SIZE=50, DATADIR='darmstadt_unis', DEVDATA=False, EMBEDDINGS='../graph_parser/embeddings/18.zip', HIDDEN_DIM=100, LEARNING_RATE=0.001, NUM_LAYERS=1, OUTDIR='saved_models', POOLING='max', TRAIN_EMBEDDINGS=False, save_all=False)

And:

Namespace(ANNOTATION='sources', BATCH_SIZE=50, DATADIR='mpqa', DEVDATA=False, EMBEDDINGS='../graph_parser/embeddings/18.zip', HIDDEN_DIM=100, NUM_LAYERS=1, OUTDIR='saved_models', TRAIN_EMBEDDINGS=False, save_all=False)
Traceback (most recent call last):
  File "extraction_module.py", line 198, in <module>
    annotation=annotation
  File "/home/iago/Escritorio/SemEval-2022 Shared Task 10: Structured Sentiment Analysis/baselines/sequence_labeling/utils.py", line 215, in get_split
    return Split(self.open_split(filename, lower_case, annotation=annotation))
  File "/home/iago/Escritorio/SemEval-2022 Shared Task 10: Structured Sentiment Analysis/baselines/sequence_labeling/utils.py", line 193, in open_split
    data = torchtext.data.TabularDataset(data_file, format="json", fields={"sent_id": ("sent_id", sent_id), "text": ("text", text), "sources": ("sources", sources), "targets": ("targets", targets), "expressions": ("expressions", expressions)})
  File "/home/iago/anaconda3/envs/syncap/lib/python3.6/site-packages/torchtext/data/dataset.py", line 251, in __init__
    with io.open(os.path.expanduser(path), encoding="utf8") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/extraction/mpqa/train.json'
Namespace(ANNOTATION='targets', BATCH_SIZE=50, DATADIR='mpqa', DEVDATA=False, EMBEDDINGS='../graph_parser/embeddings/18.zip', HIDDEN_DIM=100, NUM_LAYERS=1, OUTDIR='saved_models', TRAIN_EMBEDDINGS=False, save_all=False)
Traceback (most recent call last):
  File "extraction_module.py", line 198, in <module>
    annotation=annotation
  File "/home/iago/Escritorio/SemEval-2022 Shared Task 10: Structured Sentiment Analysis/baselines/sequence_labeling/utils.py", line 215, in get_split
    return Split(self.open_split(filename, lower_case, annotation=annotation))
  File "/home/iago/Escritorio/SemEval-2022 Shared Task 10: Structured Sentiment Analysis/baselines/sequence_labeling/utils.py", line 193, in open_split
    data = torchtext.data.TabularDataset(data_file, format="json", fields={"sent_id": ("sent_id", sent_id), "text": ("text", text), "sources": ("sources", sources), "targets": ("targets", targets), "expressions": ("expressions", expressions)})
  File "/home/iago/anaconda3/envs/syncap/lib/python3.6/site-packages/torchtext/data/dataset.py", line 251, in __init__
    with io.open(os.path.expanduser(path), encoding="utf8") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/extraction/mpqa/train.json'
Namespace(ANNOTATION='expressions', BATCH_SIZE=50, DATADIR='mpqa', DEVDATA=False, EMBEDDINGS='../graph_parser/embeddings/18.zip', HIDDEN_DIM=100, NUM_LAYERS=1, OUTDIR='saved_models', TRAIN_EMBEDDINGS=False, save_all=False)
Traceback (most recent call last):
  File "extraction_module.py", line 198, in <module>
    annotation=annotation
  File "/home/iago/Escritorio/SemEval-2022 Shared Task 10: Structured Sentiment Analysis/baselines/sequence_labeling/utils.py", line 215, in get_split
    return Split(self.open_split(filename, lower_case, annotation=annotation))
  File "/home/iago/Escritorio/SemEval-2022 Shared Task 10: Structured Sentiment Analysis/baselines/sequence_labeling/utils.py", line 193, in open_split
    data = torchtext.data.TabularDataset(data_file, format="json", fields={"sent_id": ("sent_id", sent_id), "text": ("text", text), "sources": ("sources", sources), "targets": ("targets", targets), "expressions": ("expressions", expressions)})
  File "/home/iago/anaconda3/envs/syncap/lib/python3.6/site-packages/torchtext/data/dataset.py", line 251, in __init__
    with io.open(os.path.expanduser(path), encoding="utf8") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/extraction/mpqa/train.json'
Namespace(BATCH_SIZE=50, DATADIR='mpqa', DEVDATA=False, EMBEDDINGS='../graph_parser/embeddings/18.zip', HIDDEN_DIM=100, LEARNING_RATE=0.001, NUM_LAYERS=1, OUTDIR='saved_models', POOLING='max', TRAIN_EMBEDDINGS=False, save_all=False)
loading embeddings from ../graph_parser/embeddings/18.zip
Traceback (most recent call last):
  File "relation_prediction_module.py", line 269, in <module>
    "train.json"))
  File "/home/iago/Escritorio/SemEval-2022 Shared Task 10: Structured Sentiment Analysis/baselines/sequence_labeling/utils.py", line 154, in get_split
    return RelationSplit(self.open_split(filename, lower_case))
  File "/home/iago/Escritorio/SemEval-2022 Shared Task 10: Structured Sentiment Analysis/baselines/sequence_labeling/utils.py", line 145, in open_split
    data = torchtext.data.TabularDataset(data_file, format="json", fields={"sent_id": ("sent_id", sent_id), "text": ("text", text), "e1": ("e1", e1), "e2": ("e2", e2), "label": ("label", label)})
  File "/home/iago/anaconda3/envs/syncap/lib/python3.6/site-packages/torchtext/data/dataset.py", line 251, in __init__
    with io.open(os.path.expanduser(path), encoding="utf8") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/relations/mpqa/train.json'
Namespace(ANNOTATION='sources', BATCH_SIZE=50, DATADIR='multibooked_ca', DEVDATA=False, EMBEDDINGS='../graph_parser/embeddings/34.zip', HIDDEN_DIM=100, NUM_LAYERS=1, OUTDIR='saved_models', TRAIN_EMBEDDINGS=False, save_all=False)

Regards.

jerbarnes commented 3 years ago

Hi Iago,

Looks like this error was due to an inconsistency in the polarity labels. The baseline code expects the polarity label to be in title case ('negative' -> 'Negative'). The same error was also present in MPQA. I've now corrected this in the preprocessing, but you'll have to rerun the preprocessing commands:

cd data/mpqa
bash process_mpqa.sh
cd ../darmstadt_unis
bash process_darmstadt.sh

After that, you should be able to run get_baselines.sh without a problem. Let me know if it works on your end.
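
For illustration, the mismatch behind the KeyError can be reproduced in a few lines. This is only a minimal sketch, not the baseline's code: the `label2idx` dictionary and the `normalize_label` helper are hypothetical, and the label format is assumed from the traceback ('B-targ-positive').

```python
# Hypothetical label vocabulary with title-case polarities,
# which is what the baseline code is assumed to expect.
label2idx = {
    "O": 0,
    "B-targ-Positive": 1,
    "I-targ-Positive": 2,
    "B-targ-Negative": 3,
    "I-targ-Negative": 4,
}


def normalize_label(label: str) -> str:
    """Title-case the polarity suffix of a BIO label.

    'B-targ-positive' -> 'B-targ-Positive'; 'O' is returned unchanged.
    """
    parts = label.split("-")
    if len(parts) == 3:
        parts[2] = parts[2].capitalize()
    return "-".join(parts)


labels = ["O", "B-targ-positive", "I-targ-positive"]

# Looking up the raw lower-case labels fails like in the traceback.
try:
    [label2idx[label] for label in labels]
except KeyError as err:
    print("KeyError:", err)

# After normalization the lookup succeeds.
idxs = [label2idx[normalize_label(label)] for label in labels]
print(idxs)
```

Rerunning the preprocessing is still the actual fix, since it regenerates the data files with the label casing the baselines read.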

MinionAttack commented 3 years ago

Hi,

MPQA works fine, but I'm having problems with darmstadt when it reads universities.zip.

extracting: universities/basedata/DeVry_University_27_12-08-2007.txt  
  inflating: universities/basedata/DeVry_University_27_12-08-2007_words.xml  
  inflating: universities/basedata/DeVry_University_29_11-15-2007.txt  
  inflating: universities/basedata/DeVry_University_29_11-15-2007_words.xml  
  inflating: universities/basedata/DeVry_University_30_10-26-2007.txt  
  ..........
  inflating: universities/markables/University_of_Phoenix_Online_189_07-25-2004_SentenceOpinionAnalysisResult_level.xml  
  inflating: universities/customization/OpinionExpression_customization.xml  
  inflating: universities/customization/SentenceOpinionAnalysisResult_customization.xml  
sed: can't read : No such file or directory

Regards.

jerbarnes commented 3 years ago

Ok, the darmstadt error came about because of a fix for OSX users. I've corrected it now.

MinionAttack commented 3 years ago

Yes, that error has disappeared, but it seems to have been masking another one:

Traceback (most recent call last):
  File "process_darmstadt.py", line 475, in <module>
    o = get_opinions(bfile, mfile)
  File "process_darmstadt.py", line 113, in get_opinions
    text += token + " "
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

I don't know if it's related to the data. I printed the idx and token variables, and just before the error the output is:

word_128
I
word_129
also
word_130
was
word_131
duped
word_132
with
word_133
my
word_134
financial
word_135
aid
word_136
None
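
A `None` token like this is what an XML parser such as Python's ElementTree returns for an element with no text content. The sketch below shows one way to locate such tokens instead of crashing on the concatenation. It assumes the tokens come from the text of XML elements; the `<word>` element name and the sample fragment are hypothetical, not taken from process_darmstadt.py.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the last <word> element has no text content.
xml = """<words>
  <word id="word_135">aid</word>
  <word id="word_136"></word>
</words>"""

root = ET.fromstring(xml)

text = ""
for word in root.iter("word"):
    token = word.text  # None for an element without text
    if token is None:
        # Report the offending id instead of failing on `text += token + " "`.
        print("empty token at", word.get("id"))
        continue
    text += token + " "

print(repr(text))
```

Skipping the token only hides the symptom. The printed id points at the element in the *_words.xml file whose content went missing, which is the part worth inspecting.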

jerbarnes commented 3 years ago

Sorry, there was a bug in the sed command :/ How about now?

MinionAttack commented 3 years ago

Now it works! Thanks.

jerbarnes / semeval22_structured_sentiment

Error when getting the baselines with sequence_labeling #6