microsoft / GLUECoS

A benchmark for code-switched NLP, ACL 2020
https://microsoft.github.io/GLUECoS
MIT License

Trouble with downloading the datasets #88

Open iglee opened 2 years ago

iglee commented 2 years ago

Hi, I'm trying to run ./download_data.sh $SUBSCRIPTION_KEY and ran into some issues. I've tried both the indictrans package and a Microsoft Translator subscription key. With indictrans, I run into errors about ndarray shapes not matching (I think this is an issue with indictrans itself). With the subscription key passed, I get the following output:

Failed to import No module named 'indictrans'
./download_data.sh: line 50: wget: command not found
./download_data.sh: line 54: wget: command not found
./download_data.sh: line 59: wget: command not found
unzip:  cannot find or open /Users/iglee/GLUECoS/Data/Original_Data/LID_EN_HI/temp/ICON_POS.zip, /Users/iglee/GLUECoS/Data/Original_Data/LID_EN_HI/temp/ICON_POS.zip.zip or /Users/iglee/GLUECoS/Data/Original_Data/LID_EN_HI/temp/ICON_POS.zip.ZIP.
Traceback (most recent call last):
  File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_lid_en_hi.py", line 192, in <module>
    main()
  File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_lid_en_hi.py", line 175, in main
    make_temp_file(original_path)
  File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_lid_en_hi.py", line 11, in make_temp_file
    shutil.copy(original_path_validation,new_path_validation)
  File "/Users/iglee/opt/anaconda3/lib/python3.9/shutil.py", line 427, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/Users/iglee/opt/anaconda3/lib/python3.9/shutil.py", line 264, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/iglee/GLUECoS/Data/Original_Data/LID_EN_HI/temp//HindiEnglish_FIRE2013_AnnotatedDev.txt'
Downloaded LID EN HI
./download_data.sh: line 98: wget: command not found
Traceback (most recent call last):
  File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_ner_en_hi.py", line 143, in <module>
    main()
  File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_ner_en_hi.py", line 107, in main
    make_temp_file(original_path)
  File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_ner_en_hi.py", line 11, in make_temp_file
    with open(original_path +'/annotatedData.csv','r',encoding='utf-8')as f:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/iglee/GLUECoS/Data/Original_Data/NER_EN_HI/temp//annotatedData.csv'
Downloaded NER EN HI
./download_data.sh: line 221: wget: command not found
./download_data.sh: line 224: wget: command not found
./download_data.sh: line 227: wget: command not found
./download_data.sh: line 230: wget: command not found
./download_data.sh: line 233: wget: command not found
./download_data.sh: line 236: wget: command not found
./download_data.sh: line 239: wget: command not found
./download_data.sh: line 242: wget: command not found
./download_data.sh: line 245: wget: command not found
./download_data.sh: line 248: wget: command not found
./download_data.sh: line 251: wget: command not found
./download_data.sh: line 254: wget: command not found
./download_data.sh: line 257: wget: command not found
./download_data.sh: line 260: wget: command not found
./download_data.sh: line 263: wget: command not found
./download_data.sh: line 266: wget: command not found
./download_data.sh: line 269: wget: command not found
./download_data.sh: line 272: wget: command not found
./download_data.sh: line 275: wget: command not found
./download_data.sh: line 278: wget: command not found
./download_data.sh: line 281: wget: command not found
./download_data.sh: line 284: wget: command not found
./download_data.sh: line 287: wget: command not found
./download_data.sh: line 290: wget: command not found
./download_data.sh: line 293: wget: command not found
./download_data.sh: line 296: wget: command not found
./download_data.sh: line 299: wget: command not found
./download_data.sh: line 302: wget: command not found
./download_data.sh: line 305: wget: command not found
./download_data.sh: line 308: wget: command not found
./download_data.sh: line 311: wget: command not found
./download_data.sh: line 314: wget: command not found
./download_data.sh: line 317: wget: command not found
./download_data.sh: line 320: wget: command not found
./download_data.sh: line 323: wget: command not found
./download_data.sh: line 326: wget: command not found
./download_data.sh: line 329: wget: command not found
./download_data.sh: line 332: wget: command not found
./download_data.sh: line 335: wget: command not found
./download_data.sh: line 338: wget: command not found
./download_data.sh: line 341: wget: command not found
./download_data.sh: line 344: wget: command not found
./download_data.sh: line 347: wget: command not found
./download_data.sh: line 350: wget: command not found
./download_data.sh: line 353: wget: command not found
./download_data.sh: line 356: wget: command not found
./download_data.sh: line 359: wget: command not found
./download_data.sh: line 362: wget: command not found
./download_data.sh: line 365: wget: command not found
./download_data.sh: line 368: wget: command not found
./download_data.sh: line 371: wget: command not found
./download_data.sh: line 374: wget: command not found
./download_data.sh: line 377: wget: command not found
./download_data.sh: line 380: wget: command not found
./download_data.sh: line 383: wget: command not found
./download_data.sh: line 386: wget: command not found
Traceback (most recent call last):
  File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_pos_en_es.py", line 104, in <module>
    main()
  File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_pos_en_es.py", line 90, in main
    make_split_file(id_dir+'/train_ids.txt','temp_word.txt',new_path+'/train.txt',mode='train')
  File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_pos_en_es.py", line 45, in make_split_file
    with open(input_file,'r') as infile:
FileNotFoundError: [Errno 2] No such file or directory: 'temp_word.txt'
Downloaded POS EN ES
./download_data.sh: line 140: wget: command not found
unzip:  cannot find or open /Users/iglee/GLUECoS/Data/Original_Data/POS_EN_HI_FG/temp/ICON_POS.zip, /Users/iglee/GLUECoS/Data/Original_Data/POS_EN_HI_FG/temp/ICON_POS.zip.zip or /Users/iglee/GLUECoS/Data/Original_Data/POS_EN_HI_FG/temp/ICON_POS.zip.ZIP.
Traceback (most recent call last):
  File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_pos_en_hi_fg.py", line 63, in <module>
    main()
  File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_pos_en_hi_fg.py", line 43, in main
    shutil.copy(original_path+'Romanized/train.txt',new_path+'Romanized/train.txt')
  File "/Users/iglee/opt/anaconda3/lib/python3.9/shutil.py", line 427, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/Users/iglee/opt/anaconda3/lib/python3.9/shutil.py", line 264, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/iglee/GLUECoS/Data/Original_Data/POS_EN_HI_FG/temp/ICON_POS/Processed Data/Romanized/train.txt'
Downloaded POS EN HI FG
./download_data.sh: line 171: wget: command not found
unzip:  cannot find or open /Users/iglee/GLUECoS/Data/Original_Data/Sentiment_EN_HI/temp/SAIL_2017.zip, /Users/iglee/GLUECoS/Data/Original_Data/Sentiment_EN_HI/temp/SAIL_2017.zip.zip or /Users/iglee/GLUECoS/Data/Original_Data/Sentiment_EN_HI/temp/SAIL_2017.zip.ZIP.
Traceback (most recent call last):
  File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_sent_en_hi.py", line 63, in <module>
    main()
  File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_sent_en_hi.py", line 43, in main
    shutil.copy(original_path+'Romanized/train.txt',new_path+'Romanized/train.txt')
  File "/Users/iglee/opt/anaconda3/lib/python3.9/shutil.py", line 427, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/Users/iglee/opt/anaconda3/lib/python3.9/shutil.py", line 264, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/iglee/GLUECoS/Data/Original_Data/Sentiment_EN_HI/temp/SAIL_2017/Processed Data/Romanized/train.txt'
Downloaded Sentiment EN HI
./download_data.sh: line 187: wget: command not found
Downloaded QA EN HI
./download_data.sh: line 113: wget: command not found
unzip:  cannot find or open /Users/iglee/GLUECoS/Data/Original_Data/POS_EN_HI_UD/temp/master.zip, /Users/iglee/GLUECoS/Data/Original_Data/POS_EN_HI_UD/temp/master.zip.zip or /Users/iglee/GLUECoS/Data/Original_Data/POS_EN_HI_UD/temp/master.zip.ZIP.
(Patch is indented 4 spaces.)
patch: **** Can't find file /Users/iglee/GLUECoS/Data/Original_Data/POS_EN_HI_UD/temp/UD_Hindi_English-master/crawl_tweets.py : No such file or directory
Traceback (most recent call last):
  File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_pos_en_hi_ud.py", line 179, in <module>
    main()
  File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_pos_en_hi_ud.py", line 164, in main
    scrape_tweets(original_path)
  File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_pos_en_hi_ud.py", line 16, in scrape_tweets
    os.chdir(original_path)
FileNotFoundError: [Errno 2] No such file or directory: '/Users/iglee/GLUECoS/Data/Original_Data/POS_EN_HI_UD/temp/UD_Hindi_English-master'
Downloaded POS EN HI UD
./download_data.sh: line 200: wget: command not found
unzip:  cannot find or open /Users/iglee/GLUECoS/Data/Original_Data/NLI_EN_HI/temp/all_keys_json.zip, /Users/iglee/GLUECoS/Data/Original_Data/NLI_EN_HI/temp/all_keys_json.zip.zip or /Users/iglee/GLUECoS/Data/Original_Data/NLI_EN_HI/temp/all_keys_json.zip.ZIP.
./download_data.sh: line 207: wget: command not found
./download_data.sh: line 207: wget: command not found
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/iglee/opt/anaconda3/lib/python3.9/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/Users/iglee/opt/anaconda3/lib/python3.9/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/Users/iglee/opt/anaconda3/lib/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Users/iglee/opt/anaconda3/lib/python3.9/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Traceback (most recent call last):
  File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_nli_en_hi.py", line 125, in <module>
    main()
  File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_nli_en_hi.py", line 113, in main
    process_files(original_path+'all_keys_json/Final_Key.json',args.data_dir+'/NLI_EN_HI/temp/all_only_id.json')
  File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_nli_en_hi.py", line 12, in process_files
    with open(final_key_path,'r') as infile:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/iglee/GLUECoS/Data/Original_Data/NLI_EN_HI/temp/all_keys_json/Final_Key.json'
Downloaded NLI EN HI
./download_data.sh: line 156: wget: command not found
Traceback (most recent call last):
  File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_sent_en_es.py", line 218, in <module>
    main()
  File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_sent_en_es.py", line 197, in main
    download_tweets(tweet_keys,original_path)
  File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_sent_en_es.py", line 12, in download_tweets
    lines = [line.strip() for line in open(original_path_text,'r').readlines()]
FileNotFoundError: [Errno 2] No such file or directory: '/Users/iglee/GLUECoS/Data/Original_Data/Sentiment_EN_ES/temp//cs-en-es-corpus-wassa2015.txt'
Downloaded Sentiment EN ES
./download_data.sh: line 26: wget: command not found
./download_data.sh: line 29: wget: command not found
./download_data.sh: line 33: wget: command not found
unzip:  cannot find or open /Users/iglee/GLUECoS/Data/Original_Data/LID_EN_ES/temp/Release.zip, /Users/iglee/GLUECoS/Data/Original_Data/LID_EN_ES/temp/Release.zip.zip or /Users/iglee/GLUECoS/Data/Original_Data/LID_EN_ES/temp/Release.zip.ZIP.
Traceback (most recent call last):
  File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_lid_en_es.py", line 168, in <module>
    main()
  File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_lid_en_es.py", line 139, in main
    download_tweets(original_path)
  File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_lid_en_es.py", line 13, in download_tweets
    shutil.copy('twitter_authentication.txt',original_path+'/Release/twitter_auth.txt')
  File "/Users/iglee/opt/anaconda3/lib/python3.9/shutil.py", line 427, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/Users/iglee/opt/anaconda3/lib/python3.9/shutil.py", line 266, in copyfile
    with open(dst, 'wb') as fdst:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/iglee/GLUECoS/Data/Original_Data/LID_EN_ES/temp//Release/twitter_auth.txt'
Downloaded LID EN ES
./download_data.sh: line 75: wget: command not found
./download_data.sh: line 78: wget: command not found
./download_data.sh: line 82: wget: command not found
unzip:  cannot find or open /Users/iglee/GLUECoS/Data/Original_Data/NER_EN_ES/temp/Release.zip, /Users/iglee/GLUECoS/Data/Original_Data/NER_EN_ES/temp/Release.zip.zip or /Users/iglee/GLUECoS/Data/Original_Data/NER_EN_ES/temp/Release.zip.ZIP.
Traceback (most recent call last):
  File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_ner_en_es.py", line 152, in <module>
    main()
  File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_ner_en_es.py", line 134, in main
    download_tweets(original_path)
  File "/Users/iglee/GLUECoS/Data/Preprocess_Scripts/preprocess_ner_en_es.py", line 13, in download_tweets
    shutil.copy('twitter_authentication.txt',original_path+'/Release/twitter_auth.txt')
  File "/Users/iglee/opt/anaconda3/lib/python3.9/shutil.py", line 427, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/Users/iglee/opt/anaconda3/lib/python3.9/shutil.py", line 266, in copyfile
    with open(dst, 'wb') as fdst:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/iglee/GLUECoS/Data/Original_Data/NER_EN_ES/temp//Release/twitter_auth.txt'
Downloaded NER EN ES

I'm wondering why it's still trying to use indictrans even though I passed the subscription key. If someone could help me with this, I'd really appreciate it. Thanks!
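
A note on the log above: every download step fails with "wget: command not found", and the later unzip, patch, and FileNotFoundError messages appear to follow from files that were never downloaded. The /Users/... paths suggest macOS, which does not ship wget by default, so installing it (for example with Homebrew, `brew install wget`) is a likely first step. A minimal pre-flight check is sketched below. The tool list is taken from the error messages in the log, not from reading download_data.sh, and the names `REQUIRED_TOOLS` and `missing_tools` are hypothetical.

```python
import shutil

# Command-line tools named in the log above (an assumption drawn from the
# error messages; download_data.sh may need others as well).
REQUIRED_TOOLS = ("wget", "unzip", "patch")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


# Report what is missing before starting the download script.
missing = missing_tools()
if missing:
    print("Install before running download_data.sh:", ", ".join(missing))
else:
    print("All listed command-line tools were found.")
```

Running this before the download script shows in one line which tools are absent, rather than one error per wget call.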

Guruprasad68 commented 1 year ago

I followed the README and installed indictrans according to https://github.com/libindic/indic-trans#clone--install. It works fine. However, many of the tweets don't seem to exist in my case.
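
On the earlier question about indictrans: the first line of the original log, "Failed to import No module named 'indictrans'", is consistent with a guarded import that prints a notice and continues, rather than with the script actually using indictrans. This is an assumption about how the preprocessing scripts behave, not something confirmed from the repository's code. The sketch below shows that general pattern; `optional_import` and `use_local_transliterator` are hypothetical names.

```python
import importlib


def optional_import(module_name):
    """Import `module_name`; print a notice and return None if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Same message form as the first line of the log in the report.
        print("Failed to import", exc)
        return None


# Hypothetical usage: fall back to another transliteration route
# (for example a hosted API) when the local package is unavailable.
indictrans = optional_import("indictrans")
use_local_transliterator = indictrans is not None
```

With a pattern like this, the message is informational and appears whether or not a subscription key is passed, so it would not by itself explain the failed downloads.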