jeniyat / StackOverflowNER

Source Code and Data for Software Domain NER
MIT License
145 stars, 37 forks

Unable to download data #3

Open thefirebanks opened 3 years ago

thefirebanks commented 3 years ago

Hello! First of all, amazing work. I'm looking to play around with the BERT NER model, and following the prerequisite steps I tried downloading data_ctc.zip using Mega. However, I wasn't able to fully download it because I had exceeded my download quota. Is there an alternative way of hosting the data, like a google drive link? Or, is there a way of accessing a pretrained model that we can use directly for predictions on new data? Thank you in advance!

jeniyat commented 3 years ago

Ah, I did not know MEGA has a download limit. I will try to find an alternative free host for these big files.

borijang commented 3 years ago

Hey, I also have the same problem. Any updates with this issue? Thanks.

jeniyat commented 3 years ago

Not yet! We are working on an upgraded version that will not require downloading these large files. The estimated release date is the end of January.

borijang commented 3 years ago

Alright. I managed to download it with the client, but got a "decryption error" right at the end. This happened twice. I was reading more about this error and it is most likely a problem with the upload itself. I will be very grateful if you could re-upload data_ctc.zip anywhere in the meantime. Thanks again.

cuevasclemente commented 3 years ago

I am also running into this issue. I wonder if there are other alternatives that people could use to host the file? If I could get access to the file I would be happy to temporarily host it via my personal Google Drive until you could come up with a solution to hosting the file that you were satisfied with. I currently don't have access to the file though.

cuevasclemente commented 3 years ago

Hey, if you're having trouble getting access to the file, I wrote to MEGA support and got the following response:

Thank you for your support and for using MEGA.

The "Decryption error" means that the file in MEGA was not properly encrypted and now it can't be decrypted when it's downloaded. The problem really happened during the upload of those files, rather than during the download.

Those errors are most likely caused by modifications of the data of files in the network during uploads. That means that our apps could correctly encrypt the data, but it has been somehow modified during the transfer to MEGA storage servers (due to faulty connections or defective network devices for example).

You can try to "recover" those files by opening a web browser Javascript console, inserting the command "skipcheck=1" and clicking Enter. Then start the download.

STEPS:

1. Open your MEGA account in a browser.
2. Just before you are about to click 'Download' of the affected file, press F12 (Option + Cmd + J) to open your JavaScript Console. The screen will divide.
3. Type 'skipcheck=1' in the bottom line of the Console and click 'Enter'.
4. Select 'Download' in your MEGA account window.

The download will be completed, but it is possible that the file will be corrupt. If so please remove it from your account as it will never be decryptable.

cuevasclemente commented 3 years ago

Hey, I was able to download the file following the instructions of the MEGA support person above. Following those instructions (setting skipcheck=1 in their browser's javascript console) should let individuals download the dataset. I am still happy to host the file on my personal Google Drive as well for as long as would be desired as that might be an easier interface.

borijang commented 3 years ago

Hey @cuevasclemente, I managed to download the archive thanks to your instructions. However, unzipping it yields an error for the data/no_eng_char_uniq.bin file. Did you have such a problem? It doesn't seem that we can find a way to make this work given a corrupted zip.
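One way to see which members of a downloaded archive are damaged, independent of the unzip tool, is Python's standard `zipfile` module. The sketch below is illustrative and not part of the repository: the helper name `corrupt_members` and the `data_ctc.zip` path in the usage note are assumptions. It reads every member to the end, which makes `zipfile` verify the stored CRC-32, and collects the names of members that fail.

```python
import zipfile
import zlib
from typing import List


def corrupt_members(archive_path: str) -> List[str]:
    """Return the names of archive members that cannot be read back intact.

    Hypothetical helper for illustration. Raises zipfile.BadZipFile if the
    file cannot be opened as a zip archive at all.
    """
    bad: List[str] = []
    with zipfile.ZipFile(archive_path) as archive:
        for info in archive.infolist():
            try:
                with archive.open(info) as member:
                    # Reading to the end triggers the CRC-32 comparison.
                    while member.read(1 << 20):
                        pass
            except (zipfile.BadZipFile, zlib.error, EOFError):
                # CRC mismatch, broken compressed stream, or truncated data.
                bad.append(info.filename)
    return bad
```

Calling `corrupt_members("data_ctc.zip")` on the downloaded file would list any damaged entries; an empty list means every member passed its check. If only one member is reported, the remaining files can still be extracted individually with `ZipFile.extract`.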

cuevasclemente commented 3 years ago

I didn't have any problems unzipping. I can share a private link to a copy of the zip that I uploaded to Google Drive, but it's possible that something is still wrong with it even though I got no error. Maybe try this: send an email to this temporary address: volamih426@econeom.com (I don't like posting my personal email address on publicly accessible websites) and I'll send you a Google Drive download link for the archive.

cuevasclemente commented 3 years ago

Hi, if you sent an email I didn't get it (and I no longer have access to that email account). Maybe try sending me a direct message on Twitter: @CuevasClemente

cuevasclemente commented 3 years ago

Hi @jeniyat, I'm wondering if you are still planning on making an upgraded version of this? I have been trying to use this model for a while, but it has been a bit tricky to get it running (I still have not personally resolved this issue: https://github.com/jeniyat/StackOverflowNER/issues/5), and the model checkpoint on Hugging Face returns the following message when I try to use it:

Some weights of BertForTokenClassification were not initialized from the model checkpoint at jeniya/BERTOverflow and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

I am grateful that you released the model, code, and training data to the community and if there's anything I can do to help get it working I would be happy to do so.

KareemAlaa2001 commented 3 years ago

Hi, I am also having this problem. @jeniyat it would be great if you could host it on a different platform for everyone. In the meantime, @cuevasclemente could you please send what you have over? I'll DM you on twitter.

mlejva commented 3 years ago

Hello @cuevasclemente, I'm also trying to download the data that aren't accessible anymore. I DM-ed you on Twitter as @mlejva. Could you please share a link to the data with me? Thanks!

RebeWu commented 3 years ago

@cuevasclemente Hi, I am new to this area and am trying to understand this work. Could you share a link to the data with me as well? Thanks very much! My email is: wumnghan@163.com

jeniyat commented 3 years ago

The resources can be found here: https://drive.google.com/drive/folders/1iEEMr2DYofulK2F5pSErOPf5ggrEqtJt?usp=sharing

RebeWu commented 3 years ago

The resources can be found here: https://drive.google.com/drive/folders/1iEEMr2DYofulK2F5pSErOPf5ggrEqtJt?usp=sharing

Thanks very much

vthanhquang commented 3 years ago

I am a student, and our group is trying to run the code for our research. Unfortunately, we cannot download the two files utils_fine_tune.zip and data_ctc.zip mentioned in the README in the 'code' folder. Could you kindly provide a new link or send these files to my email, vthanhquang72882@gmail.com? I can then help upload the files somewhere so that other people can access and download them as well.

Rvlis commented 3 years ago

Hi @cuevasclemente, I am sorry that I don't have a Twitter account, and I still cannot download the .zip files from the authors' Google Drive link because of the error "The download file will exceed the limit, so it cannot be downloaded at this time." Could you share a copy with me via your Google Drive link? My email address is site.rvli@nuaa.edu.cn. Thank you very much.

jeniyat commented 3 years ago

All the data and resources are available here: https://drive.google.com/drive/folders/1iEEMr2DYofulK2F5pSErOPf5ggrEqtJt?usp=sharing

zen93 commented 3 years ago

@jeniyat Hi, I am interested in using your project for my master's thesis. I have a suggestion for you: large files (up to 50 GB) can be hosted on https://zenodo.org/. As far as I know, it does not have a download limit, unlike Google Drive and MEGA. Hope this helps! Thanks for open-sourcing this model!
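Whichever host ends up being used, a checksum published next to the archive would let downloaders detect the kind of silent corruption reported earlier in this thread. The sketch below is only an illustration under that assumption: no checksum for data_ctc.zip has been posted here, and the function name `file_digest` and the MD5 default are choices made for the example.

```python
import hashlib


def file_digest(path: str, algorithm: str = "md5", chunk_size: int = 1 << 20) -> str:
    """Return the hex digest of a file, reading it in chunks.

    Hypothetical helper for illustration; chunked reading keeps memory use
    flat even for multi-gigabyte archives.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A downloader would compare `file_digest("data_ctc.zip")` against the value published by the uploader and re-download on a mismatch. Passing `algorithm="sha256"` gives a stronger hash if the uploader publishes one.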