microsoft / LMOps


[MiniLLM] Processed RoBERTa Corpus dataset download #211

Closed AKaubay closed 2 months ago

AKaubay commented 7 months ago

I am unable to download only the processed RoBERTa Corpus. The download of the full processed_data.tar is also interrupted repeatedly, with an error indicating dead links, possibly caused by an incomplete or corrupt file structure in the compressed archive. I also tried downloading it from my personal computer over a 10 Mbps connection and ran into the same problem. (Attached screenshots: error 1, error 2)

t1101675 commented 6 months ago

It works fine in our environment. Did you start the download by running the following commands?

DLINK=$(echo -n "aHR0cHM6Ly9jb252ZXJzYXRpb25odWIuYmxvYi5jb3JlLndpbmRvd3MubmV0L2JlaXQtc2hhcmUtcHVibGljL01pbmlMTE0vcHJvY2Vzc2VkX2RhdGEudGFyP3N2PTIwMjMtMDEtMDMmc3Q9MjAyNC0wNC0xMFQxMyUzQTExJTNBNDRaJnNlPTIwNTAtMDQtMTFUMTMlM0ExMSUzQTAwWiZzcj1jJnNwPXImc2lnPTRjWEpJalZSWkhJQldxSGpQZ0RuJTJGMDFvY3pwRFdYaXBtUENVazNaOHZiUSUzRA==" | base64 --decode)
wget -O processed_data.tar "$DLINK"
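
For anyone hitting the same interruptions, the following is a minimal sketch, not part of the MiniLLM instructions, of how a partial download might be resumed and the resulting archive checked before extraction. The helper names `fetch_with_resume` and `verify_tar` are hypothetical, and the sketch assumes the server supports ranged requests so that `wget -c` can continue a partial file; if it does not, the download restarts from the beginning.

```shell
#!/usr/bin/env bash
# Sketch only: fetch_with_resume and verify_tar are hypothetical helper names.

# Resume a partially downloaded file instead of starting over.
#   -c          continue from the bytes already on disk (assumes range support)
#   --tries=10  retry up to 10 times on transient failures
fetch_with_resume() {
    local url="$1" out="$2"
    wget -c --tries=10 -O "$out" "$url"
}

# List the archive's contents without extracting anything.
# A nonzero exit status suggests the tar file is truncated or corrupt.
verify_tar() {
    local archive="$1"
    tar -tf "$archive" > /dev/null 2>&1
}

# Usage, with DLINK decoded as in the commands above:
#   fetch_with_resume "$DLINK" processed_data.tar \
#       && verify_tar processed_data.tar \
#       && echo "archive looks intact"
```

Listing the archive with `tar -tf` reads through the whole file, so it can take a while on a large download, but it helps distinguish a broken transfer from a problem inside the archive itself.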

donglixp commented 2 months ago

https://github.com/microsoft/LMOps/blob/main/minillm/README.md

The links have been updated.