polm / unidic-py

Unidic packaged for installation via pip.
MIT License

python3.11 -m unidic download fails due to 403 error from Github #16

Open Plusarc opened 2 weeks ago

Plusarc commented 2 weeks ago
$ python3.11 -m unidic download
✘ Server error (403)
Couldn't fetch dictionary info. If this error persists please open an issue.

Fetching https://raw.githubusercontent.com/polm/unidic-py/master/dicts.json with curl returns the response correctly. When I changed the User-Agent header of the request, it worked, so GitHub may be blocking requests based on that header.

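import requests

# Send a browser-like User-Agent instead of the default one used by requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}
response = requests.get(
    "https://raw.githubusercontent.com/polm/unidic-py/master/dicts.json",
    headers=headers
)
print(response.status_code)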

We may need to update the request header to avoid being blocked by GitHub. Thanks.
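To check whether the User-Agent header is really what triggers the 403, the same URL can be probed with different header values from the affected machine. The following is only a rough diagnostic sketch using the standard library; `probe_status` is a hypothetical helper and not part of unidic-py, and the `opener` parameter exists only so the function can be exercised without network access.

```python
import urllib.error
import urllib.request


def probe_status(url, user_agent, opener=urllib.request.urlopen):
    """Return the HTTP status code received for `url` with the given User-Agent.

    Hypothetical diagnostic helper, not part of unidic-py.
    """
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with opener(req) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; report their code instead
        return err.code
```

Calling it once with a `python-requests/...` style value (the form the requests library sends by default) and once with a curl-style value on the EC2 instance would show whether the two are treated differently.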

yoonseung-riiid commented 1 week ago

Same issue here.

polm commented 1 week ago

I can't reproduce this issue, things work as usual for me. Is this error persistent for you, or is it intermittent?

If there has been a change in Github policy that caused this, I need to find alternatives for hosting the file - spoofing the user agent for Github is not a solution.

Plusarc commented 1 week ago

Hi, polm. Thanks for looking into this. On the AWS EC2 instance, the error is persistent. On my local machine, it works fine so far.

So I suspect GitHub may have a new policy to detect bots/crawlers coming from AWS EC2 instances. Yes, it would be great to have an alternative for hosting the file. Thanks.

polm commented 1 week ago

Thank you for the extra information that this is on an EC2 instance; that makes sense. I can definitely add a parameter to specify a local file or separate URL.

It might take me a little while to implement this, but I would also be happy to accept a PR.

As a short-term workaround, besides changing your headers, you can change the download URL in the source in your local installation, or rewrite the function where it's used.
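For illustration, a parameter that accepts either a local file or a separate URL could look roughly like the sketch below. This is only an assumption about one possible shape, not the actual unidic-py implementation: `load_dict_info` and `DEFAULT_INFO_URL` are hypothetical names, and the sketch uses the standard library rather than whatever HTTP client the package uses.

```python
import json
import os
import urllib.request

# The dictionary info URL mentioned in this thread
DEFAULT_INFO_URL = "https://raw.githubusercontent.com/polm/unidic-py/master/dicts.json"


def load_dict_info(source=DEFAULT_INFO_URL):
    """Load dictionary info from a local JSON file or from a URL.

    Hypothetical helper, not part of unidic-py.
    """
    if os.path.isfile(source):
        # A local copy avoids the network request entirely
        with open(source, encoding="utf-8") as f:
            return json.load(f)
    # Otherwise treat the source as a URL (default or user-supplied mirror)
    with urllib.request.urlopen(source) as resp:
        return json.load(resp)
```

With something like this, a user on a blocked host could fetch dicts.json on another machine, copy it over, and pass its path instead of the default URL.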

yoonseung-riiid commented 1 week ago

I decided to work around it and download the dictionary directly. The version and URL are taken from https://raw.githubusercontent.com/polm/unidic-py/master/dicts.json, which is referenced in the code, as polm mentioned.

python -c "import urllib.request; import unidic; import os; from unidic.download import download_and_clean; opener = urllib.request.build_opener(); opener.addheaders = [('User-Agent', 'Mozilla/5.0')]; urllib.request.install_opener(opener); download_and_clean('3.1.0+2021-08-31', 'https://cotonoha-dic.s3-ap-northeast-1.amazonaws.com/unidic-3.1.0.zip')"