upbit / pixivpy

Pixiv API for Python
https://pypi.org/project/PixivPy3/#files
The Unlicense

Should pixivpy handle webp images? #103

Open Nachtalb opened 4 years ago

Nachtalb commented 4 years ago

I noticed that Pixiv sometimes delivers wrong file extensions. The URLs it returns sometimes contain a scale such as 600x1200_90_webp, but the filename in the URL still ends in .jpg. Here is an example:

I know it's an error in the API Pixiv provides. But because the API endpoints this framework uses are extracted from the Pixiv Android app etc. and are not official, I don't think there is an official place to ask them to fix it.

So my question is: should pixivpy save the image with the file extension from the URL, or with the one implied by the content-type in the response headers, which gives us the correct image type?

upbit commented 4 years ago

Using the content-type to determine the suffix is better: although client/chrome is compatible, some libraries may still determine the file type based on the suffix.

Maybe we need a mapping? content-type -> file_suffix

Nachtalb commented 4 years ago

although client/chrome is compatible

Yeah because it already looks for the content type and not the extension XD.

Maybe we need a mapping? content-type -> file_suffix

I can do this if you want. One question remains, however. Is this the only case or do you know about others as well, like serving as png or so? Depending on that the size of the mapping varies. Either only webp and jpeg or more 😕

eg.

content_type_mapping = {
    'image/webp': 'webp',
    'image/jpg': 'jpg',
    'image/jpeg': 'jpg',
    'image/png': 'png',
    'image/gif': 'gif',
}

To get a list of most MIME type <=> file type mappings for images, you can run this command:

wget -qO- http://svn.apache.org/repos/asf/httpd/httpd/trunk/docs/conf/mime.types | egrep -v ^# | awk '{ for (i=2; i<=NF; i++) {print $i" "$1}}' | sort | grep 'image/'

upbit commented 4 years ago

  1. If the content type exists in content_type_mapping, use the mapped suffix.
  2. If it does not, use the URL filename suffix (as the default).
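
These two rules could be sketched roughly as follows. The helper name pick_suffix is hypothetical and not part of pixivpy; the mapping is the one proposed earlier in the thread, whose values carry no leading dot, so the sketch adds one.

```python
import os
from urllib.parse import urlparse

# Mapping proposed earlier in the thread; values carry no leading dot.
content_type_mapping = {
    'image/webp': 'webp',
    'image/jpg': 'jpg',
    'image/jpeg': 'jpg',
    'image/png': 'png',
    'image/gif': 'gif',
}


def pick_suffix(url, content_type):
    """Hypothetical helper: return a suffix such as '.webp' for a download."""
    # Drop parameters such as "; charset=..." and normalise the case.
    key = (content_type or '').split(';')[0].strip().lower()
    if key in content_type_mapping:
        # Rule 1: the content type is known, so use the mapped suffix.
        return '.' + content_type_mapping[key]
    # Rule 2: otherwise fall back to the suffix of the URL filename.
    return os.path.splitext(urlparse(url).path)[1]
```

Parsing the URL first keeps a query string, if there is one, out of the fallback suffix.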

Nachtalb commented 4 years ago

If the content type exists in content_type_mapping, use the mapped suffix. If it does not, use the URL filename suffix (as the default).

I'll do that

Nachtalb commented 4 years ago

ATM we check whether the file already exists. If the user does not give a name for the file, we infer the name from the URL. When the user has not given the replace flag and the file already exists, we stop and return False. https://github.com/upbit/pixivpy/blob/3ed6851da14acca3e34ca4587afe05549252c0c1/pixivpy3/api.py#L152

If we infer the name from the content-type, we need to download the file before we can check whether it already exists on the file system. Is that ok with you? The implication is that we may have to wait for the download just to tell the user the file already exists. :/

To mitigate that we could infer the file extension from the given scale but I am not sure if there are any other scales like WIDTHxHEIGHT_webp.
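
A rough sketch of that idea, under the assumption that the scale always appears in the URL as a token shaped like 600x1200_90_webp. The helper name ext_from_scale, the regular expression, and the example.com URLs used to try it are all hypothetical; other scale formats may exist and would not be caught.

```python
import re

# Assumed shape of a webp scale token, e.g. '600x1200_90_webp'.
# The quality part ('_90') is treated as optional.
_WEBP_SCALE = re.compile(r'\d+x\d+(?:_\d+)?_webp')


def ext_from_scale(url):
    """Hypothetical helper: 'webp' if the URL carries a webp scale, else None."""
    return 'webp' if _WEBP_SCALE.search(url) else None
```

When the function returns None, the extension from the URL filename would still have to be used.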

upbit commented 4 years ago

As you say, if we have to download the file before we can determine its type, this probably does not belong in download() itself (the download() API has already been released, so its parameters are not easy to change).

Could we consider adding a separate API (e.g. downloadWithExt(url, pattern)) to handle this?

Mikubill commented 4 years ago

An auto_ext parameter is used in pixivpy_async to allow the user to force the file name, otherwise the extension will be changed / added automatically: https://github.com/Mikubill/pixivpy-async/blob/master/pixivpy_async/bapi.py

if auto_ext and type in self.content_type_mapping:
    _ext = re.findall(r'(\.\w+)$', img_path)[0]
    img_path = img_path.replace(_ext, self.content_type_mapping[type])

Nachtalb commented 4 years ago

@Mikubill I wouldn't use regex to change the extension but rather this: https://docs.python.org/3/library/os.path.html#os.path.splitext
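
As a minimal sketch of that suggestion (replace_ext is a hypothetical helper name, not part of pixivpy or pixivpy_async):

```python
import os


def replace_ext(img_path, new_ext):
    """Hypothetical helper: swap the extension of img_path for new_ext."""
    # splitext only splits off the final extension of the last path component.
    root, _old_ext = os.path.splitext(img_path)
    # Accept both 'webp' and '.webp'.
    return root + '.' + new_ext.lstrip('.')
```

Unlike a regex followed by str.replace, this touches only the final extension, so a directory name that happens to contain ".jpg" is left alone, and a path without any extension does not raise an error.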

Bakutomo commented 4 years ago

You don't need to download the whole image to read the content-type header. If you set stream=True when making your request, only the headers will be downloaded immediately. You can then determine the full file name and check whether it exists before proceeding with the rest of the download. See the relevant section of the requests documentation.

So you'd do something like:

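import os
import shutil

with self.requests_call('GET', url, stream=True) as response:
    # With stream=True only the headers have been fetched at this point.
    content_type = response.headers.get('content-type')
    if content_type in content_type_mapping:
        # Assumes the mapping values carry no leading dot ('webp', 'jpg', ...).
        ext = '.' + content_type_mapping[content_type]
    else:
        # Fall back to the extension found in the URL.
        ext = os.path.splitext(url)[1]
    img_path = os.path.join(path, name + ext)
    if os.path.exists(img_path) and not replace:
        return False
    # The body is only downloaded here, while it is copied to disk.
    with open(img_path, 'wb') as out_file:
        shutil.copyfileobj(response.raw, out_file)
return True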

NewUserHa commented 1 year ago

Neither the 'file extension name' nor the 'content-type' reliably represents the real file type. You should read the first bytes and use the magic number to decide the real file type.
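
A minimal sketch of such a check for the four formats discussed in this thread. The helper name sniff_image_ext is hypothetical; the signatures are the standard leading bytes of JPEG, PNG, GIF and WebP files.

```python
def sniff_image_ext(head):
    """Hypothetical helper: guess an extension from the first bytes of a file.

    `head` should hold at least the first 12 bytes; returns None if unknown.
    """
    if head.startswith(b'\xff\xd8\xff'):
        return 'jpg'
    if head.startswith(b'\x89PNG\r\n\x1a\n'):
        return 'png'
    if head[:6] in (b'GIF87a', b'GIF89a'):
        return 'gif'
    # WebP is a RIFF container: 'RIFF', four size bytes, then 'WEBP'.
    if head[:4] == b'RIFF' and head[8:12] == b'WEBP':
        return 'webp'
    return None
```

Only the first 12 bytes are needed, so the check could run on the first chunk of a streamed response before the rest of the file is written.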

BTW, img-master should be the resized image rather than the original, shouldn't it?