Open Nachtalb opened 4 years ago
Using the content-type to determine the suffix is better. Although clients such as Chrome are compatible, some libraries may still determine the file type from the suffix.
Maybe we need a mapping? content-type -> file_suffix
although client/chrome is compatible
Yeah because it already looks for the content type and not the extension XD.
Maybe we need a mapping? content-type -> file_suffix
I can do this if you want. One question remains, however. Is this the only case or do you know about others as well, like serving as png or so? Depending on that the size of the mapping varies. Either only webp and jpeg or more 😕
e.g.
content_type_mapping = {
    'image/webp': 'webp',
    'image/jpg': 'jpg',
    'image/jpeg': 'jpg',
    'image/png': 'png',
    'image/gif': 'gif',
}
To get a list of most MIME type <=> file type mappings for images, you can run this command:
wget -qO- http://svn.apache.org/repos/asf/httpd/httpd/trunk/docs/conf/mime.types | egrep -v ^# | awk '{ for (i=2; i<=NF; i++) {print $i" "$1}}' | sort | grep 'image/'
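For Python callers, the standard library's mimetypes module offers a similar lookup without fetching the Apache file. This is only a minimal sketch: the helper name image_suffix is hypothetical, and the result can depend on the Python version and the system's MIME database, so an explicit mapping may still be the safer choice for less common types.

```python
import mimetypes


def image_suffix(content_type):
    """Return a suffix such as '.png' for an image MIME type, or None."""
    # Drop parameters such as '; charset=binary' before the lookup.
    mime = content_type.split(';', 1)[0].strip().lower()
    if not mime.startswith('image/'):
        return None
    # guess_extension returns the suffix with its leading dot, or None.
    return mimetypes.guess_extension(mime)


print(image_suffix('image/png'))
```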
If the content type exists in content_type_mapping, use the mapped suffix.
If not, use the URL filename suffix (as default).
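That rule could be sketched roughly as follows. The function name choose_suffix and the constant CONTENT_TYPE_MAPPING are hypothetical, and the mapping is just the small one proposed earlier in this thread, not anything pixivpy ships.

```python
import os
from urllib.parse import urlparse

# Assumed mapping, copied from the proposal earlier in the thread.
CONTENT_TYPE_MAPPING = {
    'image/webp': 'webp',
    'image/jpeg': 'jpg',
    'image/png': 'png',
    'image/gif': 'gif',
}


def choose_suffix(content_type, url):
    """Prefer the suffix mapped from Content-Type; fall back to the URL's suffix."""
    # Normalise the header value and drop any parameters after ';'.
    mime = (content_type or '').split(';', 1)[0].strip().lower()
    if mime in CONTENT_TYPE_MAPPING:
        return CONTENT_TYPE_MAPPING[mime]
    # Fall back to the extension of the URL path, ignoring any query string.
    ext = os.path.splitext(urlparse(url).path)[1]
    return ext.lstrip('.')
```

With the example URL from this issue, a response served as image/webp would get the webp suffix even though the URL ends in .jpg, while an unknown content type would keep jpg.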
I'll do that
ATM we check whether the file already exists. If the user does not give a name for the file, we infer the name from the URL. If the user has not set the replace flag and the file already exists, we stop and return False. https://github.com/upbit/pixivpy/blob/3ed6851da14acca3e34ca4587afe05549252c0c1/pixivpy3/api.py#L152
If we infer the name from the content-type, we need to download the file before we can check whether it already exists on the file system. Is that ok with you? The implication is that we may have to wait for the download just to tell the user the file already exists. :/
To mitigate that, we could infer the file extension from the given scale, but I am not sure if there are any other scales like WIDTHxHEIGHT_webp.
As you say, if we need to download the file first and then determine the file type, it may not be suitable to put both into the same call (the download() API has already been released, so it is not easy to modify its parameters).
Can we consider adding a new API (e.g. downloadWithExt(url, pattern)) to handle this operation?
An auto_ext parameter is used in pixivpy_async to allow the user to force the file name, otherwise the extension will be changed / added automatically: https://github.com/Mikubill/pixivpy-async/blob/master/pixivpy_async/bapi.py
if auto_ext and type in self.content_type_mapping:
    _ext = re.findall(r'(\.\w+)$', img_path)[0]
    img_path = img_path.replace(_ext, self.content_type_mapping[type])
@Mikubill I wouldn't use regex to change the extension but rather this: https://docs.python.org/3/library/os.path.html#os.path.splitext
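A minimal sketch of the os.path.splitext approach, for comparison. The helper name replace_ext is hypothetical. The regex version above raises IndexError when the path has no extension, and str.replace swaps every occurrence of the old extension, including one in a directory name; splitext only touches the final suffix.

```python
import os


def replace_ext(img_path, new_ext):
    """Swap the final extension of img_path for new_ext (given with its dot)."""
    # splitext splits only at the last dot of the final path component.
    root, _old_ext = os.path.splitext(img_path)
    return root + new_ext


print(replace_ext('downloads/65814810_p0_master1200.jpg', '.webp'))
```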
You don't need to download the whole image to read the content-type header. If you set stream=True when making your request, only the headers will be downloaded immediately. You can then determine the full file name and check whether it exists before proceeding with the rest of the download. See the relevant section of the requests documentation.
So you'd do something like:
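with self.requests_call('GET', url, stream=True) as response:
    content_type = response.headers.get('content-type', None)
    if content_type in content_type_mapping:
        # Assumes the mapping values have no leading dot, as in the mapping above.
        ext = '.' + content_type_mapping[content_type]
    else:
        # Fall back to the extension in the URL (needs "import os").
        ext = os.path.splitext(url)[1]
    img_path = os.path.join(path, name + ext)
    if os.path.exists(img_path) and not replace:
        # Leaving the block closes the response without downloading the body.
        return False
    with open(img_path, 'wb') as out_file:
        # Stream the body to disk (needs "import shutil").
        shutil.copyfileobj(response.raw, out_file)
return True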
Neither the file extension nor the content-type reliably represents the real file type. You should read the first bytes and use the magic number to determine the real file type.
BTW, img-master should be the resized image rather than the original, shouldn't it?
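The magic-number check suggested in this comment might look like the sketch below. The function name sniff_image_type is hypothetical, and it covers only the four formats discussed in this thread; the byte signatures themselves are the standard ones for JPEG, PNG, GIF and WebP.

```python
def sniff_image_type(header):
    """Guess an image suffix from the first bytes of a file, or return None."""
    # JPEG files start with the SOI marker followed by another marker byte.
    if header.startswith(b'\xff\xd8\xff'):
        return 'jpg'
    # PNG has a fixed 8-byte signature.
    if header.startswith(b'\x89PNG\r\n\x1a\n'):
        return 'png'
    # GIF starts with its version string.
    if header[:6] in (b'GIF87a', b'GIF89a'):
        return 'gif'
    # WebP is a RIFF container: 'RIFF', a 4-byte size, then 'WEBP'.
    if header[:4] == b'RIFF' and header[8:12] == b'WEBP':
        return 'webp'
    return None
```

Reading the first 12 bytes of the file is enough for all four checks.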
I noticed that Pixiv sometimes delivers wrong file extensions. The URLs it returns sometimes have a scale like 600x1200_90_webp, but the filename in the URL still ends in .jpg. Here is an example:
webp scale but jpg extension: https://i.pximg.net/c/600x1200_90_webp/img-master/img/2017/11/09/13/43/30/65814810_p0_master1200.jpg
webp: https://i.pximg.net/c/600x1200_90/img-master/img/2017/11/09/13/43/30/65814810_p0_master1200.jpg
I know it's an error in the API Pixiv provides. But because the API endpoints this framework uses are extracted from the Pixiv Android app etc. and are not official, I don't think there is an official place to tell them to fix it.
So my question is: should pixivpy save the image with the file extension from the URL, or with the one indicated by the content-type in the response headers, which gives us the correct image type?