happypandax / plugins

Plugins for HappyPanda X
https://happypandax.github.io/
GNU Lesser General Public License v3.0

File Metadata error when parsing HDoujin Downloader's info.json files inside zip files #40

Open Dystasia opened 3 years ago

Dystasia commented 3 years ago

File Metadata parser fails for info.json files generated from HDoujin Downloader when inside zip files. Same info.json when extracted parses with no issues whatsoever.

Here is the plugin.log:

Sep-09 00:16:49--INFO pluginctx.file-metadata.main: Attempting with DataType.eze
Sep-09 00:16:49--WARNING pluginctx.file-metadata.extractors.common: An error occured while trying to parse file into a dict
Sep-09 00:16:49--INFO pluginctx.file-metadata.main: Skipping DataType.eze
Sep-09 00:16:49--INFO pluginctx.file-metadata.main: Attempting with DataType.hdoujin
Sep-09 00:16:49--WARNING pluginctx.file-metadata.extractors.common: An error occured while trying to parse file into a dict
Sep-09 00:16:49--INFO pluginctx.file-metadata.main: Skipping DataType.hdoujin

Let me know if you need an example, but really this is happening with all my files.

Dystasia commented 3 years ago

Actually, it is not all of them. I am trying to identify the differences but I am guessing it has something to do with the structure of some info files.

Dystasia commented 3 years ago

OK, I found the issue. It has something to do with special characters when zipped. This JSON works when unzipped but not when zipped:

zatsuna commented 3 years ago

@Dystasia I only have zip and rar files. I did some testing and here's what I found out. The File Metadata plugin finds and successfully adds tags but only if the folder is unzipped. I don't have any unzipped galleries, so I didn't notice this before. I have many .zip galleries and none works with File Metadata. It worked fine with .zip galleries in HPX from a year before.

Also, I don't get duplicate galleries with unzipped folders when scanning for new galleries. If galleries are zipped, I always get duplicates of every gallery regardless of "Scan only for new galleries" option being selected. Every scan adds another duplicate.

These two issues are probably related to each other as they both are solved by unzipping.

Dystasia commented 3 years ago

Just an update of how I attempted to fix this.

First, the exception actually thrown when trying to parse is: 'charmap' codec can't decode byte 0x9d in position 314: character maps to <undefined>

This probably means the file is being read without UTF-8 encoding.
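For anyone following along, here is a minimal sketch of how this exact message can come about. It assumes the bytes are being decoded with the Windows cp1252 code page (an assumption on my part, not something confirmed in the plugin), and it uses the ❤ character (U+2764) as an example of a character whose UTF-8 encoding contains the byte 0x9d. The helper name try_decode is made up for illustration.

```python
# U+2764 (heavy black heart) encoded as UTF-8 is the three bytes e2 9d a4.
utf8_bytes = "\u2764".encode("utf-8")


def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"error: {exc}"


# Decoding with the right encoding round-trips the character.
as_utf8 = try_decode(utf8_bytes, "utf-8")

# cp1252 has no character assigned to byte 0x9d, so decoding fails there.
as_cp1252 = try_decode(utf8_bytes, "cp1252")
print(as_cp1252)
```

The second call fails on the 0x9d byte with a 'charmap' codec error, which matches the shape of the exception above.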

The reading and parsing of the file is happening in: https://github.com/happypandax/plugins/blob/6472a37cf6914fa32f99069724fa09fa324ddd95/plugins/File%20Metadata/extractors/common.py#L85-L86

Even though the encoding seems to get set at: https://github.com/happypandax/plugins/blob/6472a37cf6914fa32f99069724fa09fa324ddd95/plugins/File%20Metadata/extractors/common.py#L82-L83 this doesn't seem to work for compressed info.json files. When I remove the if condition, I get the exception: open() got an unexpected keyword argument 'encoding'

I can't see the source of hpx.command.CoreFS, even though the documentation states it is a file handler/wrapper, so I'm kinda stuck: I don't know the interface of this class or how to force the encoding some other way.

@twiddli do you have any input? Is this something that needs to be fixed in hpx core instead of the plugin?

twiddli commented 3 years ago

Hello, thank you guys for the troubleshooting. This is such a weird issue, as I still can't repro it. Creating a zip file with an info.json with the contents:

{ "manga_info": { "title": "Bad Girl", "original_title": "", "author": [], "artist": [ "INAGO" ], "circle": [], "scanlator": [], "translator": [], "publisher": "FAKKU", "description": "It’s because I’m a good student…that I need some stimulation. ❤", "status": "", "chapters": "N/A", "pages": 20, "tags": { "Misc": [ "Schoolgirl Outfit", "Creampie", "Deepthroat", "Exhibitionism", "Glasses", "Hentai", "Humiliation", "Loli", "Masturbation", "Teacher", "Toys", "Uncensored", "X-Ray" ] }, "type": "", "language": [ "English" ], "released": "", "reading_direction": "", "characters": [], "series": "", "parody": [ "Original Work" ], "url": "https://hentainexus.com/read/6019" } } 

works totally fine, I even put the character ❤ in the filename for good measure and got no issues.

Can you check if the file is utf-8 encoded?

Also, for more insight on what's happening on that line of code, it checks if the file is inside the archive and omits specifying the encoding because the archive handler from the std lib doesn't accept an encoding parameter when opening files from inside the archive. I think this is because it is assumed the encoding is utf-8.
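To illustrate the point about the archive handler, here is a rough sketch using only the standard library zipfile module, not the HPX CoreFS wrapper (whose interface isn't shown in this thread). ZipFile.open() returns a binary stream and has no encoding argument, so one way to force UTF-8 is to wrap that stream in io.TextIOWrapper. The function name load_json_member and the sample payload are made up for the example.

```python
import io
import json
import zipfile

# Build a small archive in memory; the member name and contents are illustrative.
payload = {"manga_info": {"title": "Example", "description": "heart \u2764"}}
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    encoded = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    archive.writestr("info.json", encoded)


def load_json_member(archive: zipfile.ZipFile, name: str) -> dict:
    """Read a JSON member, decoding it as UTF-8 regardless of the system locale."""
    # ZipFile.open() yields bytes and takes no encoding argument,
    # so the decoding is done by wrapping the binary stream.
    with archive.open(name) as raw:
        with io.TextIOWrapper(raw, encoding="utf-8") as text:
            return json.load(text)


with zipfile.ZipFile(buffer) as archive:
    data = load_json_member(archive, "info.json")
```

Whether something like this can be applied in the plugin depends on what the CoreFS open call actually returns for files inside an archive, which I'm only guessing at here.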

Saving the info.json file inside the archive with an encoding other than utf-8, I get this error: 'CP_UTF8' codec can't decode bytes in position 0--1: No mapping for the Unicode character exists in the target code page. This suggests that it expects utf-8 for all text files.
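If it helps with checking the encoding, a quick way to test whether a file's raw bytes are valid UTF-8 is to try a strict decode. This is only a sketch with a hypothetical helper name (describe_encoding); it can't tell you which other encoding a file uses, only whether UTF-8 decoding succeeds and whether a UTF-8 byte order mark is present.

```python
import codecs


def describe_encoding(data: bytes) -> str:
    """Classify raw bytes as UTF-8 with BOM, plain UTF-8, or not UTF-8."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8 with BOM"
    try:
        data.decode("utf-8")  # strict decoding raises on invalid sequences
    except UnicodeDecodeError:
        return "not utf-8"
    return "utf-8"


# Sample inputs; in practice the bytes would come from reading the file in binary mode.
plain = describe_encoding("caf\u00e9 \u2764".encode("utf-8"))
with_bom = describe_encoding("caf\u00e9".encode("utf-8-sig"))
legacy = describe_encoding("caf\u00e9".encode("cp1252"))
print(plain, with_bom, legacy, sep=" | ")
```

Note that pure ASCII content passes as "utf-8" too, since ASCII is a subset of UTF-8.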

zatsuna commented 3 years ago

All my files generated by E-Hentai Downloader have a UTF-8 info.txt.

Sample info file: info.txt