openzim / sotoki

StackExchange websites to ZIM scraper
https://library.kiwix.org/?category=stack_exchange
GNU General Public License v3.0
216 stars 25 forks source link

Upgrade python-scraperlib to 3.X (implement #290) #292

Closed IMayBeABitShy closed 3 months ago

IMayBeABitShy commented 5 months ago

(PR depends on https://github.com/openzim/python-scraperlib/pull/126 and will likely need at least one revision due to some feedback being needed)

This PR upgrades python-scraperlib to 3.x while also adding a --long-description argument. This implements #290.

Changes:

Test status:

Comments:

Edit:

This PR fixes #290

benoit74 commented 5 months ago

I will do a full review tomorrow, but it looks like a very good job.

Until the whole review, I can already tell that you will have to use the compute_descriptions utility function:

This function takes care of doing some "intelligent" logic based on these three values and we would like to share this logic across all scrapers.

benoit74 commented 5 months ago

The favicon/Illustration_48x48@1 needed a rework, as it now needs to be specified when calling .config_metadata(). I had to move the conversion for the 48x48 image to the creator setup, but this led to some duplicate code and is a IMO rather ugly solution. Some feedback on how to refactor the code to avoid this would be appreciated.

Agreed. @rgaudin is there any reason to not include the Illustration_96x96_at_1 as an optional in config_metadata of scraperlib ?

Otherwise LGTM

rgaudin commented 5 months ago

Agreed. @rgaudin is there any reason to not include the Illustration_96x96_at_1 as an optional in config_metadata of scraperlib ?

config_metadata is for the common metadata that we set in scrapers. Illustration_96x96@1 is not. It can be set later as for any other metadata.

Once we have a strategy or support in readers for other illustrations, we might reconsider but at the moment, that 96x96 illustration is not usable. I was enthusiastic we'd add support in readers along with the libzim feature but it's not considered a priority. I suggest we simply remove it from sotoki 🙁

IMayBeABitShy commented 5 months ago

Should I remove the code for Illustration_96x96@1? Also, perhaps it would be a good idea to implement shared logic for illustration handling in python-zimscraperlib? For example, we could write a method to only pass a file path to the zimscraperlib and it could decide which illustrations to include, handling resize and format conversions as necessary.

rgaudin commented 5 months ago

Should I remove the code for Illustration_96x96@1?

Yes

Also, perhaps it would be a good idea to implement shared logic for illustration handling in python-zimscraperlib? For example, we could write a method to only pass a file path to the zimscraperlib and it could decide which illustrations to include, handling resize and format conversions as necessary.

There's already something that handles part of this I believe using fpath or URL. Please check it out and open a ticket to improve/replace/add something.

benoit74 commented 4 months ago

@IMayBeABitShy will you be able to finish this soon or should I take this over? I do not mind, I know this is only a side-project for you, it is just that we plan to make a release soon and if we could have this as well in the release, it would be great. We would like to do the release this week ideally, but we have already waited many months to do a release, so we can wait a bit more if needed.

IMayBeABitShy commented 3 months ago

@IMayBeABitShy will you be able to finish this soon or should I take this over? I do not mind, I know this is only a side-project for you, it is just that we plan to make a release soon and if we could have this as well in the release, it would be great. We would like to do the release this week ideally, but we have already waited many months to do a release, so we can wait a bit more if needed.

Sorry about that, I've forgotten that I hadn't already pushed the last commit and encountered a problem with a library, which prevented me from verifying the changes. I've pushed the last commit, which removes the extra illustration, now.

IMayBeABitShy commented 3 months ago

I've implemented the requested changes.

This is probably wrong, you need 3.3.0 to have https://github.com/openzim/python-scraperlib/pull/126

Correct. Seems like I've missed this during the last update, there wasn't a 3.3.0 version yet when I've opened this PR. Fixed.