Closed satyamtg closed 3 years ago
Considering that Kolibri2zim is a new scraper, I would be strongly disappointed if anything would be temporary written to the hard disk.
Discussed this in the meeting and it’s a libzim thing. There are actually two ways to add binary :
I’ll add an option to scraperlib so we can use filepath and decide whether how it should be handled internally.
On Sun, 25 Oct 2020 at 12:59 Kelson notifications@github.com wrote:
Considering that Kolibri2zim is a new scraper, I would be stringly disappinted if anythin would be temporary written to the hard disk.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/openzim/python_scraperlib/issues/69#issuecomment-716144273, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAOESIUDW5RVVJKJMOIZSTSMQOLDANCNFSM4S6KAO4A .
Considering that Kolibri2zim is a new scraper, I would be strongly disappointed if anything would be temporary written to the hard disk.
We only write to disk the assets at the moment as they are downloaded using save_large_file() from scraperlib. For other things, like articles and stuff, they are written directly to the ZIM.
However, as you suggest, I'll try to change this behaviour shortly.
@rgaudin I'm not sure where the content file reading really happen, in the libzim or in pylibzim. Therefore it is difficult for me to figure out the solution. However, here it seems we have a weak async implementation somewhere. We have also a "zimwriterfs compat" mode in scraperlib which seems to be used here. All of this worries me a bit. "Zimwriterfs compat" should be used only for old scrapers and proper async implementation implies callback/promise or similar pattern. We can have a short discussion about that tomorrow.
I changed kolibri2zim to write binary files too using Creator.add_binary()
, however, it now fails with the following -
File "/home/satyamtg/Projects/kolibri2zim/kolibri2zim_first_implementation/kolibri2zim_test/venv/lib/python3.8/site-packages/kolibri2zim-1.0.0.dev0-py3.8.egg/kolibri2zim/scraper.py", line 113, in add_node_files_to_zim
self.creator.add_binary(url=f"{curr_path}{file_asset_name}", content=file_content)
File "/home/satyamtg/Projects/kolibri2zim/kolibri2zim_first_implementation/kolibri2zim_test/venv/lib/python3.8/site-packages/zimscraperlib/zim/creator.py", line 177, in add_binary
self.add_zim_article(
File "/home/satyamtg/Projects/kolibri2zim/kolibri2zim_first_implementation/kolibri2zim_test/venv/lib/python3.8/site-packages/zimscraperlib/zim/creator.py", line 145, in add_zim_article
super().add_article(article)
File "/home/satyamtg/Projects/kolibri2zim/kolibri2zim_first_implementation/kolibri2zim_test/venv/lib/python3.8/site-packages/libzim/writer.py", line 192, in add_article
self._creatorWrapper.add_article(article)
File "libzim/wrapper.pyx", line 200, in libzim.wrapper.Creator.add_article
RuntimeError: Traceback (most recent call last):
File "libzim/wrapper.pyx", line 110, in libzim.wrapper.string_cy_call_fct
AttributeError: 'NoneType' object has no attribute 'encode'
I also tested manually with other valid bytes
content and yet the same error. Not sure if it's libzim or scraperlib.
I changed kolibri2zim to write binary files too using Creator.add_binary(), however, it now fails with the following -
Do you have the code somewhere ? It seems you are returning a None
object instead of the content.
The next libzim api doesn't have this issue has you must give a ContentProvider
, then you are in charge of reading the file and you can delete it when you want.
I changed kolibri2zim to write binary files too using Creator.add_binary(), however, it now fails with the following -
Do you have the code somewhere ? It seems you are returning a
None
object instead of the content.The next libzim api doesn't have this issue has you must give a
ContentProvider
, then you are in charge of reading the file and you can delete it when you want.
I am using this method and I pass a bytes string to the content.
The error can be a lot of thing. In fact it seems it is not related to the blob (content) anymore. "You" are passing a None object were libzim is expecting a string. It can be any other attribute (which is None by default in python_scraperlib) you've forget to set.
This issue has been automatically marked as stale because it has not had recent activity. It will be now be reviewed manually. Thank you for your contributions.
https://github.com/openzim/kolibri2zim/issues/8 is due to the fact that the ZIM Creator API is not working as intended. When we delete a file just after calling
Creator.add_binary()
, pylibzim just crashes with a segmentation fault, which I suppose is due to the fact that it tries to read the file after the call on the filesystem (which might, in itself be a pylibzim bug).I can confirm this as I tested this in scraperlib by modifying the test for adding an image (which is already there on the filesystem) to the ZIM. The crash report is as follows -
The test passes properly is I do not delete just after the call.