openzim / python-scraperlib

Collection of Python code to re-use across Python-based scrapers
GNU General Public License v3.0
18 stars 16 forks source link

ZIM creator API not working properly #69

Closed satyamtg closed 3 years ago

satyamtg commented 3 years ago

https://github.com/openzim/kolibri2zim/issues/8 is due to the fact that the ZIM Creator API is not working as intended. When we delete a file just after calling Creator.add_binary(), pylibzim just crashes with a segmentation fault, which I suppose is due to the fact that it tries to read the file after the call on the filesystem (which might, in itself be a pylibzim bug).

I can confirm this as I tested this in scraperlib by modifying the test for adding an image (which is already there on the filesystem) to the ZIM. The crash report is as follows -

tests/zim/test_zim_creator.py::test_zim_creator Fatal Python error: Aborted

Thread 0x00007f3edaf77740 (most recent call first):
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/libzim-0.0.3.post0-py3.8-linux-x86_64.egg/libzim/writer.py", line 229 in close
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/zimscraperlib-1.3.5.dev0-py3.8.egg/zimscraperlib/zim/creator.py", line 306 in close
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/libzim-0.0.3.post0-py3.8-linux-x86_64.egg/libzim/writer.py", line 175 in __exit__
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/tests/zim/test_zim_creator.py", line 61 in test_zim_creator
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/_pytest/python.py", line 184 in pytest_pyfunc_call
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/_pytest/python.py", line 1627 in runtest
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/_pytest/runner.py", line 163 in pytest_runtest_call
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/_pytest/runner.py", line 256 in <lambda>
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/_pytest/runner.py", line 310 in from_call
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/_pytest/runner.py", line 255 in call_runtest_hook
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/_pytest/runner.py", line 216 in call_and_report
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/_pytest/runner.py", line 127 in runtestprotocol
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/_pytest/runner.py", line 110 in pytest_runtest_protocol
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/_pytest/main.py", line 338 in pytest_runtestloop
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/_pytest/main.py", line 313 in _main
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/_pytest/main.py", line 257 in wrap_session
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/_pytest/main.py", line 306 in pytest_cmdline_main
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/_pytest/config/__init__.py", line 164 in main
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/lib/python3.8/site-packages/_pytest/config/__init__.py", line 187 in console_main
  File "/home/satyamtg/Projects/python_scraperlib/python_scraperlib_test_zim_creation/python_scraperlib/venv/bin/pytest", line 8 in <module>
[1]    24615 abort (core dumped)

The test passes properly is I do not delete just after the call.

kelson42 commented 3 years ago

Considering that Kolibri2zim is a new scraper, I would be strongly disappointed if anything would be temporary written to the hard disk.

rgaudin commented 3 years ago

Discussed this in the meeting and it’s a libzim thing. There are actually two ways to add binary :

I’ll add an option to scraperlib so we can use filepath and decide whether how it should be handled internally.

On Sun, 25 Oct 2020 at 12:59 Kelson notifications@github.com wrote:

Considering that Kolibri2zim is a new scraper, I would be stringly disappinted if anythin would be temporary written to the hard disk.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/openzim/python_scraperlib/issues/69#issuecomment-716144273, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAOESIUDW5RVVJKJMOIZSTSMQOLDANCNFSM4S6KAO4A .

satyamtg commented 3 years ago

Considering that Kolibri2zim is a new scraper, I would be strongly disappointed if anything would be temporary written to the hard disk.

We only write to disk the assets at the moment as they are downloaded using save_large_file() from scraperlib. For other things, like articles and stuff, they are written directly to the ZIM.

However, as you suggest, I'll try to change this behaviour shortly.

kelson42 commented 3 years ago

@rgaudin I'm not sure where the content file reading really happen, in the libzim or in pylibzim. Therefore it is difficult for me to figure out the solution. However, here it seems we have a weak async implementation somewhere. We have also a "zimwriterfs compat" mode in scraperlib which seems to be used here. All of this worries me a bit. "Zimwriterfs compat" should be used only for old scrapers and proper async implementation implies callback/promise or similar pattern. We can have a short discussion about that tomorrow.

satyamtg commented 3 years ago

I changed kolibri2zim to write binary files too using Creator.add_binary(), however, it now fails with the following -

  File "/home/satyamtg/Projects/kolibri2zim/kolibri2zim_first_implementation/kolibri2zim_test/venv/lib/python3.8/site-packages/kolibri2zim-1.0.0.dev0-py3.8.egg/kolibri2zim/scraper.py", line 113, in add_node_files_to_zim
    self.creator.add_binary(url=f"{curr_path}{file_asset_name}", content=file_content)
  File "/home/satyamtg/Projects/kolibri2zim/kolibri2zim_first_implementation/kolibri2zim_test/venv/lib/python3.8/site-packages/zimscraperlib/zim/creator.py", line 177, in add_binary
    self.add_zim_article(
  File "/home/satyamtg/Projects/kolibri2zim/kolibri2zim_first_implementation/kolibri2zim_test/venv/lib/python3.8/site-packages/zimscraperlib/zim/creator.py", line 145, in add_zim_article
    super().add_article(article)
  File "/home/satyamtg/Projects/kolibri2zim/kolibri2zim_first_implementation/kolibri2zim_test/venv/lib/python3.8/site-packages/libzim/writer.py", line 192, in add_article
    self._creatorWrapper.add_article(article)
  File "libzim/wrapper.pyx", line 200, in libzim.wrapper.Creator.add_article
RuntimeError: Traceback (most recent call last):
  File "libzim/wrapper.pyx", line 110, in libzim.wrapper.string_cy_call_fct
AttributeError: 'NoneType' object has no attribute 'encode'

I also tested manually with other valid bytes content and yet the same error. Not sure if it's libzim or scraperlib.

mgautierfr commented 3 years ago

I changed kolibri2zim to write binary files too using Creator.add_binary(), however, it now fails with the following -

Do you have the code somewhere ? It seems you are returning a None object instead of the content.

The next libzim api doesn't have this issue has you must give a ContentProvider, then you are in charge of reading the file and you can delete it when you want.

satyamtg commented 3 years ago

I changed kolibri2zim to write binary files too using Creator.add_binary(), however, it now fails with the following -

Do you have the code somewhere ? It seems you are returning a None object instead of the content.

The next libzim api doesn't have this issue has you must give a ContentProvider, then you are in charge of reading the file and you can delete it when you want.

https://github.com/openzim/python_scraperlib/blob/6414f13d27b262c387bd35f8cdce729bff1ab235/src/zimscraperlib/zim/creator.py#L151

I am using this method and I pass a bytes string to the content.

mgautierfr commented 3 years ago

The error can be a lot of thing. In fact it seems it is not related to the blob (content) anymore. "You" are passing a None object were libzim is expecting a string. It can be any other attribute (which is None by default in python_scraperlib) you've forget to set.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be now be reviewed manually. Thank you for your contributions.