Closed by benoit74 2 months ago
Hey, I would like to take up this issue.
👍
I have opened a Pull request #208 to solve the issue. Kindly go through it. Thanks!
Is this issue still open? I would like to work on it.
Yes, still open, I assigned you the issue. I had to close the previous PR because the code was not working and the committer vanished. You can probably start from the code in that PR and fix what needs fixing. It is important to execute a full run of the scraper to ensure it is working properly. I mentioned the command I used to test in the PR.
Should I also replace `os.unlink` with the `unlink` method from `Path`? I've looked over 3-4 modules so far, and it seems to be the only place where we are still using `os`.
Yes, this would make sense.
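For reference, the swap discussed above is essentially one-for-one; here's a minimal sketch (the temp file name is illustrative, not from the codebase):

```python
import tempfile
from pathlib import Path

# Create a throwaway file to delete.
tmp_dir = Path(tempfile.mkdtemp())
f = tmp_dir / "example.txt"
f.write_text("hello")

# Path.unlink() is a drop-in replacement for os.unlink(f).
f.unlink()
assert not f.exists()
```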
I've written most of the changes, however I'm having trouble testing the script. I ran `python src/gutenberg2zim`, which got all the way to id 90k+ and started downloading, but then I ran out of local memory, oops. And `python src/gutenberg2zim --books 84` and `./gutenberg2zim -l en,fr -f pdf --books 100-200 --bookshelves --title-search` take a very long time (10-20+ minutes just to run into an error).
I made a clean WSL env on Debian and cloned the main fork again to see how it's supposed to work normally, but I also have problems there. At this point I'm concerned I'm not even able to properly set up the dev env...
```
[gutenberg2zim.constants::2024-03-07 23:10:38,587] INFO:CHECKING for dependencies on the system
[gutenberg2zim.constants::2024-03-07 23:10:38,592] INFO:PREPARING rdf-files cache from http://www.gutenberg.org/cache/epub/feeds/rdf-files.tar.bz2
[gutenberg2zim.constants::2024-03-07 23:10:38,593] INFO: rdf-files archive already exists in /mnt/d/python/test/gutenberg/rdf-files.tar.bz2
[gutenberg2zim.constants::2024-03-07 23:10:38,593] INFO:SETTING UP DATABASE
[gutenberg2zim.constants::2024-03-07 23:10:38,593] INFO:Setting up the database
[gutenberg2zim.constants::2024-03-07 23:10:38,593] DEBUG:license table already exists.
[gutenberg2zim.constants::2024-03-07 23:10:38,594] DEBUG:author table already exists.
[gutenberg2zim.constants::2024-03-07 23:10:38,594] DEBUG:book table already exists.
[gutenberg2zim.constants::2024-03-07 23:10:38,594] DEBUG:bookformat table already exists.
[gutenberg2zim.constants::2024-03-07 23:10:38,595] DEBUG:url table already exists.
[gutenberg2zim.constants::2024-03-07 23:10:38,595] INFO:PARSING rdf-files in /mnt/d/python/test/gutenberg/rdf-files.tar.bz2
[gutenberg2zim.constants::2024-03-07 23:10:38,595] INFO: Looping throught RDF files in /mnt/d/python/test/gutenberg/rdf-files.tar.bz2
[gutenberg2zim.constants::2024-03-07 23:10:58,405] INFO: Skipping already parsed file cache/epub/84/pg84.rdf for book id 84
[gutenberg2zim.constants::2024-03-07 23:10:58,897] INFO:Add possible url to db
[gutenberg2zim.constants::2024-03-07 23:10:58,898] INFO: Urls rsync result tmp/file_on_aleph_pglaf_org already exists, processing existing file
[gutenberg2zim.constants::2024-03-07 23:10:58,898] INFO: Looking after relative path start in urls rsync result
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/mnt/d/python/test/gutenberg/src/gutenberg2zim/__main__.py", line 4, in <module>
    main()
  File "/home/imperaluna/.local/share/hatch/env/virtual/gutenberg2zim/6b97C51F/gutenberg2zim/lib/python3.11/site-packages/gutenberg2zim/entrypoint.py", line 181, in main
    setup_urls(force=force)
  File "/home/imperaluna/.local/share/hatch/env/virtual/gutenberg2zim/6b97C51F/gutenberg2zim/lib/python3.11/site-packages/gutenberg2zim/urls.py", line 256, in setup_urls
    if start_rel_path_idx == -1:  # type: ignore
       ^^^^^^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'start_rel_path_idx' where it is not associated with a value
```
I even tried to build a Docker image, but I had to Ctrl+C it after about 1h:

```
docker run my-image gutenberg2zim --books 84
[gutenberg2zim.constants::2024-03-07 21:16:20,934] INFO:Add possible url to db
[gutenberg2zim.constants::2024-03-07 21:16:20,934] DEBUG:bash -c rsync -a --list-only rsync://aleph.pglaf.org/gutenberg/ > tmp/file_on_aleph_pglaf_org
[gutenberg2zim.constants::2024-03-07 21:36:15,318] INFO: Looking after relative path start in urls rsync result
[gutenberg2zim.constants::2024-03-07 21:36:16,023] INFO: Removing all urls already present in DB
[gutenberg2zim.constants::2024-03-07 21:36:16,027] INFO: Appending urls in DB from rsync result
```
I would like to fix this @benoit74
@elfkuzco please do ^^
This will probably take a bit longer than I expected, because in some places the code expects a string, and passing an instance of `Path` (from `pathlib`) breaks things when I try to test. My idea is to add type annotations too, rather than resolving the paths everywhere they are called.
Makes sense
@benoit74, I am almost done with this fix and I have tested the command `python src/gutenberg2zim --books 84`, and everything works fine (except that the paths displayed in the logs are fully resolved paths).
However, before I run more tests and submit a PR, I would like to point out that the usage of `os.path` in `urls.py` is fine and shouldn't be switched to `pathlib.Path`, because those values are actually just strings (URLs) and have nothing to do with files. Using `pathlib.Path` here doesn't really seem "clean", as it would also involve calling `str` on the result after building the string.
For example, `url = os.path.join("a", "b")` and `url = str(Path("a") / "b")` both achieve the same thing. But using `Path` seems a little harder to read, given that none of the actual `Path` operations are ever invoked in this module. What do you think?
Great! Your point is important, thank you for asking. This is in fact potentially even a bug on Windows platforms.

First of all, I'm quite sure the code in your example does not achieve the same thing on Windows. You need to use the `PurePosixPath` class to achieve the same thing (pure because we do not intend to perform any real I/O with this path, posix because we always want `/` as separator, even on Windows).

And this can then be written more neatly as `PurePosixPath("a", "b")` (hence being quite close to `os.path.join`). You can even write stuff like `PurePosixPath(*list_of_paths, "c")` (instead of calling `PurePosixPath` multiple times, as was done with `os.path.join`).
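A quick sketch of that joining behavior (the segment names here are illustrative):

```python
from pathlib import PurePosixPath

# Multiple segments in one constructor call, similar to os.path.join:
assert str(PurePosixPath("a", "b")) == "a/b"

# Splatting a list of segments plus an extra one in a single call:
parts = ["cache", "epub", "84"]
assert str(PurePosixPath(*parts, "pg84.rdf")) == "cache/epub/84/pg84.rdf"
```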
I tried to work with `PurePosixPath` and I noticed that if you call it on a URL which contains `http://` (two consecutive slashes), it strips one off, and this in turn causes unintended effects in the rest of the code. For example,

```python
p = PurePosixPath("http://aleph.pglaf.org/cache/epub/84/pg84.epub")
```

yields

```python
PurePosixPath('http:/aleph.pglaf.org/cache/epub/84/pg84.epub')
```

Notice that one of the slashes after `http:` is gone. I think the reason it works with `os.path` is that it simply concatenates the paths "intelligently" using a path separator. The same issue would arise even if we used `Path`.
In light of these observations, my opinion is to either leave the implementation as it is with `os.path`, or default to building the strings by adding the slash explicitly. The latter is my preferred approach, as it makes it obvious to any reader that they are working with strings. Plus, there's no telling whether the implementation of `os.path.join` will change in future versions of the language.
Oh, if it is a real URL, then it is not a path (OK, that's obvious maybe ^^).

My recommendation would still be to get rid of `os.path.join`, it is purely misleading. I would consider using a mix of `PurePosixPath` (to join the parts of a URL path or subpath) and simple string concatenation.
If I'm not mistaken,

```python
base_url = os.path.join(
    os.path.join(*list(str(self.b_id))[:-1]), str(self.b_id)
)
```

would become

```python
base_url = PurePosixPath(*list(str(self.b_id))[:-1], str(self.b_id))
```

And

```python
url = os.path.join(self.base, base_url)
```

would become

```python
url = self.base + str(base_url)
```

This means we would have a convention which says that `self.base` (and other bases) MUST end with a slash (not a big concern AFAIK, we just have to fix `BASE_THREE`) and the path must not start with a slash (but `PurePosixPath` never adds one magically).
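Put together, the convention could look like this (the base URL and book id here are illustrative values, not taken from the codebase):

```python
from pathlib import PurePosixPath

base = "http://aleph.pglaf.org/"  # convention: bases end with a slash
b_id = "84"

# All digits but the last form the directory, then the full id: "8/84"
base_url = PurePosixPath(*list(b_id)[:-1], b_id)
url = base + str(base_url)
assert url == "http://aleph.pglaf.org/8/84"
```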
WDYT of this? I do not mind if you consider it too cumbersome, this is just a proposition to enhance the codebase, not something ultra important.
In my implementation, I used f-strings and got rid of `Path` completely. And I chose the convention that paths end without a slash, so that I have to add the `/` explicitly myself. This, I think, is more obvious to a reader of the code. Here's a snippet of my refactoring of the `build` method:
```python
if self.base == self.BASE_ONE:
    if int(self.b_id) > 10:  # noqa: PLR2004
        components = "/".join(self.b_id[:-1])
        base_url = f"{components}/{self.b_id}"
    else:
        base_url = f"0/{self.b_id}"
    url = f"{self.base}/{base_url}"
elif self.base == self.BASE_TWO:
    url = f"{self.base}/{self.b_id}"
elif self.base == self.BASE_THREE:
    url = self.base
return url
```
And here's where it is used to build the epub URLs:

```python
name = "".join(["pg", b_id])
url = f"{u.build()}/{name}.epub"
url_images = f"{u.build()}/{name}-images.epub"
url_noimages = f"{u.build()}/{name}-noimages.epub"
urls.extend([url, url_images, url_noimages])
```
Aside from this, I think everything works. I have run `python src/gutenberg2zim --books 84` and `python3 src/gutenberg2zim -l en,fr -f pdf --books 100-200 --bookshelves --title-search`, and everything works fine. Here's a log file for the first command: beta.log

I am currently running `python src/gutenberg2zim` for the full run. If that works and you are okay with my use of f-strings, I will submit the PR.
Yes, I'm fine with your proposition!
Why are you trying to use pathlib on a URL? I think `urllib.parse` is what you want.
I think we all agree that using path methods on a URL is a bad idea. `urllib.parse.urljoin` is unfortunately not much help because, AFAIK, it does not simply append `/` between URL path segments (which is all we need, in fact).
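For illustration, `urljoin` resolves relative references the way a browser would, rather than joining path segments (the URLs below are illustrative):

```python
from urllib.parse import urljoin

# Without a trailing slash, the last path segment is replaced:
assert (
    urljoin("http://aleph.pglaf.org/cache/epub", "84")
    == "http://aleph.pglaf.org/cache/84"
)

# Only with a trailing slash on the base does the segment get appended:
assert (
    urljoin("http://aleph.pglaf.org/cache/epub/", "84")
    == "http://aleph.pglaf.org/cache/epub/84"
)
```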
I'm quite pleased with the idea proposed by @elfkuzco to simply use string manipulation for this (I've not read the PR yet, but I now think that string manipulation is the best solution for the current situation).
Pretty sure we're doing similar stuff in other scrapers. I believe we use `Path` (`PurePosixPath` would be better) on the path of a parsed URL. Please take a look at other scrapers (wikihow, sotoki?), as this seems to be pretty common behavior in scrapers.
I had a look at the PR and confirm I like the string manipulation proposed for handling URL path segments; it is way clearer than `PurePosixPath` or `Path`, which do not work on URLs, or `os.path.join`, which is semantically wrong. See https://github.com/openzim/gutenberg/pull/225/files#diff-1c2c0aac8f278200c33b65c8b05dc7367b1687f64c7e8794a1b9da7b1ae5e369
I had a look in sotoki and wikihow and failed to find any URL which is dynamically built by concatenating many path segments.
> it is way clearer than `PurePosixPath` or `Path`, which do not work on URLs

There's a clue in the name that those are not the same thing 😉
Though, a URL path (the `path` element of a `urlsplit()` result) is as close as it gets to a `PurePosixPath`. It can be very handy in situations where you need to resolve targets from relative notation, or to ensure/compute targets relative to one another. We may not need it here, nor in any scraper. I thought we did in wikiHow, but actually we don't.
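To illustrate that point, the `path` component of a split URL behaves naturally as a `PurePosixPath` (the URL below is illustrative):

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

parts = urlsplit("http://aleph.pglaf.org/cache/epub/84/pg84.epub")
path = PurePosixPath(parts.path)

# Path-style operations now work on the URL path alone,
# without touching the scheme or host:
assert path.name == "pg84.epub"
assert str(path.parent) == "/cache/epub/84"
```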
We use `os.path`, `path.py`, and `pathlib` in the codebase. This is pretty confusing in many situations. We should use only one, and `pathlib` is the one to choose.