openzim / gutenberg

Scraper for downloading the entire ebooks repository of project Gutenberg
https://download.kiwix.org/zim/gutenberg
GNU General Public License v3.0
126 stars 37 forks source link

Remove usage of os.path and path.py, prefer pathlib #195

Closed benoit74 closed 2 months ago

benoit74 commented 10 months ago

We use both os.path, path.py and pathlib in the codebase.

This is pretty confusing in many situation. We should use only one, and pathlib is the one to choose.

MUCCHU commented 6 months ago

Hey, I would like to take up this issue.

rgaudin commented 6 months ago

👍

MUCCHU commented 6 months ago

I have opened a Pull request #208 to solve the issue. Kindly go through it. Thanks!

ImperaLuna commented 3 months ago

Is this issue still open? I would like to work on it.

benoit74 commented 3 months ago

Yes, still open, I assigned you the issue. I had to close previous PR because code was not working and committer vanished. You can probably get started with the code in this PR and fixed what had to be. Important is to execute on full run of the scraper to ensure it is working properly. I mentioned the command I used to test in the PR.

ImperaLuna commented 3 months ago

Should I also replace os.unlink with unlink method from Path? I've looked over 3-4 modules so far and it seems to be the only place we are still using os .

benoit74 commented 3 months ago

Yes, this would make sense.

ImperaLuna commented 3 months ago

I've wrote most of the changes however I'm having troubles testing the script.

I've used python src/gutenberg2zim and ran all the way to id 90k+ and started downloading but I ran out of local memory upsy.

However python src/gutenberg2zim --books 84 & ./gutenberg2zim -l en,fr -f pdf --books 100-200 --bookshelves --title-search . Take a very long time(10-20+ minutes to run into a error).

I made a clean WSL env on debian and cloned the main fork again to see how it's supposed to work normally but I also have problems there:

And at this point I'm concerned I'm not even able to properly set up de dev env...

[gutenberg2zim.constants::2024-03-07 23:10:38,587] INFO:CHECKING for dependencies on the system
[gutenberg2zim.constants::2024-03-07 23:10:38,592] INFO:PREPARING rdf-files cache from http://www.gutenberg.org/cache/epub/feeds/rdf-files.tar.bz2
[gutenberg2zim.constants::2024-03-07 23:10:38,593] INFO:        rdf-files archive already exists in /mnt/d/python/test/gutenberg/rdf-files.tar.bz2
[gutenberg2zim.constants::2024-03-07 23:10:38,593] INFO:SETTING UP DATABASE
[gutenberg2zim.constants::2024-03-07 23:10:38,593] INFO:Setting up the database
[gutenberg2zim.constants::2024-03-07 23:10:38,593] DEBUG:license table already exists.
[gutenberg2zim.constants::2024-03-07 23:10:38,594] DEBUG:author table already exists.
[gutenberg2zim.constants::2024-03-07 23:10:38,594] DEBUG:book table already exists.
[gutenberg2zim.constants::2024-03-07 23:10:38,594] DEBUG:bookformat table already exists.
[gutenberg2zim.constants::2024-03-07 23:10:38,595] DEBUG:url table already exists.
[gutenberg2zim.constants::2024-03-07 23:10:38,595] INFO:PARSING rdf-files in /mnt/d/python/test/gutenberg/rdf-files.tar.bz2
[gutenberg2zim.constants::2024-03-07 23:10:38,595] INFO:        Looping throught RDF files in /mnt/d/python/test/gutenberg/rdf-files.tar.bz2
[gutenberg2zim.constants::2024-03-07 23:10:58,405] INFO:        Skipping already parsed file cache/epub/84/pg84.rdf for book id 84
[gutenberg2zim.constants::2024-03-07 23:10:58,897] INFO:Add possible url to db
[gutenberg2zim.constants::2024-03-07 23:10:58,898] INFO:        Urls rsync result tmp/file_on_aleph_pglaf_org already exists, processing existing file
[gutenberg2zim.constants::2024-03-07 23:10:58,898] INFO:        Looking after relative path start in urls rsync result
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/mnt/d/python/test/gutenberg/src/gutenberg2zim/__main__.py", line 4, in <module>
    main()
  File "/home/imperaluna/.local/share/hatch/env/virtual/gutenberg2zim/6b97C51F/gutenberg2zim/lib/python3.11/site-packages/gutenberg2zim/entrypoint.py", line 181, in main
    setup_urls(force=force)
  File "/home/imperaluna/.local/share/hatch/env/virtual/gutenberg2zim/6b97C51F/gutenberg2zim/lib/python3.11/site-packages/gutenberg2zim/urls.py", line 256, in setup_urls
    if start_rel_path_idx == -1:  # type: ignore
       ^^^^^^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'start_rel_path_idx' where it is not associated with a value

I even tried to make a docker image but it i had to ctrl+c it after about 1h

docker run my-image gutenberg2zim --books 84
[gutenberg2zim.constants::2024-03-07 21:16:20,934] INFO:Add possible url to db
[gutenberg2zim.constants::2024-03-07 21:16:20,934] DEBUG:bash -c rsync -a --list-only rsync://aleph.pglaf.org/gutenberg/ > tmp/file_on_aleph_pglaf_org
[gutenberg2zim.constants::2024-03-07 21:36:15,318] INFO:        Looking after relative path start in urls rsync result
[gutenberg2zim.constants::2024-03-07 21:36:16,023] INFO:        Removing all urls already present in DB
[gutenberg2zim.constants::2024-03-07 21:36:16,027] INFO:        Appending urls in DB from rsync result
elfkuzco commented 3 months ago

I would like to fix this @benoit74

benoit74 commented 3 months ago

@elfkuzco please do ^^

elfkuzco commented 2 months ago

Will probably take a bit longer than I expected because in some places, the code expects a string and passing an instance of Path (from pathlib) breaks things when I try to test. My idea is to add type annotations to it too rather than resolving the paths everywhere they are called.

benoit74 commented 2 months ago

Makes sense

elfkuzco commented 2 months ago

@benoit74 , I am almost done with this fix and I have tested the command python src/gutenberg2zim --books 84 and everything works fine (except that the paths displayed in the logs are fully resolved paths).

However, before I run more tests and submit a PR, I would like to point out that the usage of the os.path in the urls.py is fine and shouldn't be switched to pathlib.Path because those are actual strings (just urls) and have nothing to do with files. Using pathlib.Path here doesn't really seem "clean" as it will involve calling str on it too after building the string.

For example,

url = os.path.join("a", "b")

and

url = str(Path("a") / "b")

both achieve the same thing. But using Path seems a little bit harder to read given none of the actual Path operations are ever invoked in this module.

What do you think?

benoit74 commented 2 months ago

Great! Your point is important, thank you for asking. This is in fact even potentially a bug on Windows platforms.

First of all, I'm quite sure the code of your example does not achieve the same thing on Windows platform. You need to use the PurePosixPath class to achieve the same thing (pure because we do not intend to perform any real I/O with this path, posix because we always want / as separator even on Windows platforms).

And this can then be more neatly written like PurePosixPath("a", "b") (hence being quite close to os.path.join). You can even write stuff like PurePosixPath(*list_of_paths, "c") (instead of calling PurePosixPath multiple times like it was done for os.path.join.

elfkuzco commented 2 months ago

I tried to work with PurePosixPath and I noticed that if you call it on a url which has http:// (two consecutive slashes), it strips one off and this in turn causes unintended effects in the rest of the code. Example:

p = PurePosixPath("http://aleph.pglaf.org/cache/epub/84/pg84.epub")

yields

PurePosixPath('http:/aleph.pglaf.org/cache/epub/84/pg84.epub')

Notice that one of the slashes after the http: is gone. I think why it works with os.path is because it simply concatenates the paths "intelligently" using a path seperator. Same issue will arise even if we use Path.

In light of these observations, my opinion is to either leave the implementation as it is with os.path or default to building the strings by adding the slash operator explicitly. The latter is my preferred approach as it is obvious to any reader that they are working with strings. Plus, there's no telling if the implementation of the os.path.join will change in future versions of the language.

benoit74 commented 2 months ago

Oh, if it is a real URL, then it is not a path (ok, that's obvious maybe ^^)

My recommendation would be to still get rid of os.path.join, this is purely misleading. I would consider to use a mix of PurePosixPath (to join the parts of url path or subpath) and simple string concatenation.

If I'm not mistaken,

base_url = os.path.join(
    os.path.join(*list(str(self.b_id))[:-1]), str(self.b_id)
)

would become

base_url = PurePosixPath(*list(str(self.b_id))[:-1]), str(self.b_id))

And

url = os.path.join(self.base, base_url)

would become

url = self.base + base_url

This means we have a convention which says that self.base (and other bases) MUST end with a slash (not a big concern AFAIK, we just have to fix BASE_THREE) and path must not start with a slash (but PurePosixPath never adds it magically)

WDYT of this? I do not mind if you consider it is too cumbersome to do, this is just a proposition to enhance the codebase, not something ultra important.

elfkuzco commented 2 months ago

In my implementation, I used f-strings completely and got rid of Path completely. And I chose the convention that paths end without a slash, so that I have to explicitly add the / myself. This, I think is more obvious to a reader of the code.

Here's a snippet of my refactoring for the build method:

        if self.base == self.BASE_ONE:
            if int(self.b_id) > 10:  # noqa: PLR2004
                components = "/".join(self.b_id[:-1])
                base_url = f"{components}/{self.b_id}"
            else:
                base_url = f"0/{self.b_id}"
            url = f"{self.base}/{base_url}"
        elif self.base == self.BASE_TWO:
            url = f"{self.base}/{self.b_id}"
        elif self.base == self.BASE_THREE:
            url = self.base
        return url

And here's where it is being used in building epub urls:

    name = "".join(["pg", b_id])
    url = f"{u.build()}/{name}.epub"
    url_images = f"{u.build()}/{name}-images.epub"
    url_noimages = f"{u.build()}/{name}-noimages.epub"
    urls.extend([url, url_images, url_noimages])

Aside from this, I think everything works completely. I have run the command for python src/gutenberg2zim --books 84 and python3 src/gutenberg2zim -l en,fr -f pdf --books 100-200 --bookshelves --title-search and everything works fine. Here's a log file for the first command beta.log

I am currently running python src/gutenberg2zim for the full run. If that works and you are okay with using f-strings as I have, I will submit the PR

benoit74 commented 2 months ago

Yes, I'm fine with your proposition!

eshellman commented 2 months ago

Why are you trying to use pathlib on a URL? I think urllib.parse is what you want.

benoit74 commented 2 months ago

I think we all agree that using path methods on a URL is a bad idea

urllib.parse.urljoin is unfortunately not helping much because AFAIK it does not automatically append "/" between URL path segments (which is all we need in fact).

I'm quite pleased with the idea proposed by @elfkuzco to simply use string manipulation for that indeed (I've not yet read the PR yet, but I now think that string manipulation is the best solution for current situation).

rgaudin commented 2 months ago

Pretty sure we're doing similar stuff in other scrapers. I believe we use Path (PurePosixPath would be better) on the path or a parsed URL. Please take a look at other scrapers (wikihow, sotoki?) as this seems pretty common behavior in scrapers

benoit74 commented 2 months ago

I had a look at the PR and confirm I like the string manipulation proposed when manipulating URL path segments, it is way clearer than PurePosixPath or Path which are not working on URLs or os.path.join which is semantically wrong. See https://github.com/openzim/gutenberg/pull/225/files#diff-1c2c0aac8f278200c33b65c8b05dc7367b1687f64c7e8794a1b9da7b1ae5e369

benoit74 commented 2 months ago

I had a look in sotoki and wikihow and failed to find any URL which is dynamically built by concatenating many path segments.

rgaudin commented 2 months ago

it is way clearer than PurePosixPath or Path which are not working on URLs

There's a clue in the name that those are not the same thing 😉 Though, a URL path (path element of an urslplit()) is as close as it gets to a PurePosixPath. It can be very handy in situation where you need to resolve targets from relative notation or ensure/compute targets relative to one another.

We may not need it here nor in any scraper. I thought we did in wikiHow but actually we dont.