tmplt / bookwyrm

ncurses utility for downloading publicly available ebooks, plugin support
MIT License

libgen: use local database copies instead #88

Open tmplt opened 5 years ago

tmplt commented 5 years ago

The HTML from http://libgen.io/foreignfiction/index.php is not parsed correctly. While the page renders fine in a browser, the HTML cannot be fed directly into BeautifulSoup because some tags are in places they shouldn't be. An alternative interface (currently in a beta phase, but much easier to parse) is available at http://gen.lib.rus.ec/fiction/. I expect more changes to this interface in the coming months, so parsing it correctly is likely a moving target.
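
A minimal sketch of one way around the malformed markup, assuming the html5lib parser (much more tolerant of misplaced tags than the default one) and a plain table layout on the beta fiction page; the selectors are guesses, not the confirmed interface:

```python
# Sketch: parse the (beta) fiction interface with a lenient parser.
# Assumption: the page is reachable with a plain GET and results live in <table> rows.
import requests
from bs4 import BeautifulSoup

resp = requests.get("http://gen.lib.rus.ec/fiction/", timeout=10)
resp.raise_for_status()

# html5lib rebuilds a well-formed tree even when tags are misplaced
# (requires the separate html5lib package).
soup = BeautifulSoup(resp.text, "html5lib")

for row in soup.select("table tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if cells:
        print(cells)
```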

It would probably be a good idea to ask the devs if there are plans to expand the JSON API (See #85), or check how the desktop application gets its data.

tmplt commented 5 years ago

The desktop application uses an imported local copy of the databases. These are all publicly available, but the current latest backups consume roughly 1 GB compressed. If we had these, we could just SQL our way to whatever we want. Should bookwyrm download these databases? Some plugin preparation step? New DB releases are tagged with a proper "Last modified".
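
A rough sketch of the "SQL our way to it" idea, assuming the dumps have already been converted to an sqlite file; the path, table, and column names below are placeholders, not the real libgen schema:

```python
# Query a locally imported copy of the fiction database.
# Placeholder path and schema: fiction(Title, Author, Extension).
import os
import sqlite3

db_path = os.path.expanduser("~/.local/share/bookwyrm/libgen/fiction.db")
conn = sqlite3.connect(db_path)
conn.row_factory = sqlite3.Row

cur = conn.execute(
    "SELECT Title, Author, Extension FROM fiction WHERE Title LIKE ? LIMIT 50",
    ("%dune%",),
)
for row in cur:
    print(dict(row))

conn.close()
```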

tmplt commented 5 years ago

Databases can be downloaded to ~/.local/share. For a start, bookwyrm can expect these files to exist, and we'll find some neat way for it to download them automatically later.
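
The "expect these files to exist" check could look something like the following sketch; the bookwyrm/libgen subdirectory and the file names are assumptions:

```python
# Locate the imported databases under ~/.local/share (or $XDG_DATA_HOME).
import os

data_home = os.environ.get("XDG_DATA_HOME", os.path.expanduser("~/.local/share"))
db_dir = os.path.join(data_home, "bookwyrm", "libgen")

expected = ("fiction.db", "nonfiction.db")  # placeholder file names
missing = [name for name in expected if not os.path.isfile(os.path.join(db_dir, name))]
if missing:
    raise FileNotFoundError(f"missing libgen databases in {db_dir}: {', '.join(missing)}")
```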

tmplt commented 5 years ago

The database backups are MySQL dumps, but we do not want to host our own MySQL server, so instead we can convert the dump to sqlite-compatible statements via mysql2sqlite. After some minor adjustments to the produced file (removing the `libgen.` prefix from created tables in the non-fiction dump; removing `USING BTREE` from the fiction dump; etc.) we can give the database to bookwyrm.
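
A sketch of those adjustments as a post-processing pass over the mysql2sqlite output; the exact patterns are guesses and may need tuning against the real dumps:

```python
# Strip the MySQL-isms mentioned above from the converted statements.
import re

def adjust_statement(line: str) -> str:
    # Non-fiction dump: drop the libgen. schema prefix on created tables.
    line = re.sub(r"`?libgen`?\.", "", line)
    # Fiction dump: sqlite rejects MySQL's USING BTREE index hint.
    line = re.sub(r"\s+USING BTREE", "", line, flags=re.IGNORECASE)
    return line

with open("fiction.sql") as src, open("fiction.cleaned.sql", "w") as dst:
    for line in src:
        dst.write(adjust_statement(line))
```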

tmplt commented 5 years ago

x-rar-compressed databases can be downloaded from http://libgen.io/dbdumps/.

Downloads (with wget at least) seem to cut out every once in a while. An automated download should probably use a low timeout and many retries.
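
A sketch of such a download, assuming requests/urllib3 for the retry logic; the dump file name is a placeholder, and a connection cut mid-transfer would still need a restart (or a Range request) on top of this:

```python
# Download a dump with a short timeout and aggressive connection-level retries.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
session.mount("http://", HTTPAdapter(max_retries=Retry(total=10, backoff_factor=1)))

url = "http://libgen.io/dbdumps/fiction.rar"  # placeholder file name
with session.get(url, stream=True, timeout=(5, 30)) as resp:  # (connect, read) timeouts
    resp.raise_for_status()
    with open("fiction.rar", "wb") as f:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            f.write(chunk)
```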

tmplt commented 5 years ago

The easiest implementation of this approach would be to just feed every entry to bookwyrm and let it do all the heavy lifting (this takes no more than a few seconds on an SSD). The complexity of the libgen plugin will then lie in the preparation: downloading the databases, unarchiving them, and converting them to sqlite databases.

tmplt commented 5 years ago

Local databases are now queried. The bottleneck is not the disk but feeding the items to bookwyrm. This can likely be sped up by spawning some feeder threads, but the current implementation is sufficient for now.
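
A rough sketch of the feeder-thread idea; feed() and the schema are hypothetical stand-ins for the real plugin interface, and whether threads actually help depends on whether the hand-off to bookwyrm releases the GIL:

```python
# Producer/consumer sketch: the main thread reads rows, a small pool feeds them.
import queue
import sqlite3
import threading

def feed_all(db_path, feed, workers=4):
    q = queue.Queue(maxsize=1000)

    def worker():
        while True:
            item = q.get()
            if item is None:  # poison pill: no more rows
                break
            feed(item)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()

    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    for row in conn.execute("SELECT Title, Author FROM fiction"):  # placeholder schema
        q.put(dict(row))
    conn.close()

    for _ in threads:
        q.put(None)
    for t in threads:
        t.join()
```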

tmplt commented 5 years ago

List of things to do/consider before we can close this issue:

- The JSON API can apparently be used to apply future updates to the database, but we'll tackle that later.
- For now, the biggest question is how we should convert the dumps to sqlite3 statements. Do we need all the replacements done in the awk script, or only a subset? The best outcome would be if we can do everything in pure Python. In either case, the whole awk script can probably be converted to Python via re (ugh); see the sketch below.
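
A sketch of what a pure-Python port of a few such replacements could look like with re; whether this subset (or any subset) is enough for the libgen dumps is exactly the open question:

```python
# A small subset of mysql2sqlite-style replacements expressed with re.
import re

MYSQL_ONLY = [
    (re.compile(r"\bAUTO_INCREMENT=\d+\s*"), ""),   # table-level counters
    (re.compile(r"\bAUTO_INCREMENT\b"), ""),        # column attribute
    (re.compile(r"\s*ENGINE=\w+"), ""),             # storage engine clause
    (re.compile(r"\s*DEFAULT CHARSET=\w+"), ""),    # charset clause
    (re.compile(r"\\'"), "''"),                     # MySQL quote escaping
]

def to_sqlite(line: str) -> str:
    for pattern, repl in MYSQL_ONLY:
        line = pattern.sub(repl, line)
    return line.replace("`", '"')  # backticks -> standard identifier quotes
```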

tmplt commented 5 years ago

For the time being, this behavior should be wrapped in a --prepare option that, as the name implies, prepares any and all plugins that require preparation.
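
A hypothetical sketch of the plugin side of such an option; none of these names (PLUGINS, prepare()) are the real bookwyrm interface:

```python
# --prepare: run each plugin's preparation step (download, unpack, convert)
# before any searching happens.
import argparse
import importlib

PLUGINS = ["libgen"]  # placeholder plugin list

def main():
    parser = argparse.ArgumentParser(prog="bookwyrm")
    parser.add_argument("--prepare", action="store_true",
                        help="prepare all plugins that require preparation")
    args = parser.parse_args()

    if args.prepare:
        for name in PLUGINS:
            module = importlib.import_module(name)
            prepare = getattr(module, "prepare", None)
            if callable(prepare):
                prepare()

if __name__ == "__main__":
    main()
```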