karstengit commented 6 years ago

This time your solution is written in Python 3, so I first tried installing only the python3-selenium package. But that is not enough, as always with Python. The python3-pip package alone needs 60 MB. (Sorry, I am not a friend of Python: I always have installation problems, and it is not backwards compatible, so you have to install multiple versions of everything ;-)

After executing in Debian

sudo pip3 install selenium
sudo pip3 install fpdf

I still get

$ python3 scribd_downloader_3.py
Traceback (most recent call last):
  File "scribd_downloader_3.py", line 15, in <module>
    from PIL import Image
ImportError: No module named 'PIL'

What is missing?

tobiasBora commented 6 years ago

Well, the error is pretty clear: the missing module is 'PIL'. I'll update the README. I just looked on Google, and it seems that Pillow has replaced the dead PIL library. So to install it:

sudo pip3 install Pillow

Can you tell me if that is enough to solve the problem?

PS: I'm not a big fan of Python either, but I needed a language compatible with the Selenium backend. Java and Haskell are quite heavy and not as easy to debug as Python for this kind of task (I can just try my code in the top-level interpreter and copy/paste it into the file when the result is what I need). And for quick scripting, Python is not too bad. Also, if you want to avoid compatibility issues between libraries, you may want to use a virtualenv.
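To see every missing module at once instead of hitting one ImportError per run, a small check like the following may help. It is only a sketch: the list of modules is an assumption based on the packages named in this thread (selenium, fpdf, and PIL, which Pillow provides), and the function name is made up for illustration.

```python
import importlib.util

# Import name -> pip package name. This list is an assumption taken from
# the packages mentioned in this thread, not from the script itself.
REQUIRED_MODULES = {"selenium": "selenium", "fpdf": "fpdf", "PIL": "Pillow"}


def missing_packages(required=REQUIRED_MODULES):
    """Return the pip package names whose import name cannot be found."""
    return sorted(
        pip_name
        for import_name, pip_name in required.items()
        if importlib.util.find_spec(import_name) is None
    )


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: " + ", ".join(missing))
        print("Try: pip3 install " + " ".join(missing))
    else:
        print("All required modules are importable.")
```

Running it inside the same interpreter (or virtualenv) that will run the downloader matters, since each environment has its own set of installed packages.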

karstengit commented 6 years ago

Thanks for your support. But Python is still not kind to me. ;-)

Please see what happened: Pillow.log

Here is the complete 300 KB log: pip.zip

tobiasBora commented 6 years ago

It looks like you are missing some basic dependencies needed for Pillow. The full list is here: https://pillow.readthedocs.io/en/latest/installation.html (at the end of the page). Can you try this:

sudo apt-get install libjpeg9 libjpeg9-dev

By the way, what is your OS? Maybe your OS already provides a package for Pillow. For example, Debian unstable has a package python3-willow: Python image library combining Pillow, Wand and OpenCV (Python 3)

karstengit commented 6 years ago

I am using Debian 8 Jessie here. These packages do not seem to exist. I only find:

imgsizer - Generates WIDTH/HEIGHT attributes for IMG tags in HTML files
libjpeg62-turbo - libjpeg-turbo JPEG runtime library
libjpeg62-turbo-dbg - Debugging symbols for the libjpeg-turbo JPEG library
libjpeg62-turbo-dev - Development files for the libjpeg-turbo JPEG library
libjpeg-dev - Development files for the JPEG library [pseudo-package]
libjpeg-turbo-progs - Programs for manipulating JPEG files
libjpeg-turbo-progs-dbg - Programs for manipulating JPEG files (debugging symbols)
gem-plugin-jpeg - Graphics Environment for Multimedia - JPEG support
jp2a - converts jpg images to ascii
libturbojpeg1 - TurboJPEG runtime library - SIMD optimized
libturbojpeg1-dbg - TurboJPEG runtime library - SIMD optimized (debugging symbols)
libturbojpeg1-dev - Development files for the TurboJPEG library
libjpeg-progs - Programs for manipulating JPEG files
libjpeg8 - Independent JPEG Group's JPEG runtime library

Which one should I take?

But there is a package python3-pil. This does the job!

karstengit commented 6 years ago

What is missing in Firefox now?

xxx test.pdf

Scraping url: https://de.scribd.com/doc/196130/xxx
Output: test.pdf
I will start the scraping...
Will load the webdriver for firefox...
Traceback (most recent call last):
  File "scribd_downloader_3.py", line 202, in <module>
    (driver,) = main(args.url, args.output_pdf, verbose=args.verbose)
  File "scribd_downloader_3.py", line 99, in main
    driver = webdriver.Firefox()
  File "/usr/lib/python3/dist-packages/selenium/webdriver/firefox/webdriver.py", line 77, in __init__
    self.binary, timeout),
  File "/usr/lib/python3/dist-packages/selenium/webdriver/firefox/extension_connection.py", line 47, in __init__
    self.profile.add_extension()
  File "/usr/lib/python3/dist-packages/selenium/webdriver/firefox/firefox_profile.py", line 91, in add_extension
    self._install_extension(extension)
  File "/usr/lib/python3/dist-packages/selenium/webdriver/firefox/firefox_profile.py", line 251, in _install_extension
    compressed_file = zipfile.ZipFile(addon, 'r')
  File "/usr/lib/python3.4/zipfile.py", line 923, in __init__
    self.fp = io.open(file, modeDict[mode])
FileNotFoundError: [Errno 2] No such file or directory: '/usr/lib/firefoxdriver/webdriver.xpi'

tobiasBora commented 6 years ago

PIL is the old version of Pillow; I think even Debian stable has now moved to Pillow. Maybe you should consider moving to the current stable release at some point ;-)

As for your problem, it looks like Selenium is not installed. Do you have both Firefox and Selenium installed?

sudo apt install firefox
sudo pip3 install selenium

tobiasBora commented 6 years ago

Also make sure that you have downloaded geckodriver and put it in your PATH.

If that does not solve the problem, then I'm pretty sure it's a version issue: try upgrading Firefox, check that you have the latest Selenium version, and check that you have downloaded geckodriver, extracted it, and put it in the script folder.
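A quick way to check whether both programs are visible on the PATH is sketched below. The executable names `firefox` and `geckodriver` are assumptions (some distributions ship Firefox under a different name), and the helper function is made up for illustration.

```python
import shutil


def find_executables(names=("firefox", "geckodriver")):
    """Map each program name to its full path on PATH, or None if absent."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_executables().items():
        print("{}: {}".format(name, path or "NOT FOUND on PATH"))
```

Note that this only looks at the PATH; a geckodriver binary that sits in the script folder but is not on the PATH will be reported as not found.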

By the way, from the website you can read:

Note that with geckodriver v0.19.0 the following versions are recommended:

    Firefox 55.0 (and greater)
    Selenium 3.5 (and greater)
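To compare an installed version against these minimums, something like the following sketch could be used. It assumes that comparing only the leading numeric components is good enough, so a Debian-style string such as `2.48.0+dfsg1-2~bpo8+1` is read as 2.48.0; both function names are made up for illustration.

```python
def parse_version(text):
    """Turn '2.48.0+dfsg1-2~bpo8+1' into (2, 48, 0), ignoring packaging suffixes."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
        if len(digits) != len(piece):
            break  # stop at the first non-numeric suffix
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum version."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that "3.5" and "3.5.0" compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    print(meets_minimum("2.48.0+dfsg1-2~bpo8+1", "3.5"))
    print(meets_minimum("55.0", "55.0"))
```

The installed version strings still have to be obtained separately, for example from the package manager.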

tobiasBora commented 6 years ago

I made a script for you that downloads the right Firefox and geckodriver versions and "installs" them locally (that is, it won't replace your current Firefox; it will just temporarily change the PATH in your current shell). Can you test it please?

wget https://gist.githubusercontent.com/tobiasBora/20560a360fc9fc0512f6084a39edb377/raw/651833fe0b9d2e49b80ce00a063958670888a67e/set_up_local_scribd_download.sh
bash set_up_local_scribd_download.sh
bash set_up_local_scribd_download.sh # I think running it once should be enough but I did not test it
scribd_downloader_3.py https://www.scribd.com/doc/63942746/chopin-nocturne-n-20-partition chopin.pdf

Thank you.

karstengit commented 6 years ago

You have put a lot of work into this script, thanks. But it does not work:

./set_up_local_scribd_download.sh: line 24: venv/bin/activate: No such file or directory

But no problem, I looked into it and downloaded Firefox. The geckodriver was already downloaded:

$ ./geckodriver 
1517307694524   geckodriver     INFO    geckodriver 0.19.0
1517307694533   geckodriver     INFO    Listening on 127.0.0.1:4444

Selenium seems to be too old:

ii  python3-selenium  2.48.0+dfsg1-2~bpo8+1  all  Python3 bindings for Selenium

But of course the plugins of the regular Firefox were overwritten after the new version started. :-( I have to test it in another installation.

Hmm, if I have to set up a completely new environment to use this tool, it is easier to start a virtual machine and use the old Scribd trick that relies on Flash. But that way does not even work any more. It's really hopeless ...

tobiasBora commented 6 years ago

Hmm, that's really strange: the line that fails has nothing to do with my script, it's just supposed to be a basic command that loads a virtual environment. Does the venv folder exist? If yes, is virtualenv installed? Make sure that you are in an empty folder when you run this script.
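The check for the venv folder can be sketched in a few lines. This assumes the usual virtualenv layout on Linux, where the activation script lives at `bin/activate` inside the environment folder; the function name is made up for illustration.

```python
from pathlib import Path


def find_activate_script(env_dir="venv"):
    """Return the path to bin/activate inside env_dir, or None if it is missing."""
    script = Path(env_dir) / "bin" / "activate"
    return script if script.is_file() else None


if __name__ == "__main__":
    script = find_activate_script()
    if script is None:
        print("venv/bin/activate not found: the virtual environment was not created.")
    else:
        print("Found activation script: {}".format(script))
```

If the folder exists but the script is missing, the command that creates the environment most likely failed earlier in the setup script, and its error message is the one worth looking at.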

So I built a new solution for you that should avoid all of the version conflicts. You just need to be able to install Docker (a recent one; the versions in the repositories are outdated). Once that's done, just run:

sudo docker run -it --shm-size 2g -v $(pwd):/host -w /host tobiasbora/scribd-downloader:18.01 bash
xvfb-run scribd_downloader_3.py "https://www.scribd.com/doc/63942746/chopin-nocturne-n-20-partition" out.pdf

This will download an online image that contains the right versions of everything. I hope everything runs smoothly this time.

karstengit commented 6 years ago

Tobias, I really want to thank you for your involvement. But your new approach does not work on older stable distributions. It is too much effort to get it running for this one function: downloading and installing dozens of packages with hundreds of MB, and compiling newer versions with other unfulfillable dependencies. This is simply not working, sorry. You need a completely new web testing environment to get this solution running.

The old way was no problem. As I wrote, it seems that the script retrieved the content from Scribd without problems, but then the following steps failed. Maybe you can find another, simpler way here?

My last try was to boot my development Debian with the same distribution. I installed the packages pip3 and virtualenv, but the script still did not run:

./set_up_local_scribd_download.sh: line 24: venv/bin/activate: No such file or directory

And sorry, I don't want to compile a new version of Docker, because there will be other new dependencies that will be a problem. It is always a problem when a developer works with the newest stuff.

tobiasBora / scribd-downloader-3

Can't fulfill the python3 requirements #1

python3 scribd_downloader_3.py https://de.scribd.com/doc/196130/xxx test.pdf