rajatomar788 / pywebcopy

Locally saves webpages to your hard disk with images, css, js & links as is.
https://rajatomar788.github.io/pywebcopy/

Incomplete Read #127

Open · I-dontcode opened this issue 4 months ago

I-dontcode commented 4 months ago

I can't copy a full website; it errors out with something about IncompleteRead, see below. I'm using the given full-website copy code with my target URL. I've tried running the code multiple times and get the same issue. How can I fix this? I'm a dumb dumb, so explain it like I'm a 5 year old :)

Traceback (most recent call last):
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\http\client.py", line 597, in _read_chunked
    value.append(self._safe_read(amt))
                 ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\http\client.py", line 642, in _safe_read
    raise IncompleteRead(data, amt-len(data))
http.client.IncompleteRead: IncompleteRead(483 bytes read, 1053 more expected)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\site-packages\urllib3\response.py", line 737, in _error_catcher
    yield
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\site-packages\urllib3\response.py", line 862, in _raw_read
    data = self._fp_read(amt, read1=read1) if not fp_closed else b""
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\site-packages\urllib3\response.py", line 845, in _fp_read
    return self._fp.read(amt) if amt is not None else self._fp.read()
           ^^^^^^^^^^^^^^^^^^
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\http\client.py", line 473, in read
    return self._read_chunked(amt)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\AppData\Local\Programs\Python\Python312\Lib\http\client.py", line 607, in _read_chunked
    raise IncompleteRead(b''.join(value)) from exc
http.client.IncompleteRead: IncompleteRead(0 bytes read)

rajatomar788 commented 4 months ago

Hey @I-dontcode, the errors above originate from the Python standard library's http module and from urllib3, not from pywebcopy itself. Did you try pywebcopy on a different website? Is there a firewall or data rate limiter on your PC or router?

I-dontcode commented 4 months ago

No, I haven't tried a different site, though I did try the "save any single page" code and it worked. I don't believe a firewall or data rate limiter is causing the issue, because when I run the "save full website" code it operates smoothly until I hit that exception. It managed to copy around 40 GB of data before crashing.

Additionally, ChatGPT explained, "This error occurs when the client expects more data to be received than what is actually received. Specifically, it's indicating that the HTTP response body is being read in chunks, and the last chunk received is incomplete." I attempted to run the code multiple times to no avail, suspecting it might be a connection or network issue.

Is there a way for PyWebCopy to handle these exceptions? I stumbled upon a suggestion online to try a third-party library, 'recommended for a higher-level HTTP client interface... pip install requests.' However, I'm uncertain how it works or how it would interact with PyWebCopy or the 'save full website' code, as I don't have much of a coding background.

rajatomar788 commented 4 months ago

The library you found, requests, is exactly what pywebcopy already uses for the HTTP part, so that side is being handled by requests itself. I think that after copying 40 GB there may have been server-side blacklisting or a load-handling action taken against your client.
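For anyone hitting the same error: a minimal sketch of retrying a single flaky download with requests, which typically surfaces http.client.IncompleteRead as a ChunkedEncodingError. The URL, retry count, delay and timeout below are illustrative assumptions, not part of pywebcopy.

import time

import requests
from requests.exceptions import ChunkedEncodingError, ConnectionError

def fetch_with_retries(url, retries=3, delay=5):
    """Fetch a URL, retrying when the chunked response body arrives incomplete."""
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response
        except (ChunkedEncodingError, ConnectionError):
            # Give up after the last attempt, otherwise back off and retry.
            if attempt == retries:
                raise
            time.sleep(delay)

# Usage (illustrative URL):
# page = fetch_with_retries("https://www.example.com")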

I-dontcode commented 4 months ago

I can still access the page via a browser and download files. Does that mean anything with regard to blacklisting or load handling?

rajatomar788 commented 4 months ago

Yes, maybe. The user agent which the library uses may be blocked. Or you could try starting the script from the very page where it breaks.

Try opening the page directly with the requests library.
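A rough sketch of what that direct check could look like, with and without a browser-style User-Agent; the URL and the header string below are illustrative assumptions, not the header pywebcopy actually sends.

import requests

url = "https://www.example.com"  # replace with the failing page

# Default requests User-Agent (something like "python-requests/2.x")
plain = requests.get(url, timeout=30)

# A browser-style User-Agent, in case the default one is being blocked
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    )
}
browser_like = requests.get(url, headers=headers, timeout=30)

print("default UA:", plain.status_code, "browser UA:", browser_like.status_code)

If the default request fails or gets an error status while the browser-style one succeeds, that points at user-agent blocking.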

I-dontcode commented 4 months ago

I ran this code and got a response from the page, using my target URL of course. I'm guessing the user agent is not blocked?

import requests

url = 'https://www.example.com'
response = requests.get(url)

if response.status_code == 200:
    print(response.text)  # This will print the HTML content of the webpage
else:
    print('Failed to retrieve the webpage. Status code:', response.status_code)

I should note, I tried copying the page again and it failed at the same point, right after "pywebcopy.elements:778" logged "already exists at:" for the same file. Any ideas? I'm going to try deleting that file/folder and running it again.

rajatomar788 commented 4 months ago

Yes. Or you can just set overwrite=True.
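For context, the "save full website" snippet from the project README with that flag added might look roughly like the sketch below; the URL, folders, and the exact spelling and placement of the overwrite option are assumptions to verify against your installed pywebcopy version.

from pywebcopy import save_website

# Roughly the README's "save full website" example, with the overwrite
# flag the maintainer mentions added as a keyword argument.
save_website(
    url="https://www.example.com/",     # your target URL
    project_folder="C://savedpages//",  # where the copy is stored
    project_name="my_site",
    bypass_robots=True,
    open_in_browser=False,
    overwrite=True,  # re-download files that already exist on disk
)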

I-dontcode commented 4 months ago

I just remembered something, although I'm not sure if it's related. I initially got an error when running the website copy code:

ImportError: lxml.html.clean module is now a separate project lxml_html_clean.

So I ended up installing lxml-html-clean directly. Could this somehow be related?

rajatomar788 commented 4 months ago

Maybe not, because the lxml clean functionality isn't used anywhere in the save methods.

I-dontcode commented 4 months ago

It failed with the same issue, although at a different file this time. Could this possibly be a character-limit issue with the directory/site path? Just spitballing.

rajatomar788 commented 4 months ago

Yes, it could be. The path limit on most systems is around 256 characters, and since you are going very deep inside the site, it could create errors related to that limit.
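One rough way to confirm whether the path limit is being hit is to scan the project folder for over-long paths; the folder name and the 260-character figure (the classic Windows MAX_PATH) below are assumptions for illustration.

import os

PROJECT_FOLDER = r"C:\savedpages"  # wherever your copy is being written
LIMIT = 260                        # classic Windows MAX_PATH ceiling

# Collect every saved file whose absolute path is at or over the limit.
too_long = []
for root, _dirs, files in os.walk(PROJECT_FOLDER):
    for name in files:
        path = os.path.join(root, name)
        if len(path) >= LIMIT:
            too_long.append(path)

print(f"{len(too_long)} paths at or over {LIMIT} characters")
for p in too_long[:20]:
    print(len(p), p)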

I-dontcode commented 4 months ago

Does pywebcopy have the ability to check for this, and possibly truncate, before copying a file and its site/directory path? How can I avoid this other than shortening the destination directory?

rajatomar788 commented 4 months ago

There is a function called url2path in the urls module. This function is responsible for generating file paths from URLs. It currently doesn't truncate paths, but you can customise it in your copy of the repo.
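Not pywebcopy's actual API, but a hypothetical sketch of what a truncating helper could look like if you customise the path generation; the function name, length limit, and hashing scheme are all illustrative assumptions.

import hashlib
import os

MAX_PATH_LEN = 240  # leave some headroom under the ~260 character Windows limit

def shorten_path(path, limit=MAX_PATH_LEN):
    """Hypothetical helper: shorten an over-long file path.

    Keeps the directory and extension, and replaces the end of the file
    name with a short hash of the full path so the result stays unique.
    """
    if len(path) <= limit:
        return path
    head, tail = os.path.split(path)
    stem, ext = os.path.splitext(tail)
    digest = hashlib.md5(path.encode("utf-8")).hexdigest()[:10]
    # Room left for the stem after directory, hash, separators and extension.
    room = max(limit - len(head) - len(ext) - len(digest) - 2, 8)
    return os.path.join(head, f"{stem[:room]}-{digest}{ext}")

# Usage: wrap whatever path the URL-to-path step produces before writing the file.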