libgenapps / LibgenDesktop


downloader #18

Closed Bia10 closed 4 years ago

Bia10 commented 4 years ago

some things annoy me a bit as of now

ex dl

libgenapps commented 4 years ago

The "Server returned HTML page instead of the file" error literally means just that: the server returned some HTML page, e.g. saying that book cannot be found (some of the books get deleted over time) or just some random generic error page (some servers return them as HTTP 200 pages). This response is not the book itself. You can see the contents of the response by enabling logging in the settings and then looking it up in the log file. The timeouts and HTML page responses are something that happens on the server side and therefore is outside of my control.
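As an illustration of the check described above, here is a minimal Python sketch (not the LibgenDesktop code, which is C#) of deciding whether a response is an HTML page rather than a file. The function name `looks_like_html` and the body-sniffing fallback are assumptions made for this example.

```python
def looks_like_html(content_type: str, first_bytes: bytes) -> bool:
    """Return True when a response appears to be an HTML page, not a file.

    Hypothetical helper for illustration only.
    """
    # The media type is the part of Content-Type before any "; charset=..." parameter.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in ("text/html", "application/xhtml+xml"):
        return True
    # Assumption: a server may mislabel the response, so also peek at the body start.
    head = first_bytes.lstrip().lower()
    return head.startswith((b"<!doctype html", b"<html"))
```

A caller would pass the `Content-Type` header value and the first few bytes of the body, and report the error only when the function returns `True`.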

Could you elaborate on books in PE32 format? Do you mean .exe files?

Bia10 commented 4 years ago

saying that book cannot be found (some of the books get deleted over time)

Brilliant, that would be an acceptable error response, though I have never seen it.

or just some random generic error page (some servers return them as HTTP 200 pages).

This does not match my experimental data. As far as I tested the downloader, the HTML page error was always preceded by a 307 response, so I could not find the randomness you describe.

2020-02-11 12:25:02.8850 [1] LibgenDesktop.Models.Download.Downloader.AddLogLine DEBUG Downloader log line: type = INFORMATIONAL, text = "Added to the download queue.".
2020-02-11 12:25:04.2010 [24] LibgenDesktop.Models.Download.Downloader+<SendDownloadRequestAsync>d__32.MoveNext DEBUG Response status code: 307 TemporaryRedirect.
2020-02-11 12:25:04.2010 [24] LibgenDesktop.Models.Download.Downloader+<SendDownloadRequestAsync>d__32.MoveNext DEBUG Response headers:
Connection: keep-alive
Date: Tue, 11 Feb 2020 11:23:24 GMT
Location: http://booksdl.org/get.php?md5=616d0ce4c7bb1f0168f7f3788befb046
Server: nginx
Content-Length: 164
Content-Type: text/html
2020-02-11 12:25:04.2010 [24] LibgenDesktop.Models.Download.Downloader.AddLogLine DEBUG Downloader log line: type = DEBUG, text = "Server response:
307 TemporaryRedirect
Connection: keep-alive
Date: Tue, 11 Feb 2020 11:23:24 GMT
Location: http://booksdl.org/get.php?md5=616d0ce4c7bb1f0168f7f3788befb046
Server: nginx
Content-Length: 164
Content-Type: text/html".

The timeouts and HTML page responses are something that happens on the server side

I would agree.

and therefore is outside of my control.

I would dare to disagree.

The errors are produced on the server side, but it is still within the power of the client application to attempt to handle them. Both of these minor flaws should be handled programmatically, which so far they are not. The result of resolving these errors is either a final error that cannot be handled programmatically, such as the book not existing, or the book itself. I hardly see a need for any results in between: either the book really cannot be obtained or I get it, even though this may take some tries. That is fine as long as it is all automated.

The PE32 file format denotes the executable file format on Windows. It is a security risk to allow the application to download and execute such files without direct user acknowledgement. Although the .exe books should not have the correct format for the OS loader, this can be easily bypassed.

Generally, the app shouldn't allow executable formats like .exe, .com, .sys, .bat, .cmd, etc. to be downloaded and executed; it's a security risk.
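A rough Python sketch of the kind of check suggested here, assuming the extension list from this comment plus a look at the first two bytes (Windows executables start with the `MZ` DOS header signature). The names `RISKY_EXTENSIONS` and `needs_confirmation` are hypothetical and not part of LibgenDesktop.

```python
import os

# Extensions named in the comment above; a real list would likely be longer.
RISKY_EXTENSIONS = {".exe", ".com", ".sys", ".bat", ".cmd"}


def has_risky_extension(file_name: str) -> bool:
    """Case-insensitive check of the file name's final extension."""
    return os.path.splitext(file_name)[1].lower() in RISKY_EXTENSIONS


def starts_with_mz(first_bytes: bytes) -> bool:
    """Windows PE executables begin with the two-byte 'MZ' signature."""
    return first_bytes[:2] == b"MZ"


def needs_confirmation(file_name: str, first_bytes: bytes) -> bool:
    """Ask the user before saving if either the name or the content looks executable."""
    return has_risky_extension(file_name) or starts_with_mz(first_bytes)
```

Checking the content as well as the name covers the case where an executable is served under a harmless-looking extension.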

libgenapps commented 4 years ago

Just to clarify, there are multiple servers with different owners. I do not know any of them, and they do not give me (or anybody else) any notice before making server-side changes. Sometimes these changes are so significant (e.g. URL changes) that the downloader stops working properly for that mirror until the necessary adjustments have been made on the app side. It used to be the case that one of the mirrors would return "book not found" or "database connection limit exceeded" error pages with HTTP 200 responses. I agree that it is possible to parse the text contents of such pages to try to figure out what has happened but with all these constant redesigns on some mirrors it is impossible to predict what will happen to these pages in the future, what kind of errors will get added or removed, or even what constitutes an error text on the page and what is just an irrelevant decoration content.
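To make the trade-off concrete, a phrase-matching classifier could be sketched in Python as below. The two phrases come from this comment; the exact wording on any mirror's page, the labels, and the function name are assumptions, and the sketch is exactly as brittle as described: any redesign that rewords the page falls through to "unknown".

```python
# Phrases taken from the comment above; real pages may word them differently.
KNOWN_ERROR_PHRASES = {
    "book not found": "not_found",
    "database connection limit exceeded": "server_busy",
}


def classify_error_page(html_text: str) -> str:
    """Return a hypothetical label for an HTML error page, or 'unknown'."""
    text = html_text.lower()
    for phrase, label in KNOWN_ERROR_PHRASES.items():
        if phrase in text:
            return label
    # Anything not recognized cannot be told apart from decoration content.
    return "unknown"
```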

The case where an HTTP 307 response is being treated as an HTML page looks strange. This should be handled by this code: https://github.com/libgenapps/LibgenDesktop/blob/master/LibgenDesktop/Models/Download/Downloader.cs#L820. If you have consistent reproduction steps for the current 1.3.5 release, then this is definitely a bug and I will fix it in the next release.

The timeouts should be handled too. The downloader should perform a retry using the initial URL up to a certain number of times specified in the settings before giving up.
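The retry behavior described here can be sketched in Python as follows. This is an illustration under assumptions, not the actual downloader: `fetch` stands for whatever performs the request, `max_attempts` stands for the value from the settings, and a timeout is assumed to surface as `TimeoutError`.

```python
from typing import Callable, Optional


def download_with_retries(
    fetch: Callable[[str], bytes], url: str, max_attempts: int
) -> Optional[bytes]:
    """Retry the initial URL on timeout, giving up after max_attempts tries."""
    for _attempt in range(max_attempts):
        try:
            return fetch(url)
        except TimeoutError:
            # Timed out: try the same initial URL again if attempts remain.
            continue
    return None  # All attempts timed out.
```

Only timeouts are retried in this sketch; any other exception propagates to the caller.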

The executable files usually get banned by the administration after some time but I'll add a warning prompt for potentially harmful file extensions.

Bia10 commented 4 years ago

Thanks. I would love to know more about the server-side problems, but I don't see much of a problem here. It is just reasoning under a high degree of uncertainty, and there should be plenty of heuristic approaches that can be applied.

"book not found" or "database connection limit exceeded" error pages with HTTP 200 responses

Perfectly fine to me, as I have encountered these failures via the website only in the past.

I agree that it is possible to parse the text contents of such pages to try to figure out what has happened but with all these constant redesigns on some mirrors it is impossible to predict what will happen to these pages in the future, what kind of errors will get added or removed, or even what constitutes an error text on the page and what is just an irrelevant decoration content.

Indeed, much more can be done.

The case where an HTTP 307 response is being treated as an HTML page looks strange. This should be handled by this code: https://github.com/libgenapps/LibgenDesktop/blob/master/LibgenDesktop/Models/Download/Downloader.cs#L820. If you have consistent reproduction steps for the current 1.3.5 release, then this is definitely a bug and I will fix it in the next release.

This would need more investigation. As far as I can tell from the logs:

Response status code: 307 TemporaryRedirect always resulted in either

Location: http://booksdl.org/get.php?md5=616d0ce4c7bb1f0168f7f3788befb046
Server: nginx
Content-Length: 164
Content-Type: text/html

or

Location: /ads.php?md5=246d394da9cceded4b29a41b9bcbc629
Server: nginx
Content-Type: text/html; charset=utf-8

Response status code: 301 MovedPermanently always resulted in

Location: http://libgen.lc/ads.php?md5=246d394da9cceded4b29a41b9bcbc629
Server: nginx
Content-Length: 178
Content-Type: text/html
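Note that the `Location` values above come in two forms, absolute and relative, and a client has to resolve the relative form against the URL it requested. A small Python sketch of that step, using the standard `urllib.parse.urljoin`, is shown here. The request URL `http://mirror.example/main/...` is a made-up placeholder, since the logs do not show the original request URL, and `next_url` is a hypothetical name.

```python
from typing import Optional
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


def next_url(request_url: str, status_code: int, location: Optional[str]) -> Optional[str]:
    """Return the URL to follow for a redirect response, or None otherwise."""
    if status_code in REDIRECT_CODES and location:
        # urljoin leaves absolute locations unchanged and resolves relative ones.
        return urljoin(request_url, location)
    return None


# Placeholder request URL (assumption; not taken from the logs).
request = "http://mirror.example/main/246d394da9cceded4b29a41b9bcbc629"
relative = next_url(request, 307, "/ads.php?md5=246d394da9cceded4b29a41b9bcbc629")
absolute = next_url(request, 307, "http://booksdl.org/get.php?md5=616d0ce4c7bb1f0168f7f3788befb046")
```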

The timeouts should be handled too. The downloader should perform a retry using the initial URL up to a certain number of times specified in the settings before giving up.

Indeed they are, but there should be some heuristics for choosing more stable mirrors.
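One possible shape for such a heuristic, sketched in Python purely as an illustration: keep per-mirror success and failure counts and prefer the mirror with the best smoothed success rate. The class `MirrorStats`, the function `rank_mirrors`, and the smoothing choice are all assumptions, not anything the app implements.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MirrorStats:
    name: str
    successes: int = 0
    failures: int = 0

    def success_rate(self) -> float:
        # Add-one smoothing: a mirror with no history scores 0.5 instead of dividing by zero.
        return (self.successes + 1) / (self.successes + self.failures + 2)


def rank_mirrors(stats: List[MirrorStats]) -> List[MirrorStats]:
    """Order mirrors from most to least reliable by smoothed success rate."""
    return sorted(stats, key=lambda m: m.success_rate(), reverse=True)
```

With this scoring, an untried mirror ranks above one that has mostly failed but below one that has mostly succeeded.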

The executable files usually get banned by the administration after some time but I'll add a warning prompt for potentially harmful file extensions.

I have no clue what you mean by 'administration', but you could just prevent their download and you're done.

Besides, who is responsible for managing libgen.sql?

libgenapps commented 4 years ago

I have no plans for implementing error page parsing heuristics as it would require a significant amount of effort with a very low chance for it to work after the next redesign on the server side.

As for the issue with redirects: I see that you were using the libgen.lc mirror, and it seems that there were some changes on that mirror recently. I've updated the mirror.config file in the repository to reflect those changes and ran a quick test with the download item from your example (616d0ce4c7bb1f0168f7f3788befb046). I can see that redirects are handled properly and the download finished without any issues.

libgenapps commented 4 years ago

Mirror configuration file has been updated.