Open vmdrot opened 5 years ago
Hi! Wasn't able to update the target framework (because of multiple errors) so implemented FileDownloader externally. Please find it here: https://github.com/j-galt/FileDownloader.git
Hi. Replaced manual html parsing with Html Agility pack facilities. I'll add it to the BGU proj after you review the code.
Great! Will do...
Hi Illia,
Mostly very fine. Not bad in any case. Just a few remarks.
var name = Regex.Replace(link.InnerHtml, @"(<[^>]*>)|(\t|\n|\r)", "");
Otherwise, looks just fine. Great job!
Thanks & regards, Valeriy
D:\git\FileDownloader\FileDownloader\bin\Debug>FileDownloader.exe
Unhandled Exception: System.Net.WebException: An exception occurred during a WebClient request. ---> System.IO.IOException: Received an unexpected EOF or 0 bytes from the transport stream.
at System.Net.ConnectStream.EndRead(IAsyncResult asyncResult)
at System.Net.WebClient.DownloadBitsReadCallbackState(DownloadBitsState state, IAsyncResult result)
--- End of inner exception stack trace ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.GetResult()
at WebFileDownloader.FileDownloader.
Hi Valera,
Implemented exception handling. Now, if downloading a file fails, the program no longer crashes but continues with the other files. It also makes three tries to download a file if it isn't obtained on the first attempt.
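A minimal sketch of that retry-and-continue behaviour, not the actual FileDownloader code. The `Retry` class and `TryDownload` method are hypothetical names, and the sketch assumes "three tries" means three attempts in total.

```csharp
using System;

public static class Retry
{
    // Runs `download` up to maxAttempts times and returns true on the first success.
    // Any exception counts as a failed attempt; it is logged and swallowed so the
    // caller can move on to the next file instead of crashing.
    public static bool TryDownload(Action download, int maxAttempts = 3)
    {
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                download();
                return true;
            }
            catch (Exception ex)
            {
                Console.Error.WriteLine(
                    $"Attempt {attempt} of {maxAttempts} failed: {ex.Message}");
            }
        }
        return false; // all attempts failed; the caller skips this file
    }
}
```

With a `WebClient` instance `client`, a call could look like `Retry.TryDownload(() => client.DownloadFile(url, path))`. Catching only `WebException` instead of every `Exception` would probably be a tighter choice in real code.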
Put those parameters into Main().
The regex @"(<[^>]*>)|(\t|\n|\r)" is needed to remove tags, tabs and line breaks from the names, because there are items like this one (note that the span tag is in InnerHtml):
<a href="files/Shareholders/322302/index.html">АЙБОКС БАНК (<span class=SpellE>Агрокомбанк</span>)</a><o:p></o:p></span></p>
A Windows directory can't be created with a name like "АЙБОКС БАНК (<span class=SpellE>Агрокомбанк</span>)", because < and > are not allowed in Windows file names.
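For illustration, here is the quoted pattern applied to the link text above. The `LinkNames` class and `Clean` method are hypothetical names for this sketch; only the pattern itself comes from the thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkNames
{
    // Same pattern as quoted in the thread: an HTML tag, or a tab / LF / CR.
    private static readonly Regex TagsAndBreaks =
        new Regex(@"(<[^>]*>)|(\t|\n|\r)");

    // Removes every match, leaving only the visible text of the link.
    public static string Clean(string innerHtml)
    {
        return TagsAndBreaks.Replace(innerHtml, "");
    }
}
```

`LinkNames.Clean("АЙБОКС БАНК (<span class=SpellE>Агрокомбанк</span>)")` returns `АЙБОКС БАНК (Агрокомбанк)`. Note that ordinary spaces are kept, and that a line break between two words is removed without inserting a space.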
PS: implemented tests.
Hi Illia,
Thanks for the update - I'll take a look once my hand is back to normal. Meanwhile, the problem with directory creation is (1) specific to your solution - we don't actually need the directories (at least, for this particular case, since the names of eventual PDF files are unique (i.e.
Hi Valeriy,
Now files are saved into a common directory ("feature/sharedStorage" branch). The regex @"(<[^>]*>)|(\t|\n|\r)" isn't used any more.
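One way to name files in a shared directory is to take the last path segment of each URL. This is only a sketch under the assumption stated earlier in the thread that the PDF names are unique; `TargetPaths` and `FileNameFromUrl` are hypothetical names, not code from the branch.

```csharp
using System;
using System.IO;

public static class TargetPaths
{
    // Returns the last path segment of the URL, e.g. "313849_20150115.pdf".
    public static string FileNameFromUrl(string url)
    {
        var uri = new Uri(url);
        return Path.GetFileName(uri.LocalPath);
    }
}
```

The result can then be joined with the shared directory via `Path.Combine(targetDirectory, fileName)`. If two URLs ever ended in the same file name, the later download would overwrite the earlier one.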
An interesting thing is that I get a different number of files on different machines: 1497 files on the first one and 1501 on the second. The problem is that a few files aren't available on the first machine. Even if I navigate via a browser to a specific web folder on the first computer, I see fewer files available to download than on the second one. They aren't visible even in the HTML markup.
These files are:
https://bank.gov.ua/files/Shareholders/313849/313849_20150115.pdf
https://bank.gov.ua/files/Shareholders/313582/313582_20150209.pdf
https://bank.gov.ua/files/Shareholders/380816/380816_20171010_1723-old.pdf
https://bank.gov.ua/files/Shareholders/328384/328384_20100208.pdf
I also implemented a counter of downloaded files. We can get rid of it in case performance is important (it is thread safe).
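A thread-safe counter of this kind can be sketched with `Interlocked`. The `DownloadCounter` class is a hypothetical name, and this is not necessarily how the counter in the repository is written.

```csharp
using System.Threading;

public sealed class DownloadCounter
{
    private int _count;

    // Current value; Volatile.Read avoids returning a stale cached value.
    public int Count => Volatile.Read(ref _count);

    // Atomically adds one and returns the new value, so concurrent
    // download tasks never lose an update.
    public int Increment() => Interlocked.Increment(ref _count);
}
```

An atomic increment is cheap next to a network download, so the cost of keeping such a counter is likely negligible.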
Looking forward to your code review. Thanks.
404 - https://bank.gov.ua/http://site.bank.gov.ua:9091/files/Shareholders/380957/index.html Plus see my suggested changes in branch CmdPrmsConvey
D:\git\FileDownloader>git push --set-upstream origin CmdPrmsConvey
remote: Permission to j-galt/FileDownloader.git denied to vmdrot.
fatal: unable to access 'https://github.com/j-galt/FileDownloader.git/': The requested URL returned error: 403
Hi Valeriy,
Fixed that exception. Actually, there are incorrect hrefs in the HTML:
Also implemented logging.
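The 404 URL reported earlier in the thread looks like the site root prepended to an href that was already absolute. That reading is an assumption, and the actual fix is not shown in the thread. Under it, one way to avoid the malformed address is to let `System.Uri` resolve each href against the page address. `Links` and `Resolve` are hypothetical names.

```csharp
using System;

public static class Links
{
    // Resolves href against the page it was found on. A relative href is
    // appended to the base; an absolute href is returned unchanged rather
    // than being glued onto the base address.
    public static Uri Resolve(Uri pageUri, string href)
    {
        return new Uri(pageUri, href);
    }
}
```

With the base `https://bank.gov.ua/`, the relative href `files/Shareholders/322302/index.html` resolves to `https://bank.gov.ua/files/Shareholders/322302/index.html`, while `http://site.bank.gov.ua:9091/files/Shareholders/380957/index.html` stays as it is. Whether that host is reachable from outside is a separate question this sketch does not address.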
Index page address: https://bank.gov.ua/control/uk/publish/article?art_id=6738234&cat_id=51342