vmdrot / BGU


Bank ownership structures: download #6

Open vmdrot opened 5 years ago

vmdrot commented 5 years ago

Index page URL: https://bank.gov.ua/control/uk/publish/article?art_id=6738234&cat_id=51342

j-galt commented 5 years ago

Hi! I wasn't able to update the target framework (it produced multiple errors), so I implemented FileDownloader as a separate project. Please find it here: https://github.com/j-galt/FileDownloader.git

j-galt commented 5 years ago

Hi. I replaced the manual HTML parsing with Html Agility Pack. I'll add it to the BGU project after you review the code.

vmdrot commented 5 years ago

Great! Will do...

vmdrot commented 5 years ago

Hi Illia,

Mostly very good; just a few remarks.

  1. Please consider making it more robust. For instance, I encountered an error (stack trace attached), most probably caused by a temporary network (Wi-Fi, etc.) failure.
  2. It's great that you've parameterized the fileDownloader class. It would be nice to pass all of your configurable downloader start arguments through Program.Main(string[] args), which you currently make no use of. Please find an example in https://usga.visualstudio.com/USGA-HCS-HandicapComputation/_git/usga.hcs.hc.qa .
  3. Using a regex to match the URL format is probably the only viable option, but using one to get the inner text of links is surely overkill.
    • Please consider replacing the line below with just link.InnerText:

var name = Regex.Replace(link.InnerHtml, @"(<[^>]*>)|(\t|\n|\r)", "");

Otherwise, looks just fine. Great job!

Thanks & regards, Valeriy

D:\git\FileDownloader\FileDownloader\bin\Debug>FileDownloader.exe

Unhandled Exception: System.Net.WebException: An exception occurred during a WebClient request. ---> System.IO.IOException: Received an unexpected EOF or 0 bytes from the transport stream.
   at System.Net.ConnectStream.EndRead(IAsyncResult asyncResult)
   at System.Net.WebClient.DownloadBitsReadCallbackState(DownloadBitsState state, IAsyncResult result)
   --- End of inner exception stack trace ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.GetResult()
   at WebFileDownloader.FileDownloader.d6.MoveNext() in D:\git\FileDownloader\FileDownloader\FileDownloader.cs:line 101
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.GetResult()
   at WebFileDownloader.FileDownloader.d5.MoveNext() in D:\git\FileDownloader\FileDownloader\FileDownloader.cs:line 79
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.GetResult()
   at WebFileDownloader.FileDownloader.d5.MoveNext() in D:\git\FileDownloader\FileDownloader\FileDownloader.cs:line 69
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.GetResult()
   at WebFileDownloader.FileDownloader.d4.MoveNext() in D:\git\FileDownloader\FileDownloader\FileDownloader.cs:line 27
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.ConfiguredTaskAwaitable.ConfiguredTaskAwaiter.GetResult()
   at WebFileDownloader.Program.Main(String[] args) in D:\git\FileDownloader\FileDownloader\Program.cs:line 16
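
A transient failure like the one in the trace above is commonly handled with a bounded retry around the download call. The following is only a minimal sketch of that idea, not code from FileDownloader: the names Retry, WithRetriesAsync, maxAttempts and delay are hypothetical, and catching every exception type is a simplification.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper, not part of FileDownloader.
public static class Retry
{
    // Runs `action` up to `maxAttempts` times, waiting `delay` after each failed attempt.
    // If the last attempt also fails, its exception propagates to the caller.
    public static async Task<T> WithRetriesAsync<T>(
        Func<Task<T>> action, int maxAttempts, TimeSpan delay)
    {
        if (maxAttempts < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));
        }

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await action();
            }
            // Simplification: a real downloader would likely catch only
            // network-related exceptions such as WebException or IOException.
            catch (Exception) when (attempt < maxAttempts)
            {
                // Not the last attempt yet: wait, then try again.
                await Task.Delay(delay);
            }
        }
    }
}
```

With WebClient, a call could then be wrapped as Retry.WithRetriesAsync(() => client.DownloadDataTaskAsync(url), 3, TimeSpan.FromSeconds(2)), where client and url are whatever the downloader already has at hand.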

j-galt commented 5 years ago

Hi Valera,

  1. Implemented exception handling. Now, if downloading a file fails, the program no longer crashes but continues downloading the other files. It also makes up to three attempts to download a file if the first attempt fails.

  2. Moved those parameters into Main().

  3. The regex @"(<[^>]*>)|(\t|\n|\r)" is needed to remove tags and whitespace characters (tabs and line breaks) from the names, because there are items like this one (note the span tag in the inner HTML): <a href="files/Shareholders/322302/index.html">АЙБОКС БАНК (<span class=SpellE>Агрокомбанк</span>)</a><o:p></o:p></span></p> A Windows directory can't be created with a name like "АЙБОКС БАНК (<span class=SpellE>Агрокомбанк</span>)".
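
To make the naming problem concrete, here is a rough sketch of turning a link's inner HTML into a directory-safe name. It is a variation on the regex quoted above, not the code in the repository: the names LinkNames and ToDirectoryName are hypothetical, whitespace runs are collapsed to a single space rather than deleted, and Windows rules such as reserved names or trailing dots are not handled.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of FileDownloader.
public static class LinkNames
{
    // Turns the inner HTML of a link into a string usable as a directory name.
    public static string ToDirectoryName(string innerHtml)
    {
        // 1. Drop tags such as <span class=SpellE> and </span>.
        string text = Regex.Replace(innerHtml, @"<[^>]*>", "");

        // 2. Decode HTML entities, then collapse tabs, line breaks and
        //    repeated spaces into a single space.
        text = WebUtility.HtmlDecode(text);
        text = Regex.Replace(text, @"\s+", " ").Trim();

        // 3. Replace whatever the current OS forbids in file names.
        foreach (char c in Path.GetInvalidFileNameChars())
        {
            text = text.Replace(c, '_');
        }

        return text;
    }
}
```

For the item quoted above, the span tags are removed and the parentheses and Cyrillic text are kept, so the result is a name Windows accepts.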

PS: implemented tests.

vmdrot commented 5 years ago

Hi Illia,

Thanks for the update. I'll take a look once my hand is back to normal. Meanwhile, the problem with directory creation is specific to your solution: we don't actually need the directories, at least in this particular case, since the names of the eventual PDF files are unique (i.e. _.pdf). Still, if you want to retain your current approach (i.e. saving each bank's files in a separate directory named after the link's inner text), you can reuse the FilesFoldersLatinizer utility I created the other day specifically for this purpose; it converts names from the uk-UA to the en-** locale: https://github.com/vmdrot/EVLVX_RADIO_UTILS/tree/master/FilesFoldersLatinizer

Thanks & regards, Valeriy

j-galt commented 5 years ago

Hi Valeriy,

Now files are saved into a common directory (the "feature/sharedStorage" branch), and @"(<[^>]*>)|(\t|\n|\r)" isn't used any more.

Interestingly, I get a different number of files on different machines: 1497 on the first one and 1501 on the second. A few files simply aren't available on the first machine. Even when I navigate to the specific web folder in a browser on the first computer, fewer files are offered for download than on the second one; the missing files aren't even present in the HTML markup. These files are:

https://bank.gov.ua/files/Shareholders/313849/313849_20150115.pdf
https://bank.gov.ua/files/Shareholders/313582/313582_20150209.pdf
https://bank.gov.ua/files/Shareholders/380816/380816_20171010_1723-old.pdf
https://bank.gov.ua/files/Shareholders/328384/328384_20100208.pdf

I also implemented a counter of downloaded files. It is thread-safe; we can get rid of it if performance is important.
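
For reference, a thread-safe counter of this kind can be as small as the sketch below. This is an illustration rather than the implementation in the branch: the class name DownloadCounter and its members are hypothetical.

```csharp
using System.Threading;

// Hypothetical counter, not the class used in FileDownloader.
public sealed class DownloadCounter
{
    private int _count;

    // Atomically adds one and returns the new total.
    public int Increment() => Interlocked.Increment(ref _count);

    // Reads the current total.
    public int Count => Volatile.Read(ref _count);
}
```

An Interlocked increment is a single atomic operation, so next to the network and disk I/O of a download its cost is negligible.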

Looking forward to your code review. Thanks.

vmdrot commented 5 years ago

I get a 404 for https://bank.gov.ua/http://site.bank.gov.ua:9091/files/Shareholders/380957/index.html

Also, see my suggested changes in the CmdPrmsConvey branch.

vmdrot commented 5 years ago

D:\git\FileDownloader>git push --set-upstream origin CmdPrmsConvey
remote: Permission to j-galt/FileDownloader.git denied to vmdrot.
fatal: unable to access 'https://github.com/j-galt/FileDownloader.git/': The requested URL returned error: 403

j-galt commented 5 years ago

Hi Valeriy,

Fixed that exception. It turns out there are incorrect hrefs in the HTML (screenshot attached).

Also implemented logging.
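
The 404 URL reported earlier in the thread appears to be an absolute href appended to the site root as if it were relative. One way to avoid that, sketched here under assumptions, is to let System.Uri do the resolution: relative hrefs are combined with the base address, and absolute hrefs are returned unchanged. The names Hrefs, TryResolve and IsSameHost are hypothetical, and using https://bank.gov.ua/ as the base address is an assumption made for illustration.

```csharp
using System;

// Hypothetical helper, not part of FileDownloader.
public static class Hrefs
{
    // Resolves an href found on a page against that page's base address.
    // Returns false if the href cannot be turned into a valid URI.
    public static bool TryResolve(Uri baseUri, string href, out Uri resolved)
    {
        return Uri.TryCreate(baseUri, href, out resolved);
    }

    // True when the resolved link stays on the same host as the base address.
    // Links that point elsewhere can then be logged or skipped.
    public static bool IsSameHost(Uri baseUri, Uri resolved)
    {
        return string.Equals(baseUri.Host, resolved.Host, StringComparison.OrdinalIgnoreCase);
    }
}
```

With this approach the href http://site.bank.gov.ua:9091/files/Shareholders/380957/index.html resolves to itself instead of being glued onto the base, and IsSameHost reports that it points to a different host.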