martial-god / Benny-Scraper

Webnovel and Manga Scraper that stores webnovels as EPUBs and mangas as either PDFs or comic book archives
https://feahnthor.github.io/
GNU General Public License v3.0

Error when trying to scrape a manga, not sure what counts as a valid URL #19

Closed TriAttack238 closed 1 year ago

TriAttack238 commented 1 year ago

Hello, I am using a Windows 11 x86-64 machine. I cloned the repo for this project and built it into an executable as specified in the README. I then tried to have the program scrape this manga using this URL: https://mangakakalot.to/undead-unluck-7025

However, the extraction did not complete, and the program printed this stack trace:

20:18:34 Info Novel with url https://mangakakalot.to/undead-unluck-7025 is not in database, adding it now.
20:18:34 Info Getting novel data for MangaKakalotStrategy
20:18:35 Info Response status code: OK
20:18:36 Error Error occurred while getting novel data from table of contents. Error: System.ArgumentNullException: Value cannot be null. (Parameter 'source')
   at System.Linq.ThrowHelper.ThrowArgumentNullException(ExceptionArgument argument)
   at System.Linq.Enumerable.Select[TSource,TResult](IEnumerable`1 source, Func`2 selector)
   at Benny_Scraper.BusinessLogic.Scrapers.Strategy.Impl.NovelDataInitializer.FetchContentByAttribute(Attr attr, NovelDataBuffer novelDataBuffer, HtmlDocument htmlDocument, ScraperData scraperData) in C:\Users\Sean Vo\Github_Repos_Default\Benny-Scraper\Benny-Scraper.BusinessLogic\Scrapers\Strategy\ScraperStrategy.cs:line 135
   at Benny_Scraper.BusinessLogic.Scrapers.Strategy.MangaKakalotInitializer.FetchNovelContentAsync(NovelDataBuffer novelDataBuffer, HtmlDocument htmlDocument, ScraperData scraperData, ScraperStrategy scraperStrategy) in C:\Users\Sean Vo\Github_Repos_Default\Benny-Scraper\Benny-Scraper.BusinessLogic\Scrapers\Strategy\MangaKakalotStrategy.cs:line 38
   at Benny_Scraper.BusinessLogic.Scrapers.Strategy.MangaKakalotStrategy.FetchNovelDataFromTableOfContentsAsync(HtmlDocument htmlDocument) in C:\Users\Sean Vo\Github_Repos_Default\Benny-Scraper\Benny-Scraper.BusinessLogic\Scrapers\Strategy\MangaKakalotStrategy.cs:line 95
20:18:36 Info Finished populating Novel data for Undead Unluck
20:18:36 Info Getting chapters data
20:18:36 Info Using Selenium to get chapters data
20:18:36 Error Error while getting chapters data. System.InvalidOperationException: Sequence contains no elements
   at System.Linq.ThrowHelper.ThrowNoElementsException()
   at System.Linq.Enumerable.First[TSource](IEnumerable`1 source)
   at Benny_Scraper.BusinessLogic.Scrapers.Strategy.ScraperStrategy.GetChaptersDataAsync(List`1 chapterUrls) in C:\Users\Sean Vo\Github_Repos_Default\Benny-Scraper\Benny-Scraper.BusinessLogic\Scrapers\Strategy\ScraperStrategy.cs:line 456
20:18:36 Error Exception when trying to process novel. System.InvalidOperationException: Sequence contains no elements
   at System.Linq.ThrowHelper.ThrowNoElementsException()
   at System.Linq.Enumerable.First[TSource](IEnumerable`1 source)
   at Benny_Scraper.BusinessLogic.Scrapers.Strategy.ScraperStrategy.GetChaptersDataAsync(List`1 chapterUrls) in C:\Users\Sean Vo\Github_Repos_Default\Benny-Scraper\Benny-Scraper.BusinessLogic\Scrapers\Strategy\ScraperStrategy.cs:line 456
   at Benny_Scraper.BusinessLogic.NovelProcessor.AddNewNovelAsync(Uri novelTableOfContentsUri, ScraperStrategy scraperStrategy) in C:\Users\Sean Vo\Github_Repos_Default\Benny-Scraper\Benny-Scraper.BusinessLogic\NovelProcessor.cs:line 91
   at Benny_Scraper.BusinessLogic.NovelProcessor.ProcessNovelAsync(Uri novelTableOfContentsUri) in C:\Users\Sean Vo\Github_Repos_Default\Benny-Scraper\Benny-Scraper.BusinessLogic\NovelProcessor.cs:line 62
   at Benny_Scraper.Program.RunAsync() in C:\Users\Sean Vo\Github_Repos_Default\Benny-Scraper\Benny-Scraper\Program.cs:line 105
20:18:36 Info Elapsed time: 00:00:01.8044361

As far as I can tell, the scraper thinks the chapter list is empty, but I'm not sure. Any tips on getting this working?

P.S. When the scraper actually makes an EPUB or PDF, where does it go? Can I change the output format manually?

feahnthor commented 1 year ago

Issues getting chapter data originate in ScraperStrategy.cs (in Benny-Scraper.BusinessLogic.Scrapers.Strategy), in the FetchContentByAttribute() method; that is where all scrapers go to get things such as authors, genres, and descriptions. In this case it is called from the FetchNovelContentAsync() method in MangaKakalotStrategy.cs.

The issue is that the selector used to collect all the chapter links was returning null, which meant things couldn't proceed properly. I checked the innerHtml of the documentNode and found that the element is no longer being loaded through plain HTTP calls, which means I will need to either invoke the JavaScript that is hiding it or use Selenium to load the page, as I already do for mangas.
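For readers following along in the code: HtmlAgilityPack's SelectNodes() returns null (not an empty collection) when the XPath matches nothing, so a site-layout change surfaces later as an ArgumentNullException inside LINQ. Here is a minimal sketch of the kind of guard involved; the XPath and variable names are illustrative assumptions, not the project's actual selectors:

```csharp
// Hypothetical sketch, not the actual FetchContentByAttribute() body.
// HtmlAgilityPack returns null when the XPath matches no nodes, so guard
// before handing the result to LINQ's Select.
var chapterNodes = htmlDocument.DocumentNode
    .SelectNodes("//ul[@class='chapter-list']//a"); // example XPath, assumed

if (chapterNodes is null)
    throw new InvalidOperationException(
        "Chapter-link selector matched nothing; the site layout may have changed.");

var chapterUrls = chapterNodes
    .Select(node => node.GetAttributeValue("href", string.Empty))
    .ToList();
```

With a guard like this, a layout change on the site fails with a descriptive message instead of the opaque "Value cannot be null. (Parameter 'source')" seen in the log above.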

Here is a quick video of the debug steps: devenv_ogAAqByGtg

For now, please use https://mangakatana.com instead, e.g. https://mangakatana.com/manga/undead-unluck.24191. If the program appears stuck at "Using Selenium to get chapters data", just hit Enter once.

As for the location of both the EPUB and PDF output, it is stored by default in Documents/BennyScrapedNovels/{Novel Name}. This can be found and edited in the GetDocumentsFolder() method of NovelProcessor.cs in the Benny-Scraper.BusinessLogic project. I will go ahead and make the file path something the user can set on startup.
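The default-location logic described above can be sketched as follows; the method and folder names here are illustrative, not necessarily what NovelProcessor.cs actually uses:

```csharp
// Hypothetical sketch of GetDocumentsFolder()-style logic: resolve the user's
// Documents folder and nest output under BennyScrapedNovels/{Novel Name}.
string GetNovelOutputFolder(string novelTitle)
{
    string documents = Environment.GetFolderPath(Environment.SpecialFolder.MyDocuments);
    return Path.Combine(documents, "BennyScrapedNovels", novelTitle);
}
```

Environment.GetFolderPath resolves the per-user Documents directory on any OS, which is why the output lands under Documents even on machines with redirected profiles.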

Please reply if this helps solve your problem. I will push up a few changes that do a better job of logging where things went wrong.

TriAttack238 commented 1 year ago

Thank you, I think it's working now! So was it an issue with how the program currently scrapes the page, something on mangakakalot's end, or something else?

Regarding logging, it could be good to dump the relevant logs in a human-readable format to a text file that can be preserved.

feahnthor commented 1 year ago

> Thank you, I think it's working now! So was it an issue with how the program currently scrapes the page, something on mangakakalot's end, or something else?
>
> Regarding logging, it could be good to dump the relevant logs in a human-readable format to a text file that can be preserved.

For the first question, I would think that mangakakalot made some changes to how they render things on the page. When you say it's working now, do you mean mangakakalot is, or mangakatana?

All the logs are written to appdata/roaming/bennyscraper/logs. That parent folder also contains the database, if you want to view it in SQLiteStudio.

TriAttack238 commented 1 year ago

Well, it started working, but after letting it run all the way through, I got a new error.

15:44:47 Error Error while getting chapters data. OpenQA.Selenium.WebDriverException: disconnected: Unable to receive message from renderer
  (failed to check if window was closed: disconnected: not connected to DevTools)
  (Session info: headless chrome=116.0.5845.140)
   at OpenQA.Selenium.WebDriver.UnpackAndThrowOnError(Response errorResponse, String commandToExecute)
   at OpenQA.Selenium.WebDriver.Execute(String driverCommandToExecute, Dictionary`2 parameters)
   at OpenQA.Selenium.WebDriver.set_Url(String value)
   at OpenQA.Selenium.Navigator.GoToUrl(String url)
   at Benny_Scraper.BusinessLogic.Scrapers.Strategy.ScraperStrategy.GetChapterDataAsync(IWebDriver driver, String urls, String tempImageDirectory) in C:\Users\Sean Vo\Github_Repos_Default\Benny-Scraper\Benny-Scraper.BusinessLogic\Scrapers\Strategy\ScraperStrategy.cs:line 682
   at Benny_Scraper.BusinessLogic.Scrapers.Strategy.ScraperStrategy.GetChaptersDataAsync(List`1 chapterUrls) in C:\Users\Sean Vo\Github_Repos_Default\Benny-Scraper\Benny-Scraper.BusinessLogic\Scrapers\Strategy\ScraperStrategy.cs:line 494
15:44:47 Error Exception when trying to process novel. System.IO.DirectoryNotFoundException: Could not find a part of the path 'C:\Users\Sean Vo\AppData\Local\Temp\2a3b33e4-4367-4c47-94f1-a46203a530b9'.
   at System.IO.FileSystem.GetFindData(String fullPath, Boolean isDirectory, Boolean ignoreAccessDenied, WIN32_FIND_DATA& findData)
   at System.IO.FileSystem.RemoveDirectory(String fullPath, Boolean recursive)
   at Benny_Scraper.BusinessLogic.Scrapers.Strategy.ScraperStrategy.GetChaptersDataAsync(List`1 chapterUrls) in C:\Users\Sean Vo\Github_Repos_Default\Benny-Scraper\Benny-Scraper.BusinessLogic\Scrapers\Strategy\ScraperStrategy.cs:line 542
   at Benny_Scraper.BusinessLogic.NovelProcessor.AddNewNovelAsync(Uri novelTableOfContentsUri, ScraperStrategy scraperStrategy) in C:\Users\Sean Vo\Github_Repos_Default\Benny-Scraper\Benny-Scraper.BusinessLogic\NovelProcessor.cs:line 91
   at Benny_Scraper.BusinessLogic.NovelProcessor.ProcessNovelAsync(Uri novelTableOfContentsUri) in C:\Users\Sean Vo\Github_Repos_Default\Benny-Scraper\Benny-Scraper.BusinessLogic\NovelProcessor.cs:line 62
   at Benny_Scraper.Program.RunAsync() in C:\Users\Sean Vo\Github_Repos_Default\Benny-Scraper\Benny-Scraper\Program.cs:line 105

feahnthor commented 1 year ago

This is harder to debug, as I am not able to reproduce it; it's especially odd since there were no Selenium errors two days ago when you originally opened this issue. I've made a few changes to how I dispose of the drivers. Could you try again? Please clean and rebuild as well; I am hoping it may be due to an outdated driver.

If the problem persists, please send me an email with your logs so I can get a better idea of the step at which things went wrong.

feahnthor commented 1 year ago

I will mark this issue as closed, since the original problem has been addressed. Feel free to open a new issue if the problem persists.