muldjord / skyscraper

Powerful and versatile game scraper written in c++
GNU General Public License v3.0

Allow to skip folders for Scraping #326

Closed sromeroi closed 2 years ago

sromeroi commented 2 years ago

Provide support to skip folders (and their subfolders) for both scraping and gamelist generation

It would be nice if all the files and subfolders in a specific folder were ignored for scraping by creating a ".skyscraper-ignore-scraping" file in the folder to ignore. I don't know if it iterates the folders recursively or generates a folder list first, but it would be a matter of checking for the existence of the file and then not iterating that folder (or not adding it to the list) if it exists.
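A minimal sketch of the check described above, written in Python purely for illustration (Skyscraper itself is C++, and this is not its code). The function name `files_to_scrape` and the demo folder layout are hypothetical; only the marker file name comes from this request.

```python
import os
import tempfile

MARKER = ".skyscraper-ignore-scraping"  # marker name proposed in this request


def files_to_scrape(root):
    """Yield files under root, pruning any folder that contains MARKER."""
    for dirpath, dirnames, filenames in os.walk(root):
        if MARKER in filenames:
            dirnames[:] = []  # do not descend into the subfolders
            continue          # and skip the files of this folder too
        dirnames.sort()       # deterministic traversal order
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)


def _touch(root, rel):
    """Create an empty file at root/rel, creating parent folders as needed."""
    path = os.path.join(root, *rel.split("/"))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "w").close()


def demo():
    """Build a tiny hypothetical ROM tree and list what would be scraped."""
    with tempfile.TemporaryDirectory() as root:
        for rel in ("1 US - A-E/Axelay.zip", "4 Hacks/a.zip", "4 Hacks/sub/b.zip"):
            _touch(root, rel)
        _touch(root, "4 Hacks/" + MARKER)
        return [os.path.relpath(p, root) for p in files_to_scrape(root)]
```

In this sketch the whole "4 Hacks" subtree is pruned as soon as the marker is seen, so only the file in "1 US - A-E" is returned.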

Additional context

I have an arcade machine with Retropie-CRT and the ROMs folders contain the SmokeMonster's roms packs.

As an example, this is how SNES rom packs looks:

$ ls /home/pi/RetroPie/roms/snes/
'1 US - A-E'
'1 US - F-M'
'1 US - N-R'
'1 US - S'
'1 US - T-Z'
'2 Europe - A-Z'
'2 Japan - A-Z'
'2 Other Regions - A-Z'
'2 Unlicensed - A-Z'
'3 Special Chip Games'
'4 Betas, Prototypes, Revisions'
'4 Hacks'
'4 Homebrew'
'4 Translations'
'5 Tools & Service Test Carts'

It took several hours for skyscraper to attempt to fetch the data from the "Unlicensed", "Betas" and "Hacks" folders (they also exist for Megadrive, GB, GBA, SMS, etc systems). They have thousands of files that don't get a match in ScreenScraper, but it managed to find descriptions and screenshots for some of them.

But now, when I add a new SNES game (let's say: "Axelay (USA) [FASTROM].zip" into "1 US - A-E") and I want to re-scrape snes (just to get screenshots for a couple of new games), it gets stuck again for hours in the Hacks, Betas and Unlicensed folders, re-requesting from ScreenScraper all the ROMs for which it didn't find any match in the previous scrapes. This is a great waste of resources for the source database servers (skyscraper, arcadedb, etc).

I could move them out of the "snes" folder before scraping, but then the screenshots/descriptions that I have for some of the games inside those folders from previous scrapes would be lost when I generate game lists. This is also a manual process for each of the "systems"...

It would be interesting if Skyscraper, when it's iterating folders, completely skipped those containing a .skyscraper-ignore-scraping file.

That way I could create the file in several folders and only the ones with new content would be scraped.

But those folders should NOT be ignored on gamelist generation, so that the files inside those folders are checked against the gathered data so that any data present in the cache folder is used to provide Name / Info / Screenshot for them.

Then, the typical process for a user would be:

1.- Perform a scraping for all the folders (hacks included), taking several hours.
2.- Create the .skyscraper-ignore-scraping file in those folders that won't ever be updated.
3.- In the future, when a new game is added, scrape again and only the folders with possible new files will be scraped, much faster, because they will be folders with 99% of items already matched (and, thus, skipped), while the folders with 99% of not-found results (Hacks, Unlicensed, etc) will be skipped.

Thanks!

muldjord commented 2 years ago

Hi, Thank you for the suggestion. This is now implemented in my local development version. I just have to test it some more before I release it.

It works by looking for the file .skyscraperignore in all subfolders, including the base input folder. If it finds one, it won't include files from that folder.

I've also added the --excludefrom FILENAME option where FILENAME is a file containing a list of files that should be ignored. This can also be set in config.ini using the excludeFrom="FILENAME" option in a [main] or [PLATFORM] section.
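As a rough illustration of what an exclude list does, here is a small Python sketch. The thread does not describe the exact format of the file passed to --excludefrom, so the "one file name per line" layout, the helper names `load_excludes` and `filter_excluded`, and the sample file names are all assumptions, not Skyscraper's actual behaviour.

```python
import os
import tempfile


def load_excludes(list_path):
    """Read one file name per line; blank lines are skipped (assumed format)."""
    with open(list_path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def filter_excluded(file_names, excludes):
    """Keep only the names that are not present in the exclude set."""
    return [name for name in file_names if name not in excludes]


def demo():
    """Write a hypothetical exclude list and apply it to three ROM names."""
    with tempfile.TemporaryDirectory() as tmp:
        list_path = os.path.join(tmp, "excludes.txt")
        with open(list_path, "w", encoding="utf-8") as handle:
            handle.write("Bad Hack.zip\n\nOld Beta.zip\n")
        roms = ["Axelay (USA).zip", "Bad Hack.zip", "Old Beta.zip"]
        return filter_excluded(roms, load_excludes(list_path))
```

Compared with the dotfile approach, a list like this keeps the ROM folders untouched but has to be maintained by hand whenever files are added.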

I'll update here when I'm done testing and it is released. Version will be 3.7.0.

sromeroi commented 2 years ago

Thanks a lot! I wasn't expecting such a quick response because your github warned about this project being a bit "abandoned" because you moved to other projects... But as the change seemed to be quick/easy to implement (at least, not too complex), I still had some hope :)

I'm looking forward to updating it in my Retropie system :)

Note that now, when I add a new rom (new homebrew game, fastrom/SA1 hack, etc) to a system, I can just create .skyscraperignore in the 4-5 base folders that didn't change, and let skyscraper iterate only the folders that changed for scraping (but ALL folders for gamelist generation).

This can save thousands of hits to "source" scraping servers :)

muldjord commented 2 years ago

...your github warned about this project being a bit "abandoned" because you moved to other projects

Oh, it's certainly not abandoned. It's just not my main focus anymore. :) I do still update it quite regularly, especially when users point out bugs or come up with feature requests that are very useful and / or easy to implement.

This can save thousands of hits to "source" scraping servers :)

Yes. And as this is of high priority to the Skyscraper project, I decided to look into it. And yes, it was also a pretty simple thing to implement. :)

muldjord commented 2 years ago

3.7.0 now released. Let me know if you run into any issues with the new functionality.

sromeroi commented 2 years ago

Hi.

I just compiled it from sources in RetroPie-crt (v3.7.0).

Not sure what I'm doing wrong but I did:

$ touch "RetroPie/roms/snes/4 Translations/.skyscraperignore"

Then I launched a scraping for SNES and it is trying to "scrape" the file:

"snes/4 Translations/Spanish/Top Gear (U) [T+Spa100%_Tanero].zip":

#546/6544, (0/546)
Elapsed time   : 00:00:38
Est. time left : 00:07:05
#547/6544 (T2) ---- Skipping game 'Top Gear (U) [T+Spa100_Spctrm]' since 'onlymissing' flag has been set ----

(It is skipping it because of "onlymissing", but the point is ... it should not even enter the folder and iterate the files in it).

Is this due to the file not being directly in "4 Translations" but in "4 Translations/Spanish/"?

I expected the file in the parent folder to also exclude children... Any idea? Or should I create the ignore dotfile in each subfolder manually?

Also, I noticed that it is not entering the normal folders... maybe the condition to skip the folder is negated in the code? It looks like it is skipping the folders with no dotfile in them, and iterating the ones with the file in them :-?

muldjord commented 2 years ago

The file does not exclude children. Otherwise you wouldn't be able to include subfolders of subfolders where a file is found. I think this is the way to go, so you'd have to add a .skyscraperignore to each of those folders.

If I understand you correctly, it does not scrape /home/pi/RetroPie/roms/snes now? Maybe I flipped the boolean, I'll have to check. Thanks.

muldjord commented 2 years ago

Also, I noticed that it is not entering the normal folders... maybe the condition to skip the folder is negated in the code? Looks like is skipping the folders with no dotfile in it, and iterating the ones with the file in it :-?

Can you elaborate? I just tried scraping a folder such as /home/pi/RetroPie/roms/snes. It works fine without a .skyscraperignore and is skipped when there is a .skyscraperignore as expected. So I might not understand what you meant by the above.

sromeroi commented 2 years ago

The file does not exclude children. Otherwise you wouldn't be able to include subfolders of subfolders where a file is found. I think this is the way to go, so you'd have to add a .skyscraperignore to each of those folders.

If I understand you correctly, it does not scrape /home/pi/RetroPie/roms/snes now? Maybe I flipped the boolean, I'll have to check. Thanks.

Oh, I was expecting the file to "cut" all the possible children subfolders.

Inside 4 - Translations I have Spanish, Russian, English, and so on. So, by skipping the translations folder, I was expecting skyscraper to ignore all the translations folders for scraping without having to create an individual file in each of them.

Just to see my use case, this is the folder structure I'm working with (typical SmokeMonster's ROM pack):

/home/pi/RetroPie/roms/snes $ tree -d 
├── 1 US - A-E
├── 1 US - F-M
├── 1 US - N-R
├── 1 US - S
├── 1 US - T-Z
├── 2 Europe - A-Z
│   ├── PAL - A-J
│   ├── PAL - K-R
│   └── PAL - S-Z
├── 2 Japan - A-Z
│   ├── Japan A-C
│   ├── Japan D-F
│   ├── Japan G-I
│   ├── Japan J-L
│   ├── Japan M-O
│   ├── Japan P-R
│   ├── Japan S-Super Final
│   ├── Japan Super Fire-Syvalion
│   └── Japan T-Z
├── 2 Other Regions - A-Z
│   └── Spain
├── 2 Unlicensed - A-Z
│   └── Unpatched
├── 3 Special Chip Games
│   ├── CX4
│   ├── DSP 1-4
│   │   ├── Sort By - DSP Chip Revision
│   │   │   ├── DSP1 Chip
│   │   │   ├── DSP2 Chip
│   │   │   ├── DSP3 Chip
│   │   │   └── DSP4 Chip
│   │   └── Super Mario Kart Tracks & Hacks
│   ├── OBC-1
│   ├── SA-1
│   │   ├── Demos & Tools
│   │   ├── Dragon Ball Z - Hyper Dimension Hacks
│   │   │   └── Translations
│   │   ├── Kirby Hacks
│   │   │   └── Translations
│   │   ├── SD Gundam G Next Hacks
│   │   ├── Super Mario RPG Hacks
│   │   │   └── Translations
│   │   └── Super Mario World SA-1 Hacks
│   │       └── Contest Entries
│   ├── S-DD1
│   │   └── Patched
│   ├── S-RTC
│   ├── ST010
│   ├── Super FX
│   │   ├── Hacks
│   │   ├── Prototypes
│   │   └── Sort By - Super FX Chip Revision
│   │       ├── V1 SuperFX Mario Chip
│   │       ├── V2 SuperFX GSU-1
│   │       ├── V3 SuperFX GSU-2
│   │       └── V4 SuperFX GSU-2-SP1
│   └── Unsupported Chip Games
│       └── Sort By - Chip Type
│           ├── Campus Challenge '92
│           ├── Nintendo PowerFest '94
│           ├── SPC7110
│           ├── ST-011
│           └── ST-018
├── 4 Betas, Prototypes, Revisions
│   ├── Prototypes
│   └── Revisions
│       ├── Europe
│       ├── Japan
│       └── USA
├── 4 Hacks
│   ├── Hacks - A-R
│   ├── Hacks - S-Z
│   │   ├── Super Mario World Hacks
│   │   └── Super Metroid Hacks
│   ├── NES-to-SNES Hacks
│   ├── PAL-to-NTSC 60Hz Patched
│   │   ├── PAL-to-NTSC - A-I
│   │   ├── PAL-to-NTSC - J-R
│   │   └── PAL-to-NTSC - S-Z
│   ├── Selections
│   └── Speed Hacks (v2021-12-24)
├── 4 Homebrew
│   └── Demos & Intros
├── 4 Translations
│   ├── English
│   │   ├── English A-M
│   │   └── English N-Z
│   ├── Selections
│   └── Spanish
└── 5 Tools & Service Test Carts
    ├── BIOS
    ├── Blargg Hardware Tests
    └── CPU

Then, after scraping all 7000 files, let's say Viktor Vitela releases one of his super FastRom patches (like Castlevania's, Axelay's, etc). I would store Axelay (U) (FastROM).zip in 1 - US A-E (and other roms in their right folders) and re-scrape. But it gets stuck for hours in the Hacks, Translations, Special Chips, etc folders.

By placing .skyscraperignore files into Hacks, Translations, etc. (just by creating 5 files), I expected files inside those folders (subfolders included) not to be scraped again (but iterated for gamelist generation), as I already have that data in the cache from my initial 7000-file scraping process.

In the example above, just by placing the dotfile inside 3 Special Chip Games, that folder and its subfolders would be skipped, as I didn't change anything in them and I don't want thousands of files to be checked again.

If subfolders are not skipped too, then I have to create the dotfiles manually in all subfolders for them to be skipped.
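With the non-recursive behaviour of 3.7.0, the manual step could be automated with a short helper. The following Python sketch is only an illustration: `mark_tree` and the demo folder names are hypothetical, and the only thing taken from the thread is the marker file name .skyscraperignore.

```python
import os
import tempfile


def mark_tree(folder, marker=".skyscraperignore"):
    """Create an empty marker file in folder and in every subfolder below it.

    Returns the list of marker paths that were newly created.
    """
    created = []
    for dirpath, _dirnames, _filenames in os.walk(folder):
        path = os.path.join(dirpath, marker)
        if not os.path.exists(path):
            open(path, "w").close()
            created.append(path)
    return created


def demo():
    """Mark a small hypothetical tree twice; the second run creates nothing."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "English", "English A-M"))
        os.makedirs(os.path.join(root, "Spanish"))
        first = mark_tree(root)
        second = mark_tree(root)
        return len(first), len(second)
```

Calling `mark_tree` on a folder such as "4 Translations" would place a marker in that folder and in each of its subfolders, and running it again is harmless because existing markers are left alone.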

Thanks!

sromeroi commented 2 years ago

Sorry to bother you again, I just don't understand why creating the .skyscraperignore file in a folder should not also prevent its children from being scraped.

Could you share an example use case for excluding a specific folder from scraping without excluding its children? Is this a typical use case that I just didn't find because of my specific needs? (I just use SmokeMonster's rom packs for all the systems they are available for, as in this link.)

I just thought the expected behaviour was to completely cut the scraping at that specific point of the directory tree while traversing it, sorry if I didn't explain it properly when I opened the feature-request.

Maybe you could just add both cases: .skyscraperignore and .skyscraperignoretree, one skipping only that folder and the other one skipping all folders starting on it.

Not sure how much effort it would be. For the folder iteration it should be easy, but adding a new --something option for it could require more time than we happily thought initially :)

Thanks!

muldjord commented 2 years ago

Yes, I will look into it when I get the time. I am not against adding a .skyscraperignoretree or similar.

sromeroi commented 2 years ago

Thanks a lot :)

muldjord commented 2 years ago

No prob. 3.7.1 now released where you can use the .skyscraperignoretree as suggested above. I've tested it a bit, but not much, so please let me know if you have issues. Thanks. :)
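The difference between the two dotfiles, as described in this thread, can be sketched as follows. This is an illustrative Python model, not Skyscraper's C++ code; the function `included_files` and the demo tree are hypothetical, while the two marker names are the ones mentioned above.

```python
import os
import tempfile

IGNORE = ".skyscraperignore"           # skips only the folder it sits in
IGNORE_TREE = ".skyscraperignoretree"  # skips the folder and everything below it


def included_files(root):
    """Return files under root (relative paths) that neither marker excludes."""
    result = []
    for dirpath, dirnames, filenames in os.walk(root):
        if IGNORE_TREE in filenames:
            dirnames[:] = []  # prune the whole subtree
            continue
        dirnames.sort()
        if IGNORE in filenames:
            continue          # skip this folder only; subfolders are still visited
        for name in sorted(filenames):
            result.append(os.path.relpath(os.path.join(dirpath, name), root))
    return result


def _touch(root, rel):
    """Create an empty file at root/rel, creating parent folders as needed."""
    path = os.path.join(root, *rel.split("/"))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "w").close()


def demo():
    """One folder uses the plain marker, another uses the tree marker."""
    with tempfile.TemporaryDirectory() as root:
        for rel in (
            "1 US - A-E/Axelay.zip",
            "4 Translations/" + IGNORE,
            "4 Translations/top.zip",
            "4 Translations/Spanish/Top Gear.zip",
            "4 Hacks/" + IGNORE_TREE,
            "4 Hacks/Hacks - A-R/x.zip",
        ):
            _touch(root, rel)
        return included_files(root)
```

In the demo, the file directly inside "4 Translations" is skipped but the one in "4 Translations/Spanish" is still listed, while nothing under "4 Hacks" is listed at all.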

sromeroi commented 2 years ago

No prob. 3.7.1 now released where you can use the .skyscraperignoretree as suggested above. I've tested it a bit, but not much, so please let me know if you have issues. Thanks. :)

Hi. I just compiled and tested it. Good and bad news:

Good news: it properly ignored the folders where the dotfile is present (and their subfolders). Game scraping took just 47 seconds for all my US, JAPAN and EUROPE folders, and it scraped info for the new file I just added to them. It tried to scrape a total of 2700 files. About 10000 files in the "ignored folders" were skipped.

Bad news: I CTRL+C'ed the gamelist generation when I noticed that it was also iterating 2700 files and not 2700+10000 = 12700. I'm afraid that the information already scraped and present in cache/ for the 10000 files in the "skipped folders" is not added to the XML files and I lose the screenshots/info of the folders ignored but previously scraped.

My question: If I let it finish the gamelist generation...

a.- will it "update" those 2700 files (and I will have the screenshots/info for 12700 files)? (The expected behavior for this Feature Request)

Or

b.- will it generate an XML file for just those recently scraped 2700 files (and I will lose in retropie the description/screenshots for 10000 files)? (Not expected behaviour).

Thanks.


muldjord commented 2 years ago

Ugh... Yes, you are right. This is entirely related to which files are included overall, thus they will also be removed from the gamelist generation...

So, it will basically only be useful for files that shouldn't be scraped at all, and not your use case, where you wish to skip files that you know it can't scrape and want to exclude from a re-scrape.

EDIT: I think I might have an easy fix. I could simply create a check for whether -s MODULE is set or not. And if it is not, it will not ignore any files. That should solve the problem.

In other words: When you scrape for data, they will be ignored. But when you generate gamelists they will not. It WILL have to go through all of the files obviously when generating the gamelist. That's just how Skyscraper works. But it will be pretty fast as it's working from the cache.
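The idea of honouring the dotfiles only when a scraping module is selected can be modelled like this. It is a Python sketch under assumptions, not the actual 3.7.2 change: `collect_files`, its `scraping_module` parameter (standing in for whether -s MODULE was given) and the demo tree are hypothetical.

```python
import os
import tempfile

IGNORE_TREE = ".skyscraperignoretree"


def collect_files(root, scraping_module=None):
    """List ROM files under root as relative paths.

    The marker is honoured only when a scraping module is given; with no
    module (gamelist generation from the cache) every folder is visited.
    """
    honor_markers = scraping_module is not None
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        if honor_markers and IGNORE_TREE in filenames:
            dirnames[:] = []
            continue
        dirnames.sort()
        for name in sorted(filenames):
            if name != IGNORE_TREE:  # the marker itself is not a ROM
                found.append(os.path.relpath(os.path.join(dirpath, name), root))
    return found


def _touch(root, rel):
    """Create an empty file at root/rel, creating parent folders as needed."""
    path = os.path.join(root, *rel.split("/"))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "w").close()


def demo():
    """Return (files seen when scraping, files seen for the gamelist)."""
    with tempfile.TemporaryDirectory() as root:
        for rel in ("1 US - A-E/Axelay.zip", "4 Hacks/hack.zip",
                    "4 Hacks/" + IGNORE_TREE):
            _touch(root, rel)
        return collect_files(root, "screenscraper"), collect_files(root)
```

With a module set, only the unmarked folder is walked; without one, the marked folder is walked as well, so cached data for its files can still reach the gamelist.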

sromeroi commented 2 years ago

Ugh... Yes, you are right. This is entirely related to which files are included overall, thus they will also be removed from the gamelist generation... So, it will basically only be useful for files that shouldn't be scraped at all, and not your use case.

So, are you thinking of changing this behavior? Because skipping scraping but not gamelist generation was exactly my use-case x'D

Ok, what about this suggestion:

1.- .skyscraperignore and .skyscraperignoretree files to ignore only scraping (without or with subfolders).
2.- .skyscraperskipgamelist and .skyscraperskipgamelisttree files to ignore only gamelist generation (without or with subfolders). (tentative names)

Then check for the specific dotfiles in the proper place: check for 1.- in the scraping file iteration and for 2.- in the gamelist file iteration.

Then all users would see their possible use cases covered:

Best regards

muldjord commented 2 years ago

I think this is getting a little out of hand (over-designed and niche). :D I've just released 3.7.2 where it will ignore them when scraping new data, but not when generating the game lists. I think this is as far as I want to go with this functionality. :)

sromeroi commented 2 years ago

I think this is getting a little out of hand (over-designed and niche). :D I've just released 3.7.2 where it will ignore them when scraping new data, but not when generating the game lists. I think this is as far as I want to go with this functionality. :)

I completely agree, this solves my issue and it's fine that it takes advantage of existing cache data for gamelist generation, so I find it perfect while avoiding requests to sources. It is more than enough and I'm very grateful you addressed my issue and implemented this feature.

Thanks for everything, I hope more people will find it useful.

sromeroi commented 2 years ago

Just to confirm that 3.7.1 is working perfectly:

Files:

Scraping:

Gamelist generation:

Perfect!

muldjord commented 2 years ago

Awesome, thank you for testing and reporting back. Have fun!