muldjord / skyscraper

Powerful and versatile game scraper written in c++
GNU General Public License v3.0
474 stars 127 forks source link

Add option to ignore pattern matched files [DONE] #188

Closed timothybrown closed 4 years ago

timothybrown commented 4 years ago

While scraping my library, I've run into an issue where some of my BIOS files are getting recognized as games. All the BIOS filenames start with a [BIOS] tag, so a CLI option that lets Skyscraper ignore any file matching a certain pattern: --ignore=[BIOS]*.* would be nice.

I know I could simply remove these BIOS files from the platform directories as they're not used by the emulators and EmulationStation ignores them, however they are part of the complete romsets, which means ClrMAME Pro will complain if they're not there.

Right now I've created a fish script that moves all BIOS files from each roms/$platform directory into a temporary location, scrapes all the platforms and then moves them back. It's more of a "duct tape" solution to the problem, so an actual option in Skyscraper would be nice. That said, consider it low priority.

muldjord commented 4 years ago

Yes, this makes sense, I'll add it to the roadmap. Thanks for the suggestion.

timothybrown commented 4 years ago

No problem, thanks for all the hard work you’ve put into the program!

Also, I don’t think it really needs to support regex or anything super complicated. So long as it can support multiple wildcards it should be sufficient. As another example, I noticed I have some C64 files that aren’t game and thus don’t have entries in any of the databases; these filenames have helpful tags Generic Word Processor (C128-Program) (NTSC).7z, so the ability to have an ignore entry with multiple wildcards like *(C128-Program)* would be super nice.

I can’t think of a situation that a simple wildcard based pattern matching system like this wouldn’t cover.

muldjord commented 4 years ago

Btw, I'm curious at to how ES acts in this regard. Ignoring these files would make it so they don't appear in the gamelist.xml. But I think ES actually shows them anyway (I'm not 100% sure about this), just without info. So they might need to be masked through ES as well.

timothybrown commented 4 years ago

Yes, it's my understanding that, by default, ES shows name only entries for all items. Then it parses the gamelist.xml file; the latter entries take precedence over the former.

This behavior can be changes with two command line switches:

--gamelist-only     - only display games defined in a gamelist.xml file.
--ignore-gamelist   - do not parse any gamelist.xml files.

I assume there's a way to have ES ignore specific files or patterns, I need to do a bit more research and figure out where it's stored. (My assumption is based on the fact that ES ignores files with the [BIOS] tag, unless the file is explicitly defined in the gamelist.xml. Unless of course this behavior is hardwired into ES.)

Either way, having the non-scraped files show up in ES isn't really that big of a deal for me. The main goal behind not scraping them is to speed up the process and not put as much strain on the servers I'm scraping from (no point in scraping files that I know won't be in the database).

muldjord commented 4 years ago

I've added both excludeFiles and includeFiles options to the available config.ini options now. They both take basic asterisk masks like *[BIOS]* and work as you would expect.

I've also added the accompanying --excludefiles "MASK" and --includefiles "MASK" command-line options (double-quotes around the mask are important to avoid weird behaviour).

It will all be in 3.3.9. It's already on master if you want to test it out. I've only tested it very slightly, but it seems to work as expected.

It actually expands to a regexp inside of Skyscraper, but I've decided against simply allowing full regexp's as they are needlessly complicated for this task (and hard to understand for the uninitiated users). So it only works with simple asterisk masks.

EDIT: Would it make sense to allow for multi-masks separated by semicolons? Like excludeFiles="*[BIOS]*;somethingelse*"?

timothybrown commented 4 years ago

Awesome, I’ll pull from master tonight and test it out.

Yes, I think it would make sense to allow multiple files to be specified, but I would use commas instead of semi-colons (because that’s traditionally how most *NIX based command line apps handle separating file names).

I would also make it so each mask needs to be in its own set of quotes, that way you could use commas as part of the mask, if needed. Like this: excludeFiles="*[BIOS]*","*(Program)*",file_1.ext,file_2.ext

(MacOS allows commas in file names, for example.)

muldjord commented 4 years ago

There are more issues to this, I'm investigating how to do this optimally. And I agree, I'd prefer commas myself but they are giving issues since they are allowed characters in filenames (and quite often used actually). Basically my command-line parser and config parser handles this differently so I'll need to ponder this a bit.

muldjord commented 4 years ago

There, I've implemented it with comma. In cases where a filename has a comma it has to be individually escaped so "*, the","Other*" becomes "*\, the,Other*" due to how bash and my config parser handles double-quotes.

timothybrown commented 4 years ago

It’s a shame you have to resort to escaping the commas inside double quotes, as that’s non-standard behavior (because things in double quotes generally don’t need to be escaped on the command line); however, looking at your parser code I totally understand why it’s needed. As long as it’s explicitly spelled out in the docs I don’t think it’ll be an issue.

I appreciate you taking the time to implement this so quickly.

(As an aside, despite spending 10 years writing C/CPP software, I primarily code in Python these days and it’s really spoiled me in terms of built-in features, so when I make a suggestion like I did in the previous post, I’m thinking about it from a Python perspective. Afterwards I remember this is CPP and there may be significantly more low level implementation details to take care of it. So I totally understand the constraints you’re working with.)

I’ll give this a test tonight and report back!

muldjord commented 4 years ago

It’s a shame you have to resort to escaping the commas inside double quotes, as that’s non-standard behavior (because things in double quotes generally don’t need to be escaped on the command line)

I completely agree, it's a sub-optimal solution, but I'm leaning towards accepting it since I believe using commas in the patterns might be edge cases.