mmatyas / pegasus-frontend

A cross platform, customizable graphical frontend for launching emulators and managing your game collection.
http://pegasus-frontend.org

Add scraper plugin support #25

Open mmatyas opened 7 years ago

mmatyas commented 7 years ago

"Ideally, I'd expect a scraper plugin to work and communicate with the frontend asynchronously, using a well defined API for reporting the current state and progress, or to allow cancelling the scraping. This would allow me to create a nice progress bar / scraping progress interface, or send the scraping it to the background while the user can continue playing, etc.

This would probably require significant changes in most scrapers, so I'd also add a regular "start this program and wait for it to end" kind of launching too."

sselph commented 7 years ago

I'm starting to think about this more and was going to try to write up an API. I have a couple of general questions.

How do you envision the communication working? You mention the frontend talking with the scraper plugin. Is the plugin the actual scraping code, or is it a part of the Pegasus backend that communicates with the scraping code and provides an API to update the frontend?

Who writes the information to disk in the async API? I.e., does the scraper side of the API handle all the writing of media to disk and updating of metadata, or does it send back the data and Pegasus writes it to the correct location?

Do you envision a more manual scraping mode, searching by name for things that failed auto-scraping?

mmatyas commented 7 years ago

The communication between the frontend and the scraper should happen through the backend; the backend will provide interfaces that plugin developers will have to implement, and it will forward the current status to the frontend. This would allow running the scraping process in the background, even during gameplay (if the user wants to), and would keep the concerns of the plugins and the UI separate.
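
As a minimal sketch of this separation, with entirely hypothetical names (none of this is a settled API): plugins would implement one interface, while the backend implements the reporting side and forwards whatever the plugin reports to the frontend.

    #include <string>

    struct StatusSink {
        virtual ~StatusSink() = default;
        // Implemented by the backend; forwarded to the frontend UI.
        virtual void onProgress(int done, int total) = 0;
        virtual void onError(const std::string& message) = 0;
    };

    struct ScraperPlugin {
        virtual ~ScraperPlugin() = default;
        // Start scraping asynchronously, reporting through `sink`;
        // returns immediately so the UI (or a running game) is not blocked.
        virtual void startScraping(StatusSink& sink) = 0;
        // Request cancellation of any scraping still in flight.
        virtual void cancel() = 0;
    };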

The disk writing is an interesting question; it'd be useful to provide such common methods (one less thing to implement), but I think a plugin will have better knowledge about the files it's working with and what it wants to do with them, especially if it does some advanced multithreading, locking, buffered I/O, network connection management, etc.

By manual scraping, do you mean running the scraping plugin on individual ROMs, potentially modifying certain parameters or providing hints, like changing the searched file name or supplying a game title/developer/etc.? I haven't planned that yet, but it sounds nice!

mmatyas commented 7 years ago

Once upon a time, I wrote a multithreaded, asynchronous scraper that could be used as a library/plugin, but stopped working on it due to lack of time. It had the following API:

    /// Set the ROM directory, which will be scanned recursively by the scraper,
    /// searching for known file types. This is a required parameter.
    Launcher& setRomDir(fs::path);
    /// Set the output directory, where the database file(s) and the downloaded
    /// assets will be stored. Currently this is a required parameter.
    Launcher& setOutDir(fs::path);
    /// Enable detailed logging and save the log messages to the given file.
    Launcher& setLogPath(fs::path);

    /// Enable OpenVGDB support, using the provided database file.
    Launcher& enableOVGDB(fs::path);

    /// Set the function to call when the scraper detects a known file type,
    /// but has not yet started processing it.
    Launcher& setOnTaskAdded(std::function<void(const fs::path&)>);
    /// Set the function to call when the scraper starts processing a
    /// previously detected file.
    Launcher& setOnTaskStarted(std::function<void(const fs::path&)>);
    /// Set the function to call when the currently processed file may have updated
    /// its related GameData object. The callback may receive a status message too.
    Launcher& setOnTaskUpdated(std::function<void(const GameData&, std::string)>);
    /// Set the function to call when an error has occurred while processing a file.
    Launcher& setOnTaskFailed(std::function<void(const GameData&, std::string)>);
    /// Set the function to call when the scraper has successfully
    /// completed processing a file.
    Launcher& setOnTaskCompleted(std::function<void(const GameData&)>);
    /// Set the function to call when the scraper has finished processing all
    /// previously detected files, but may not have finished downloading all assets.
    Launcher& setOnProcessingFinished(std::function<void()>);
    /// Set the function to call when the scraper has finished all file processing
    /// and all pending downloads, and will start writing the database files.
    Launcher& setOnDownloadsFinished(std::function<void()>);
    /// Set the function to call when the scraper has finished all pending operations.
    Launcher& setOnOutputFinished(std::function<void()>);

    /// Start the scraping process with the previously set parameters. If there is
    /// a problem with the parameters, it will throw an `std::runtime_error`.
    void launch();

I was able to monitor the processing state of the individual files (e.g. hashing, content detection, whatever) and the asset downloading separately; I then added a console progress bar and it worked nicely.
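
For illustration, a hypothetical caller of this API could wire the callbacks to a simple console progress counter like this (the paths and the printing below are made up, not part of the original library):

    #include <cstdio>

    int main() {
        int total = 0;
        int done = 0;

        Launcher launcher;
        launcher.setRomDir("/home/user/roms")
                .setOutDir("/home/user/scraped")
                .setOnTaskAdded([&](const fs::path&) { ++total; })
                .setOnTaskCompleted([&](const GameData&) {
                    ++done;
                    std::printf("\rscraping: %d/%d", done, total);
                })
                .setOnOutputFinished([] { std::puts("\ndone"); });
        launcher.launch();
    }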

I'm just showing this as an example and a starting point; feel free to deviate from it or share your ideas.

sselph commented 7 years ago

I guess it depends on whether we go the library route for the scraper code or keep the scraper a separate binary. A library is not difficult for me to create, but cross-compiling one is. If I keep the binary separate, I can cross-compile to almost everything, but then things like starting/stopping the scraper become more complicated. I guess there could be a scraper plugin interface, and things could either be integrated directly or go through a shim library that converts between the library interface and the actual scraper binary interface.

I normally write APIs as protobufs, so I just started writing up my thoughts as a streaming gRPC service, but the approach seems similar. I assumed the caller knows how many files it expects to be scraped. For writes I split the difference: the scraper writes the media to disk, but returns the metadata and the paths to the media so the backend can store them in its database. That way Pegasus could use gamelist.xml or whatever it wanted without requiring changes. Pegasus could also reorganize the media without burdening the scraper with a ton of options. https://docs.google.com/document/d/1DitnZAjdy44SRe9YrQ3G7wVRbcuT4mwiCrbp3Xg0SpI/edit?usp=sharing
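
To make the write split concrete, the per-game result under this scheme might look something like the struct below; the field names are illustrative only, not the actual proto in the linked doc.

    #include <map>
    #include <string>

    // The scraper has already written the media files to disk; Pegasus only
    // receives metadata plus the paths, and stores them however it likes
    // (gamelist.xml or its own database).
    struct ScrapeResult {
        std::string rom_path;                      // the input file
        std::string title;
        std::string developer;
        std::string release_date;
        std::map<std::string, std::string> media;  // asset kind -> path on disk,
                                                   // e.g. "boxart" -> ".../x.png"
    };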

For manual scraping, that is exactly what I meant. It is very UI-heavy, so it could come later.

mmatyas commented 7 years ago

Qt uses the "implementation of a virtual C++ base class" kind of plugin system, with some additional Qt-specific metadata macros; you can find the details here (the low-level API) and an example here. We'll likely need some wrapper code for that; it could be lightweight or take on heavier tasks, whichever you prefer. If it's a separate project/file, it could also be maintained and rebuilt separately, e.g. when the Qt version changes.
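
Roughly, the boilerplate for such a plugin could look like this; the interface name and the IID string here are placeholders, not a final design:

    #include <QObject>
    #include <QtPlugin>

    // The interface the backend would load plugins against.
    class ScraperInterface {
    public:
        virtual ~ScraperInterface() = default;
        virtual void startScraping() = 0;
    };
    #define ScraperInterface_iid "org.pegasus-frontend.ScraperInterface/1.0"
    Q_DECLARE_INTERFACE(ScraperInterface, ScraperInterface_iid)

    // A plugin derives from QObject and the interface, and attaches the
    // Qt metadata so QPluginLoader can find and instantiate it.
    class MyScraperPlugin : public QObject, public ScraperInterface {
        Q_OBJECT
        Q_PLUGIN_METADATA(IID ScraperInterface_iid)
        Q_INTERFACES(ScraperInterface)
    public:
        void startScraping() override { /* ... */ }
    };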

I think whether we use callbacks, function objects, message channels or queues, and whether we type onEvent() or emit event(), is just a matter of style; the differences could be handled in the shim lib. For the plugin interface, Qt's signal/slot mechanism would be the natural choice, but I'll definitely have to try that out on a test plugin first.
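
With signals/slots, the reporting side of the same idea might look like this (again with made-up names); the backend would simply connect to these signals and re-emit the data toward the UI:

    #include <QObject>
    #include <QString>

    class ScraperSignals : public QObject {
        Q_OBJECT
    signals:
        void taskCompleted(const QString& file);
        void taskFailed(const QString& file, const QString& reason);
    };

    // In the backend, e.g.:
    // QObject::connect(scraper, &ScraperSignals::taskCompleted,
    //                  backend, &Backend::onScraperTaskCompleted);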

Yes, I agree that Pegasus should provide the list of files; it already has knowledge of the systems and games, so that shouldn't be the job of the scraper; thanks for pointing that out. It will likely make the per-file scraping call a lot easier too.

I thought cross-compiling was easy with Go? If you can create a binary to call, or a static library to link into the plugin, it should work fine in theory. Correct me if I'm wrong; I've never used Go before.

sselph commented 7 years ago

One thing to keep in mind when providing the list of files is the bin/cue or bin/gdi type of file. I parse every cue/gdi, extract the list of bins, and then scrape them together. If Pegasus provides the list of files, then having it group these as a single object would be good. This type of logic could also come in handy in the UI, i.e. show just the cue if there are associated bins, but allow bins if they don't have a cue sheet; see the sketch below.
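
A rough sketch of that grouping, assuming a hypothetical helper and only the common `FILE "..."` cue-sheet syntax (real cue/gdi parsing has more cases than this):

    #include <filesystem>
    #include <fstream>
    #include <regex>
    #include <string>
    #include <vector>

    namespace fs = std::filesystem;

    // Collect the cue sheet and every bin it references into one group,
    // so the frontend/scraper can treat them as a single game entry.
    std::vector<fs::path> cueSheetGroup(const fs::path& cue_path) {
        std::vector<fs::path> group { cue_path };
        static const std::regex file_line(R"rx(^\s*FILE\s+"([^"]+)")rx",
                                          std::regex::icase);
        std::ifstream cue(cue_path);
        std::string line;
        std::smatch match;
        while (std::getline(cue, line)) {
            if (std::regex_search(line, match, file_line))
                group.push_back(cue_path.parent_path() / match[1].str());
        }
        return group;
    }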

You are right that cross-compiling pure Go is really easy. I can compile a binary for every GOOS and GOARCH listed here: https://golang.org/doc/install/source#environment. The issue comes in when compiling to a C library: I think that invokes cgo and leads to a lot of limitations, because it relies on a local C compiler (gcc).

traffisco commented 1 year ago

A good start would be a script event emitted after "reload games", so it would be possible to "manually move assets around".