mountetna / magma

Data server with friendly data loaders
GNU General Public License v2.0

Add a /load endpoint #84

Closed graft closed 3 years ago

graft commented 5 years ago

Currently Magma has three basic interfaces: /retrieve, /query and /update. The first two query and download data; the last uploads it. The /update endpoint can insert new records, make partial updates, and create associations between records. It also runs validations on all changes. Since it accepts multipart POSTs, it can also facilitate file uploads (though this will change with the integration of Metis).

The main limitation of the /update endpoint is its highly constrained JSON input format, which is often inconvenient to generate or inaccessible as an input format. The purpose of the /load endpoint is to overcome this limitation by allowing data to be loaded using Magma::Loaders, which may operate on any input at all but usually accept a file as an argument.

When Magma is running, any number of project-specific or generic Magma::Loaders may be loaded up. To date these have been run via the command-line, which generally requires administrative access. This is a terrible bottleneck.

The purpose of the /load endpoint is merely to accept inputs for the existing set of Magma::Loaders. Since the /load endpoint is expecting file arguments, it (like the /update endpoint) should accept multipart messages.

A request to the endpoint supplies a { loader } name and a set of named arguments to the specified loader, each of which may be of type string, integer, float, or file.
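As a sketch, the parsed parameters of such a multipart POST might look like the hash below. The loader name, argument names, and the type-check are all hypothetical and purely illustrative, not Magma's actual API:

```ruby
# Hypothetical parsed parameters for a POST to /load.
# All names here are illustrative, not part of Magma's real API.
params = {
  loader: "sample_loader",         # which Magma::Loader to run
  project_name: "labors",          # string argument
  max_records: 500,                # integer argument
  threshold: 0.75,                 # float argument
  data_file: "/tmp/upload-1234"    # file argument (temp path after multipart upload)
}

# A simple type check the endpoint might perform before queuing the task.
TYPES = {
  loader: String, project_name: String, max_records: Integer,
  threshold: Float, data_file: String
}.freeze

valid = params.all? { |key, value| TYPES[key] && value.is_a?(TYPES[key]) }
```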

The result of posting to the endpoint is that a loading task is created, to be handled by a loader daemon. This daemon takes each loading task in turn and runs it from disk, the same as command-line use of the loader. If the load fails, the daemon reports the failure in a user-accessible log. If the load succeeds, the records are updated and the daemon reports success in the same log.
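The daemon loop described above could be sketched as follows. The queue directory, log file, and task file format are assumptions for illustration, not Magma's actual layout:

```ruby
require "json"
require "fileutils"

# Minimal sketch of a loader daemon pass: take each queued task in turn,
# run it, and report success or failure to a user-accessible log.
# Directory and file names here are assumptions, not Magma's real paths.
QUEUE_DIR = "load_tasks"
LOG_FILE  = "load.log"

def run_pending_tasks
  Dir.glob(File.join(QUEUE_DIR, "*.json")).sort.each do |task_file|
    task = JSON.parse(File.read(task_file))
    begin
      # Placeholder for invoking the named Magma::Loader with the task's
      # arguments, the same as the command-line use of the loader.
      # run_loader(task["loader"], task["arguments"])
      File.open(LOG_FILE, "a") { |log| log.puts "#{task["loader"]}: success" }
    rescue => e
      File.open(LOG_FILE, "a") { |log| log.puts "#{task["loader"]}: failed - #{e.message}" }
    ensure
      File.delete(task_file)  # the task has been handled either way
    end
  end
end
```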

In order to make this change, the Loaders themselves will need a better-defined interface. Currently it is up to the Loader's initializer method to define the interface; however, for the /load endpoint to accept the proper inputs, it must know argument names and types. Consequently, the existing loaders will have to be amended to include such an interface, or else be unavailable to /load.
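One hypothetical way a loader could declare its argument names and types (a small class-level DSL) is sketched below. This is illustrative only; Magma's actual loaders do not define such a declaration:

```ruby
# Hypothetical DSL letting a loader declare its argument names and types,
# so the /load endpoint can validate inputs before queuing a task.
# Not part of Magma's real loader interface.
class BaseLoader
  def self.argument(name, type)
    arguments[name] = type
  end

  def self.arguments
    @arguments ||= {}   # per-subclass registry of declared arguments
  end
end

class SampleLoader < BaseLoader
  argument :project_name, :string
  argument :manifest,     :file
  argument :batch_size,   :integer
end
```

With declarations like these, the endpoint could reject a request whose arguments do not match the named loader's registry before any task is queued.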

graft commented 5 years ago

I don't think it's possible to write this endpoint so that loads run during the request cycle - although this is currently what happens with loaders that run as hooks in a model via the /update endpoint. (That hook should probably be undone, at least within the /update request cycle, and some sort of transition made to using /load exclusively to run loaders.)

However, some load operations can consume a lot of system resources (chiefly memory). Large loads might also require database indexing and other slow operations. As a result it can sometimes take minutes or hours to run a single load operation. This is a lot to ask of a single request. In addition, it could mean a lot of competition from concurrent requests.

A load queue solves this problem and means that load requests can be processed in an orderly and, hopefully, efficient fashion. However, this seems to require a lot of messaging, so perhaps this endpoint has to wait until we have a messaging service in Polyphemus.

graft commented 5 years ago

I will write an initial version of this that does not rely on a messaging system; instead we can store load requests as .json files on disk. This avoids more complex data store requirements for the moment. The loader task can read and update these .json files - in the future it can replace the file store with a message endpoint without too much trouble.
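Storing load requests as .json files on disk might look like the sketch below. The directory, field names, and status values are assumptions for illustration; the point is that the loader task can read and update these files, and the file store can later be swapped for a message endpoint:

```ruby
require "json"
require "fileutils"
require "securerandom"

# Sketch of a disk-backed store for load requests, avoiding any message
# queue for now. Paths, field names, and statuses are illustrative only.
REQUESTS_DIR = "load_requests"

# Persist a new load request as a .json file and return its id.
def create_load_request(loader, arguments)
  FileUtils.mkdir_p(REQUESTS_DIR)
  id = SecureRandom.uuid
  request = { "id" => id, "loader" => loader,
              "arguments" => arguments, "status" => "pending" }
  File.write(File.join(REQUESTS_DIR, "#{id}.json"), JSON.generate(request))
  id
end

# The loader task updates the same file as it makes progress.
def update_status(id, status)
  path = File.join(REQUESTS_DIR, "#{id}.json")
  request = JSON.parse(File.read(path))
  request["status"] = status
  File.write(path, JSON.generate(request))
end
```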

The other question is where files will be stored. For the moment, again, this can be on disk. In the future, rather than accepting the actual file, the load request should accept a request to PUT a file and give back a place to keep it on Metis.

With these two simplifications a loading endpoint should be relatively simple to write; it would work more or less the same way the current process does.

graft commented 5 years ago

I made a new LoadRequest model to hold current requests, which contains message, status, loader, project_name and arguments.

Still to do:

1) support file arguments and persist them on disk
2) write LoadRequest#execute! which runs the task
3) write LoadRequest#cleanup! which removes any files and archives the request
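A plain-Ruby mock-up of the LoadRequest model described above might look like this. The real model is presumably database-backed; this sketch only illustrates the fields mentioned (message, status, loader, project_name, arguments) plus the two methods still to be written, with statuses and internals that are assumptions:

```ruby
require "fileutils"

# Illustrative mock-up of the LoadRequest model; not Magma's real
# implementation. Status values and the files accessor are assumptions.
class LoadRequest
  attr_accessor :message, :status, :loader, :project_name, :arguments, :files

  def initialize(loader:, project_name:, arguments:, files: [])
    @loader       = loader
    @project_name = project_name
    @arguments    = arguments
    @files        = files       # paths of file arguments persisted on disk
    @status       = "pending"
    @message      = ""
  end

  # To-do item 2: run the loader task.
  def execute!
    @status = "running"
    # Placeholder for invoking the named Magma::Loader with @arguments.
    @status  = "complete"
    @message = "Loaded successfully"
  rescue => e
    @status  = "failed"
    @message = e.message
  end

  # To-do item 3: remove any persisted files and archive the request.
  def cleanup!
    @files.each { |path| FileUtils.rm_f(path) }
    @status = "archived"
  end
end
```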

graft commented 3 years ago

We're avoiding backend loaders, which can circumvent all sorts of validation requirements, in favor of a better /update API.