Closed NateWr closed 3 years ago
Is there a way to use Generators here to prevent fromRow from getting called until it's used in Collection?
I think maybe this is how it could be done?
/**
* Get a collection of contexts matching the configured query
*/
public static function getMany(Collector $query) : Collection
{
$rows = $query
->getQueryBuilder()
->select(['c.*'])
->get();
return Collection::make(function() use ($rows) {
$rows->each(function($row) {
yield static::fromRow($row);
});
});
}
@NateWr, with apologies for the delay in getting to this -- it looks very good and I'm happy to be adopting LazyCollections.
One thought, though: implementing mappings directly in the collection class is going to lead to a double-standard where there are "first class" mappings (e.g. to/from the API) implemented in the collection class, and the rest, e.g. mappings added by import/export plugins.
Our current method of getting around that is to implement each mapping as a separate class -- as in the filter framework for the native import/export plugin. There are two problems with that approach:
What do you think about an injection type of approach? Something like
$publicationCollection->addMapping(MAP_TO_SCHEMA, function($publication) {
...
});
Then mappings can be added by the application (e.g. mapping to/from the schema) or post-hoc (e.g. by a plugin) or even augmented by a plugin through a new (overriding) function that passes through to the old one internally.
Several of these mapping additions can be handled within the same chunk of code; no need to break them out to their own class, though it's still possible.
I can see the benefit of this for plugins to extend a particular mapping, but I'm worried about what we lose with the added layer of abstraction. A couple of concerns:
$collection->addMapping(TYPE, $callback)
, how would we define the dependencies that the map needs? This is a broader problem, in that the dependencies change based on which properties we are mapping. My approach above, to pass in the $props
, won't be great for this either.My initial instinct was to build a special mapping base class to support extensions through callbacks like this. But once I had done that and looked at making use of it, I realized that we don't really reuse maps very often. We map to JSON for our REST API, and we map to CSV once or twice. But a single mapping tool can't really handle both, because the JSON maps to multilingual object but the CSV collapses to a single language. Mappings seem to be kind of one-off things.
The only place we really have code overlap is between summary/full mappings of the schema and when mapping dependent objects requires fetching data. Putting this code into a Collection
class would allow us to quickly pull duplicated code into protected helper functions like this:
namespace PKP\Submission;
class Collection extends LazyCollection
{
public function mapToSchemaSummary(Request $request, Context $submissionContext) : Enumerable
{
return $this->map(function($submission) use ($request, $submissionContext) {
return $this->_mapItemToSchemaSummary($request, $submissionContext);
});
}
// Note: $authorUserGroups only needed for full schema mapping
public function mapToSchema(Request $request, Context $submissionContext, array $authorUserGroups) : Enumerable
{
return $this->map(function($submission) use ($request, $submissionContext) {
$item = $this->_mapItemToSchemaSummary($request, $submissionContext);
$props = Services::get('schema')->getFullProps(SCHEMA_SUBMISSION);
foreach ($props as $prop) {
switch($prop) {
case 'stageAssignments':
$props[$prop] = [];
$assignments = $this->_getStageAssignments($submission);
foreach ($assignments as $assignment) {
$props[$prop][] = [
'id' => $assignment->getId(),
'userId' => $assignment->getUserId(),
];
}
break;
...
}
}
return $item;
});
}
public function mapToCSV() : Enumerable
{
return $this-map(function($submission) {
$assignments = $this->_getStageAssignments($submission);
$row = [
'id' => $submission->getId(),
];
$row['assignments'] = join(',', array_map(function($assignment) {
return $assignment->getUserId();
}, $assignments));
});
}
protected function _mapItemToSchemaSummary(Request $request, Context $submissionContext) : Array
{
$item = [];
$props = Services::get('schema')->getSummaryProps(SCHEMA_SUBMISSION);
foreach ($props as $prop) {
switch($prop) {
case '_href':
$props[$prop] = $request->url(...);
break;
...
}
}
return $item;
}
protected function _getStageAssignments(Submission $submission) : Array
{
return DAORegistry::getDAO('StageAssignmentsDAO')->getBySubmissionId($submission->getId());
}
}
This makes the mappings very simple, but we can share code however we need as well as optimize it when necessary. However, it doesn't yet have a good way for plugins to extend the mappings.
Ideally, the schema system should do this for us. Plugins can extend the schema to add properties to objects. This will cover the majority of cases. However, it presumes that a plugin is extending the schema to add a property that will be saved in the database. If someone wanted to, for example, attach data to the submission API endpoint that wasn't stored in the submission_settings
table, that would be more complicated.
The challenge, which I don't have a good solution to yet, is how to let plugins hook in here. We can use a hook for each iteration, which would make it easy for plugins to attach something to it. But we need to add the hook to every map and it only permits the plugin to hook in during the loop. If a plugin is fetching data that isn't part of each object, that's probably not the most performant way to get that data. Ideally, the plugin would hook in at the collection level, fetching data for all of the collection's objects at once, rather than the per-item approach in the code above.
That makes me think that we probably want these hooks outside of the mapping itself. It may not be great for a very large collection, but it would be best for our most common use cases -- extending the API. If we are mapping large sets of objects, we probably want to split the collection into chunks anyway. And if I'm a plugin extending a mapping, I may want to extend a mapping in one instance (eg - export to CSV) and not another (eg - REST API response).
Not sure if all of that is clear at all, but that's where my thinking has gone as I've tried to work out the mapping stuff. To summarize, I think the key things I've come to feel are;
...how would we define the dependencies that the map needs?
I'm not sure the best way, but we introduced a Deployment
class for the native import/export plugin which describes overarching characteristics of the import/export process being performed -- for example the configuration, what items had already been imported (so they could be destroyed in case of failure), etc. That approach might be useful in defining the characteristics of the desired mapping, i.e. is it a deep or shallow mapping.
When would a plugin be able to add a mapping? The plugin won't intervene every time a collection is instantiated...
Sorry, this was confused by my example code above -- we'd probably need to introduce a collection factory in order to accomplish this. Then, a plugin (and the core code would do it the same way) could, for example, introduce a mapping from submission to JATS document...
class JatsMappingPlugin extends Plugin {
const MAP_TO_JATS = 'jats';
function register(...) {
$submissionCollectionFactory = CollectionFactory::get('submission');
$submissionCollectionFactory->addMapping(self::MAP_TO_JATS, function(Submission $submission, $existingMappingFunction) {
assert($existingMappingFunction === null); // This should be the first-loaded submission to JATS mapping
$dom = new DOMDocument();
$dom->appendChild(...); // Populate a DOM to represent the submission
return $dom;
});
}
}
Within the same class, we can also define mappings from $publication
to JATS and register them the same way. Then the submission to JATS mapping could call on the publication to JATS mapping to decompose the operation in a way that can be both used internally and externally.
(The details of the factory etc. are very much up in the air -- just an example.)
When applying a mapping we would use a combination of a constant describing the target mapping (MAP_TO_JATS
) and the type hinting on the transformation function's first parameter (Submission
) to identify which function to call. When adding a mapping to an existing mapping (e.g. a second call to addMapping
with MAP_TO_JATS
), the new mapping gets passed the old mapping so that it can call it.
For example, a second plugin that needed to augment the mapping (and is registered after the first using the plugin loading priority tools) can similarly:
class AugmentExistingJatsMappingPlugin extends Plugin {
function register(...) {
$submissionCollectionFactory = CollectionFactory::get('submission');
$submissionCollectionFactory->addMapping(JatsMappingPlugin::MAP_TO_JATS, function(Submission $submission, $existingMappingFunction) {
// The $existingMappingFunction should now point to the function implemented in the first function. Use it.
$dom = $existingMappingFunction($submission);
$dom->appendChild(...); // Augment the existing DOM with whatever enrichment this plugin wants to add
return $dom;
});
}
}
Some use cases for this:
Ok, I can see where you're going with this. It's similar to how a request handler passes the request/response through middleware. My two main concerns are performance and abstraction. I think the use-cases I've been focused on are for the REST API, where we really need to be able to optimize the queries involved in mapping a collection to a response. But in the use-cases you raise (JATS, OAI, import/export), performance is not as important as making the output as comprehensive as possible.
On the abstraction side, the big appeal of methods attached to a LazyCollection
is that there's very little mental overhead. We can be as simple as we want for simple objects (announcements) and grow in complexity with complex ones (submissions). But I think you've made the case that -- at least for submissions -- we're going to need robust support for plugin extensibility and a lot of different maps (I'd add Crossref/Datacite and CSV to the use-cases you listed).
Let me play around with this a bit more. At the moment I'm leaning towards having a class for each map type (JSON, CSV, JATS), but let me see how that works out.
@asmecher I think I've got a good approach now based on your last post.
A global facade/factory is used to instantiate maps (see here).
$map = Map::announcementToSchema();
$items = $map->map($announcements, $context, $request);
A single map object can be extended for one-off use-cases:
$map = Map::announcementToSchema();
$map->extend(function(array $output, Announcement $item, PKP\Announcement\Maps\Schema $map) {
$output['example'] = 'Example';
return $output;
})
->map($announcements, $context, $request);
Or a plugin can register an extension on the Map
facade/factory:
Map::extend(PKP\Announcement\Maps\Schema::class, function function(array $output, Announcement $item, PKP\Announcement\Maps\Schema $map) {
$output['example'] = 'Example';
return $output;
});
// Extension will be applied to every map instantiated through the facade/factory
$map = Map::announcementToSchema();
Each mapping is its own class. To test out this approach, I wrote examples to map announcements to the Schema
and OAI
(I know announcements aren't in OAI, just an example to check if XML construction would work).
summarize
and map
methods. See how it is used in the API for returning a collection and one announcement.map
method. I wrote a quick example to test how it would be used here.One potential downside is, because we rely on the map facade to apply extensions, a coder could accidentally use the map directly and miss out on the extensions. We'll have to be careful to reinforce usage of the facade to instantiate a map.
Does this look like a good way forward? Should we call this a Factory instead of a Facade? I'll be honest I don't have a really clear idea of what a Facade is, other than a class that provides a simple API to access something.
Looking good, @NateWr -- will this require us to implement a class for each combination of mapping and entity to be mapped? If so, what do you think about setting up mappings programatically via lambda functions rather than implementing them in classes?
That way a coder can choose to implement an entire (probably simple) set of mappings in a single class -- think e.g. of something that generates ATOM entries for a news feed -- rather than requiring an entire class for each. Or they can use a class if they like, setting it up using member functions.
Not wedded at all to this approach -- just trying to think of ways to avoid a huge set of classes to implement a relatively simple transformation (the native import/export plugin is a good example of a transformation being hard to maintain with so many classes).
I'm also wondering if it's worth reducing the reliance on static calls (Map::
) with functions named for the transformation they implement (announcementToSchema
); unless we use function call magic (__call
) it would be hard to keep these extensible. (Unless I'm misunderstanding the approach you're taking -- I don't see announcementToSchema
implemented on the Map
class, so maybe you are intending to use magic calls?)
will this require us to implement a class for each combination of mapping and entity to be mapped?
Yeah, it will.
If so, what do you think about setting up mappings programatically via lambda functions rather than implementing them in classes?
How would we define the relationship between a map and its dependencies with lambda functions? The main benefit of defining classes and methods like this is that when using them, an IDE can identify what mapping methods are available and what dependencies are expected, with useful type hinting like this:
And because the map is a class, we can pass it back to any plugin which is extending the map, so they have access to the dependencies too. (For example, this is required to get the root XML document when extending an XML map).
If there's a way to do all this with the lambda functions, I'd be happy to explore that.
just trying to think of ways to avoid a huge set of classes to implement a relatively simple transformation
I think this is kind of The Way™ with modern PHP these days. It is jarring at first, but with namespacing and autoload, the classes become a browse-able tree structure that makes the approach more manageable.
You're right, though, that there will be a lot of small entities with very simple maps. I think we will find our way to a base map class, something like the SchemaDAO
, which handles the basic mappings on its own.
I'm also wondering if it's worth reducing the reliance on static calls (
Map::
) with functions named for the transformation they implement (announcementToSchema
)
This mimics the approach I've taken with the Query
/Command
facades. These facades abstract away the ojs/pkp-lib differences, so we can make one call, Map::submissionToSchema()
, and get the appropriate class regardless of the application. We won't have use PKP\Submissions\Query;
statements strewn through the codebase that need to be updated to APP\Submissions\Query
whenever the class diverges between apps.
The other thing they do is provide some useful discovery in an IDE for new developers. I can discover what maps are available to me by exploring the facade:
I'm definitely open to alternatives here. We could shift the factory down into subclasses, so it is like SubmissionMap::toSchema()
. Or Map::submission()->toSchema()
. But I think we'll just be multiplying the classes here without much benefit. The facade itself is pretty clean and so I'd hope it would be pretty easy to maintain.
(Unless I'm misunderstanding the approach you're taking -- I don't see
announcementToSchema
implemented on theMap
class, so maybe you are intending to use magic calls?)
No, so a method like Map::announcementToSchema()
only returns the class which can perform this mapping. That class can itself have multiple mappings. For example, each schema map is likely to have a summary/full map:
$summaries = Map::announcementToSchema()->summary($collection, ...);
$fullRecords = Map::announcementToSchema()->map($collection, ...);
The global Map
factory/facade is just a convenient way to instantiate a map for the correct application with the correct extensions.
I didn't read this thoroughly yet, so I just wanted to leave the link of a library that I use in C#, perhaps there's something useful in the documentation if you're still looking for ideas: https://docs.automapper.org
Thanks @jonasraoni. It looks like there is a PHP library, automapper-plus, based on the C# library.
At first glance, my main hesitation would be stepping away from the JSON schema to define entities. The big benefit we get from this is solid documentation of the REST API out-of-the-box. It may be possible to generate this from the Dto
classes instead, but I'm not immediately sure how that would work.
The other thing I didn't see in the docs is how to map to something that isn't clearly represented by a PHP object. With JSON it's pretty easy to run json_encode()
on a flat PHP object. But what about mapping to XML or CSV? Do you know how that would work?
I think we can borrow some of the semantics of the automapper-plus library without necessarily using it directly -- the conventions for configuring mappings are essentially a better-fleshed-out version of what Nate and I were talking about above, with less dependency on static functions (which are hard to mock/test/extend). We won't necessarily have PHP objects with data represented by properties -- I suppose we could use e.g. __get
magic functions to make things transparent but if we're working towards using Laravel conventions I'd want to make sure this would be a compatible approach.
the conventions for configuring mappings are essentially a better-fleshed-out version of what Nate and I were talking about above, with less dependency on static functions (which are hard to mock/test/extend)
I can't see how to use this convention while still being able to identify dependencies for the map. If we are able to do something like:
$mapper->map($submission, SubmissionSchemaDto::class);
Where would we identify that the mapper requires the Request
object to map the URLs and the context's UserGroups
to render the author string?
Another issue is that automapper works seems to put the bulk of the "work" into the config object itself:
$config
->registerMapping(Employee::class, EmployeeDto::class)
->forMember('age', function (Employee $source) {
return date('Y') - $source->getBirthYear();
})
The forMember
method is where all of the work is. So the Dto
classes don't seem to contain much at all except a list of properties. But where would this configuration code live? If it goes into a factory class, we're going to have all of our mapping code live separately from the objects they are mapping.
And it's usually in this part of the map where we will want to be able to reduce code re-use and things. Putting it into anonymous functions may make that difficult.
We won't necessarily have PHP objects with data represented by properties -- I suppose we could use e.g. __get magic functions to make things transparent but if we're working towards using Laravel conventions I'd want to make sure this would be a compatible approach.
I think this is kind of the impasse we're at. Right now we are pretty committed to using the JSON schema files to define our entities, validate them, document them, and read/write to the database. But most PHP frameworks, including Laravel, expect the properties to be defined on the entity classes.
We can try to move in that direction, but we'll need to figure out how we allow plugins to extend entity properties, how we define the validation requirements, and how we document the entity properties. The JSON schema files give us out-of-the-box documentation for our REST API, and the documentation that's generated includes properties added by plugins. This doubles as documentation for the entities in our system. But I think that any third-party library or convention for mapping is not going to expect this.
This pattern has been adopted and merged for some entities, with other entity conversions filed and in progress. Closing.
Proposal
When we get a collection of objects from the database, we need to map these to output for the REST API based on the schema. At the moment, we use a
getProperties
method on the Service class to map the object, and any dependent objects, to the schema for the REST API (see example with Publications).However, since each property may have different dependencies, it's difficult to track these dependencies and to manage them in a performant way -- especially when the mapping works on deeply nested objects. (See example).
I set out to find a good way to implement this mapping to achieve the following:
After trying a lot of ways to overengineer a solution, I settled on a pretty simple approach that extends Laravel's Lazy Collections. This means that each collection of objects has all of the Collection methods available to it, and we can extend this with our own mapping functions.
First, each entity has its own
Collection
class which extendsLazyCollection
:When the DAO returns a collection of objects, it returns an instance of this
Collection
class:At this point, we can use all of the Laravel Collection methods on our result.
In our
\PKP\Contexts\Collection
class, we add a new method to map the object to a schema for the REST API:So now our method is discoverable like other collection methods, and all dependencies are declared
We can map to the schema:
And when we need to, we can map only the summary properties in the schema:
Because it's just a LazyCollection, it's easy for us to use Laravel's methods to write any kind of map that we want for one-off use-cases:
And even to write to a CSV:
Things get more complicated when we need to map an object that has many dependent objects, like a Submission. The number of dependencies expands, but the
mapToSchema
method is unique to eachCollection
, so all of the dependencies can be required as method arguments.In addition, we need a mechanism that permits OJS/OMP/OPS to add their own mapping for properties that are specific to the app.
The examples below show how this could work with a
Submission
. First, the baseCollection
class:Then the app-specific class which adds a property unique to OJS:
The dependent objects, like publications, are already
Collection
instances, so we can run their mapping functions and all of the required dependencies are declared:The same patterns are used then to map the
Publication
object and its dependent objects:Concerns
a. For this to work, the
DAO
method must convert the QueryBuilder's resultCollection
into aLazyCollection
like this:Will this be a performance problem on large result sets? Is there a way to use Generators here to prevent
fromRow
from getting called until it's used inCollection
?b. Are there any concerns with extending
LazyCollection
? Any chance that this will cause confusion for Laravel users or is this kind of extension of Laravel's core classes a normal thing?