Proposal: Always inline people

BrianVallelunga commented 10 years ago

The current relationships between a person and the rest of the models are unworkable in a system that allows person updates through the API. People need to be related to, but independent from, our other models. Details of the problem are discussed in my email "The Tyranny of HAL." To solve this, I propose we always inline person data into all of our other models and let the server determine which relationships are present between people and the other models.

By always inlining people, we solve several issues at once:

The API is simplified
Tasks such as creating and updating a donation or event become a single API call
The API remains RESTful without needing to add action-specific endpoints (Josh's actions proposal).
Since the server will decide when to update a person and what relationships to make, there are no data integrity issues that can be caused by a naive client.
A person resource can still link to its related resources such as donations, events, etc.

I've come to this conclusion after several months of consideration and worry about data integrity and API simplification. Josh's recent suggestions regarding actions and fuzzy matching attempted to patch over some of the problems I've had with the API, but didn't solve the root issue.

This proposal may seem a bit radical, but I think it is the best way forward.

mpaquette1 commented 10 years ago

I agree with the in-lining approach and think it will result in the fewest headaches and the easiest-to-follow spec. Each API call will have at most one ID to consider when PUTting a resource.

By in-lining the relevant person data in an event record (per the example in Brian's Tyranny of HAL), the data content is much simpler to create, read and understand, compared to the presence of a separate match object for each person that is related to the event. The in-lined data is an implied match object because the server may or may not use it to update the related records internally. But the client doesn't have to, and shouldn't, decide which related records get updated. The client is just concerned with the data at hand, e.g., about the event.

In-lining puts the onus on the server to keep things straight, and that's where it should be. The server has to defend itself against all the incompetent and malicious clients out there, so it makes sense to put all or most of the business rules on the server while relieving the client of the burden of tracking and managing related records. The server has to assume they are irresponsible anyway.

j-ro commented 10 years ago

We at the Action Network agree with all of this. We think about people in reference to actions, so you can't update the people object directly, but rather do it through actions. The server handles any matching and upserting that should or should not occur.

BrianVallelunga commented 10 years ago

Before I send a pull request for this, I'm trying to work out some of the consequences and could use some input. For reference, here's a diagram of some of our current models and relationships:

[Diagram: person-relationships]

I think the issues we've come across with our current models are fundamentally caused by our desire for people to serve three related, but distinct goals.

The first goal is to keep track of independent, real-life people, with some sort of identity. This is the role of the people collection and of the identifiers assigned to people.

The second goal of the person relationships is to indicate which person took which action, and in which role. For example, we want to know that a single real-life "Bob Smith" organized one event, created another, made a donation, and answered a question.

The third goal is to encapsulate similar, but not necessarily identical, information about a person associated with one of the relationships. For example, biographical and employment details about a person making a donation, or contact details about a person attending an event, answering a question, etc.

Our model achieves the first two goals well. Where our model fails is in the third goal. We're assuming it is acceptable for the contact information that describes an event organizer to be changed by the contact information added by a donation, by a survey, or by any other future model with a person relationship.

One way to solve this problem is to always inline the person in the related models and to keep the people collection independent. In this way, updating a single model such as an event won't cause any external changes.

Unfortunately, if the person is treated as an inline part of the other models, this affects the second goal of tracking relationships between the independent person resources and the other models. The server can figure out these relationships and add HAL links from a person resource to the other model resources. For example:

GET /api/people/4/donations

{
    "_links": {
        "self": "http://osdi.trilogyinteractive.com/api/people/4/donations",
        "donations": [
            { "href": "http://osdi.trilogyinteractive.com/api/donations/1" }
        ]
    }
}
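
One way a server might derive these links is sketched below in Python. This is only an illustration under stated assumptions: donations are assumed to carry an inlined `donor` with `email_addresses`, matching is assumed to be a case-insensitive email comparison, and the function name `donation_links_for_person`, the `donor` field, the `id` fields, and `base_url` are hypothetical rather than part of the spec.

```python
def donation_links_for_person(person, donations, base_url):
    """Build a HAL-style links document for one person's donations.

    Assumption: a donation belongs to a person when any inlined donor
    email matches any of the person's emails (case-insensitive).
    """
    person_emails = {
        e["address"].lower() for e in person.get("email_addresses", [])
    }
    matched = []
    for donation in donations:
        donor_emails = {
            e["address"].lower()
            for e in donation.get("donor", {}).get("email_addresses", [])
        }
        # Any shared email address counts as a match in this sketch.
        if person_emails & donor_emails:
            matched.append({"href": f"{base_url}/donations/{donation['id']}"})
    return {
        "_links": {
            "self": {"href": f"{base_url}/people/{person['id']}/donations"},
            "donations": matched,
        }
    }
```

The client never states the relationship; it only submits donations with inlined donor data, and the server computes the links when the person resource is read.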

The question I have is whether we should add links from the various models to the related people models. Does it seem confusing to include both the inline data and links to the related people? How do we convey that the related people are for reference only and should not be updated if the intent is to update the main model itself? Perhaps we could create a convention using the naming of the links, like "creator_reference"? Example:

GET /api/events/1
{
    "identifiers": ["trilogy:1"],
    "created_at": "2013-12-12 05:00:00",
    "modified_at": "2013-12-12 05:00:00",
    "title": "Sample Event",
    "creator": {
        "given_name": "John",
        "family_name": "Smith",
        "email_addresses": [ { "address": "john_smith@trilogyinteractive.com" } ]
    },
    "organizer": {
        "given_name": "Fred",
        "family_name": "Smith",
        "email_addresses": [ { "address": "fred_smith@trilogyinteractive.com" } ]
    },
    "_links": {
        "self": "http://osdi.trilogyinteractive.com/api/events/1",
        "creator_reference": "http://osdi.trilogyinteractive.com/api/people/1",
        "organizer_reference": "http://osdi.trilogyinteractive.com/api/people/2"
    }
}

joshco commented 10 years ago

It's really goal 3 that is causing us the most difficulty.

Reading data

My view is that, as far as reading data goes, we should be using RESTful / associated entities / linked resources (LR) / pick your term. This is where HAL does its thing.

The _embedded resources provide the actual instance data of the LR for client ease. Our $expand query parameter allows the client to customize what comes back (there is a default, like one level down). We might also have a smaller _embedded representation to keep from sending back too much data.
There should also be a _link to that resource in the response.

Writing data

When writing data / actions, like recording a donation, the included bio information serves two possible purposes:

1) To match/locate an existing resource and then link the action to that resource, e.g., associate this donation with person=josh.
2) To update that associated resource, or create a new resource if one does not exist.

This is where we discussed it being server dependent with a possible requested mode from the client.
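
A rough Python sketch of how a server could honor such a requested mode follows. Everything here is an assumption made for illustration: the mode names `match_only` and `match_and_update`, the function `record_action`, the in-memory `people` dict keyed by lowercased email, and email-only matching are all hypothetical and not defined by the thread or the spec.

```python
def record_action(people, action, person_field, mode="match_and_update"):
    """Link an incoming action to a person using its inlined person data.

    people: dict mapping lowercased email -> person dict (hypothetical store).
    mode:   "match_only" links to an existing person without changing it;
            "match_and_update" also merges the inlined fields into the
            match, or creates a new person when nothing matches.
    Returns the matched or created person, or None if nothing was linked.
    """
    inlined = action[person_field]
    emails = [e["address"].lower() for e in inlined.get("email_addresses", [])]
    key = next((e for e in emails if e in people), None)

    if key is None:
        if mode == "match_only" or not emails:
            return None                  # nothing to link to
        person = dict(inlined)           # purpose 2: create a new resource
        for email in emails:
            people[email] = person
        return person

    person = people[key]                 # purpose 1: match an existing resource
    if mode == "match_and_update":
        person.update(inlined)           # purpose 2: update the matched resource
    return person
```

A server remains free to ignore the requested mode; the point of the sketch is only that the decision is made server-side, from the same inlined payload in both cases.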

Example

Given your example, it becomes the following. There are some syntax issues WRT the link hash, but let's ignore those for now.

GET /api/events/1
{
    "identifiers": ["trilogy:1"],
    "created_at": "2013-12-12 05:00:00",
    "modified_at": "2013-12-12 05:00:00",
    "title": "Sample Event",
    "_embedded": {
        "creator": {
            "given_name": "John",
            "family_name": "Smith",
            "email_addresses": [ { "address": "john_smith@trilogyinteractive.com" } ]
        },
        "organizer": {
            "given_name": "Fred",
            "family_name": "Smith",
            "email_addresses": [ { "address": "fred_smith@trilogyinteractive.com" } ]
        }
    },
    "_links": {
        "self": "http://osdi.trilogyinteractive.com/api/events/1",
        "creator": "http://osdi.trilogyinteractive.com/api/people/1",
        "organizer": "http://osdi.trilogyinteractive.com/api/people/2"
    }
}

j-ro commented 10 years ago

Does any system really work like this? I'm not talking API models, but the actual core model of the system.

For example, Catalist will allow you to store various contact details under one person, but it's not as disaggregated as this. There is still some matching and updating going on. Certainly the online organizing systems like BSD, Salsa, Trilogy, the Action Network, etc. don't work like this, I believe -- if you take an action with different information, it updates the core record (or just creates a new record entirely, if there's no easy way to match). I don't really think they're storing different data for each transaction over the long haul -- there's combining and updating going on to keep records in sync, minimize disaggregation and database size and complexity, etc.

I'm sure the NGP side of NGP VAN stores data on donations separately for compliance purposes but otherwise, I don't think they work like this, do they?

I think use case number 3 is very edge-y, and we're trying to force systems to behave in ways that they don't currently. In fact, it's hard for me to think of another use for case 3 other than donation data for compliance. So maybe we want a new model for just pulling that data (in which case the person model isn't even really relevant -- the goal of that piece of the API would just be to pull all transaction data as it was posted into the system, so you don't need to match up to people at all really because there are other ways of pulling which people are donors and what they donated with the normal /donors call or whatever it ends up being).

I guess what I'm saying is, certainly for us at the Action Network, this is not how data is stored. I have a feeling it's not how most other systems store it either. It would be impossible for us to present data in the above model, because we don't store this data. If you post an event with different information, your core information is indeed updated. So we couldn't present both transaction data and a link to a person as different things -- they are the same for us. Consumers of our API, however, according to this spec, would be expecting different.

BrianVallelunga commented 10 years ago

Jason,

We certainly do store the transaction data separately from our main people collection. I'm surprised that other systems don't do this and that may be why I'm the most concerned about the current spec. I'll give two examples.

First, if a person signs up on a site with email and zip code, we store that in a person. We also store the raw form data. Then, later, if the same person signs up with more data, we append the person record and still save the raw form data.
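
The sign-up flow above can be sketched in Python as follows. This is a minimal illustration, assuming that "append" means filling in only the fields the person record does not already have; the function `sign_up` and the flat field names are hypothetical.

```python
def sign_up(person, submissions, form_data):
    """Store the raw form data and append any new fields to the person."""
    # Keep an untouched copy of exactly what was submitted.
    submissions.append(dict(form_data))
    for field, value in form_data.items():
        # Assumption: existing person fields are never overwritten here;
        # only missing fields are added.
        person.setdefault(field, value)
    return person


person, submissions = {}, []
sign_up(person, submissions, {"email": "a@example.org", "postal_code": "20009"})
sign_up(person, submissions, {"email": "a@example.org", "given_name": "Ann"})
```

After the second call the person record holds all three fields, while `submissions` still holds both forms exactly as they were posted.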

The second example is in a system like events. For us, the event information is stored whole and updated independently. We can then key off of email or name and address to link an event organizer or attendee back to our people collection. As a result, we maintain the fidelity of information (from both the event and person points of view) and we get the desired relationships established.

In a system such as yours, I actually don't see an issue projecting the data to a higher level of fidelity. In your case, you could simply provide the same data in both the inline and linked representations. Behind the scenes you'd be pulling from the same person model, but it wouldn't affect your ability to present that data in the format I've suggested.
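
As an illustration of that projection, here is a small Python sketch for a system that keeps only a core person record. The function `render_event`, the `creator_id` / `organizer_id` fields, and the `people` lookup are hypothetical; the `*_reference` link names follow the convention floated earlier in this thread.

```python
def render_event(event, people, base_url):
    """Render an event with inlined people and *_reference links.

    Assumption: the system keeps no per-event copy of person data, so the
    inlined block is projected from the same core person record that the
    link points to.
    """
    doc = {
        "title": event["title"],
        "_links": {"self": f"{base_url}/events/{event['id']}"},
    }
    inline_fields = ("given_name", "family_name", "email_addresses")
    for role in ("creator", "organizer"):
        person_id = event.get(f"{role}_id")
        if person_id is None:
            continue  # role not set on this event
        person = people[person_id]
        # Inline representation: a subset of the core person record.
        doc[role] = {k: person[k] for k in inline_fields if k in person}
        # Linked representation: a reference to that same record.
        doc["_links"][f"{role}_reference"] = f"{base_url}/people/{person_id}"
    return doc
```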

j-ro commented 10 years ago

Or, thinking another way, if you really want to allow this level of complexity, then I think you need a separate [action type]_transaction link for every action type (donations, events, etc.), in addition to the people links. This way, there are two resources for each action: the transaction-level data as it was submitted, and the person data the server thinks is related to this action. This would allow systems like ours that don't store the transaction-level data to just omit it, and keep the API performing as expected across all systems.

So, like this:

GET /api/events/1
{
    "identifiers": ["trilogy:1"],
    "created_at": "2013-12-12 05:00:00",
    "modified_at": "2013-12-12 05:00:00",
    "title": "Sample Event",
    "_embedded": {
        "creator": {
            "given_name": "John",
            "family_name": "Smith",
            "email_addresses": [ { "address": "john_smith@trilogyinteractive.com" } ],
            "_links": {
                "self": {
                    "href": "http://osdi.trilogyinteractive.com/api/people/1"
                }
            }
        },
        "organizer": {
            "given_name": "Fred",
            "family_name": "Smith",
            "email_addresses": [ { "address": "fred_smith@trilogyinteractive.com" } ],
            "_links": {
                "self": {
                    "href": "http://osdi.trilogyinteractive.com/api/people/2"
                }
            }
        }
    },
    "_links": {
        "self": "http://osdi.trilogyinteractive.com/api/events/1",
        "creator": "http://osdi.trilogyinteractive.com/api/people/1",
        "organizer": "http://osdi.trilogyinteractive.com/api/people/2",
        "creator_transaction": "http://osdi.trilogyinteractive.com/api/events/1/creator_transaction",
        "organizer_transaction": "http://osdi.trilogyinteractive.com/api/events/1/organizer_transaction"
    }
}

In this example, I've embedded the creator/organizer person objects and linked only to the transaction info, because that seems to have broadest support. Though I suppose systems could choose to have both embedded if they want. But this way, I can just not include the creator_transaction links because I don't support them.

j-ro commented 10 years ago

I think the problem with always providing the same data at both the transaction level and the people level, as we would need to do, is that it's not what the API's docs and conventions would say it is. People would be expecting different data, and it would always be the same because of how our system works. Seems like that's problematic, no?

j-ro commented 10 years ago

I could even see leaving embedding off altogether, since it privileges one type of data (either core person or transaction data) over the other when we don't really know what the user making the get request cares about.

So, the convention for each action type would be you have all of the data about the action (say, title, date, location for events) at the top level, and then links to the person models and transaction data associated with them, always coming in pairs. So a more full example might be:

GET /api/events/1
{
    "identifiers": ["trilogy:1"],
    "created_at": "2013-12-12 05:00:00",
    "modified_at": "2013-12-12 05:00:00",
    "title": "Sample Event",
    "status": "confirmed",
    "location": {
        "address_lines": [
            "1806 Belmont Rd. NW #5"
        ],
        "locality": "Washington",
        "region": "DC",
        "postal_code": "20009",
        "country": "US",
        "language": "en",
        "location": {
            "latitude": 38.919,
            "longitude": -77.0379,
            "accuracy": "Approximate"
        }
    },
    "_links": {
        "self": { "href": "http://osdi.trilogyinteractive.com/api/events/1" },
        "creator" : { "href": "http://osdi.trilogyinteractive.com/api/people/1" },
        "creator_transaction" : { "href": "http://osdi.trilogyinteractive.com/api/events/1/creator_transaction" },
        "organizer" : { "href": "http://osdi.trilogyinteractive.com/api/people/2" },
        "organizer_transaction" : { "href": "http://osdi.trilogyinteractive.com/api/events/1/organizer_transaction" },
        "attendance": { "href": "http://osdi.trilogyinteractive.com/api/events/1/attendance" },
        "attendance_transactions" : { "href": "http://osdi.trilogyinteractive.com/api/events/1/attendance_transactions" }
    }
}

And then, each thing (attendance in the realm of the core people model vs. the transactional level data) is a separate thing. So I could get the attendance people as the system thinks they are, linking back to related records:

GET /api/events/1/attendance

{
    "per_page": 25,
    "page": 1,
    "total_records": 17,
    "_embedded" : {
        "osdi:attendance": [
            {       
                "identifiers": [
                    "trilogy:1"
                ],
                "status": "accepted",
                "_embedded" : {
                    "osdi:person": [
                        {
                            "given_name": "Jason",
                            "family_name": "Rosenbaum",
                            "identifiers": [
                                "trilogy:1"
                            ],
                            "created_at": "2013-07-12T22:21:05Z",
                            "modified_at": "2014-02-10T19:30:11Z",
                            "email_addresses": [
                                {
                                    "primary": true,
                                    "address": "seminal@theseminal.com"
                                }
                            ],
                            "postal_addresses": [
                                {
                                    "primary": true,
                                    "address_lines": [
                                        "1806 Belmont Rd. NW #5"
                                    ],
                                    "locality": "Washington",
                                    "region": "DC",
                                    "postal_code": "20009",
                                    "country": "US",
                                    "language": "en",
                                    "location": {
                                        "latitude": 38.919,
                                        "longitude": -77.0379,
                                        "accuracy": "Approximate"
                                    }
                                }
                            ],
                            "_links": {
                                "self": {
                                    "href": "https://osdi.trilogyinteractive.com/api/people/1"
                                },
                                "osdi:events": {
                                    "href": "https://osdi.trilogyinteractive.com/api/people/1/events"
                                }
                            }
                        }
                    ]
                },
                "_links": {
                    "self": {
                        "href": "https://osdi.trilogyinteractive.com/api/events/1/attendance/1/"
                    }
                }
            }
            ...
        ]
    }
}

Or I can get the transaction level data, basically exactly how it was posted originally, with likely a lot less data (as an event RSVP could potentially just be a name and email address, for example):

GET /api/events/1/attendance_transactions

{
    "_embedded" : [
        {
            "identifiers": [
                "trilogy:1"
            ],
            "given_name": "Jason",
            "family_name": "Rosenbaum",
            "email_addresses": [
                {
                    "address": "seminal@theseminal.com"
                }
            ],
            "_links": {
                "self": {
                    "href": "https://osdi.trilogyinteractive.com/api/events/1/attendance_transactions/1/"
                }
            }
        },
        {
            "identifiers": [
                "trilogy:2"
            ],
            "given_name": "John",
            "family_name": "Doe",
            "email_addresses": [
                {
                    "address": "john@theseminal.com"
                }
            ],
            "_links": {
                "self": {
                    "href": "https://osdi.trilogyinteractive.com/api/events/1/attendance_transactions/2/"
                }
            }
        }
        ...
    ]
}

This way, each thing is separate -- if you want the actual transaction-level data as it was posted, you can call it through the event (because that's the only place it should live, as it's transaction data related to that event). If you want the people models the server thinks are related, you can call that. For systems like ours that don't save the transaction data in that way, we just won't have those links.
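
A short Python sketch of how a server might emit these paired links only when it actually stores transaction data. The function `event_links`, the `stores_transactions` flag, and the `creator_id` / `organizer_id` fields are hypothetical names chosen for illustration; the link relation names are the ones used in the examples above.

```python
def event_links(event, base_url, stores_transactions):
    """Build the _links hash for an event.

    Person links are always present. The *_transaction links are added
    only when the system keeps transaction-level data; otherwise they
    are omitted rather than pointed at synthetic data.
    """
    event_url = f"{base_url}/events/{event['id']}"
    links = {
        "self": {"href": event_url},
        "creator": {"href": f"{base_url}/people/{event['creator_id']}"},
        "organizer": {"href": f"{base_url}/people/{event['organizer_id']}"},
        "attendance": {"href": f"{event_url}/attendance"},
    }
    if stores_transactions:
        for rel in ("creator_transaction", "organizer_transaction",
                    "attendance_transactions"):
            links[rel] = {"href": f"{event_url}/{rel}"}
    return links
```

A client can then treat the absence of a `*_transaction` link as "this provider does not keep that data", with no extra flag needed.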

mpaquette1 commented 10 years ago

Our system at thedatabank updates the main person record after a careful attempt to match incoming form data to existing person data, and after giving the administrator an opportunity to compare the incoming form data with existing person data before merging records. This has generally worked well, but there are still occasional problems where, due to cached form fields or forwarded personalized emails, one person's incoming data overwrites another's existing data. We have prompts on our forms to ask people who arrive via a personalized link in an email ("Are you really Tammy Johnston?"), which people will sometimes answer wrongly. And administrative users don't always carefully examine the data before hitting the "merge" button.

No matter what safeguards we put in, one fundamental problem is that the system's workflow presumes newer data to be authoritative simply because it's newer. This makes recovery difficult when that presumption is wrong and nobody spots it before the data is committed.

In thedatabank's system, I would like to remove that presumption of "newer-data-is-better-data," and to that end I see Brian's proposal as positive. It's a way to ensure that OSDI will support (but not dictate) a loosely coupled and robust data model. The normalized relational data model was designed around the assumed need to optimize disk space by not storing redundant data, but disk space is rarely a limitation nowadays. Data integrity and data resilience are bigger issues now, and with inline person fields in the action models, we allow providers to address those issues (...or not) while being agnostic on the structure of the underlying database.

Mark Paquette


joshco commented 10 years ago

If we want to do transaction records, then I suggest we create a new resource, such as either:

1) donation_transaction, rsvp_transaction, x_transaction (per resource).
2) A generic transaction which has a type attribute.


j-ro commented 10 years ago

Right -- I think option 1 makes more sense, as each transaction is going to have different fields (donations will have addresses, event RSVPs probably won't). And some things (like events) have multiple transactions per event (the creation transaction, and attendee transactions).
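
To make the difference between the two options concrete, here is a hedged Python sketch. The class names and every field in them are hypothetical; they only illustrate that per-resource types can carry different required fields, while a generic type pushes that variation into an untyped bag.

```python
from dataclasses import dataclass, field


@dataclass
class DonationTransaction:
    """Option 1: one resource type per action, with its own fields."""
    given_name: str
    family_name: str
    email_address: str
    postal_address: dict  # assumed to be required for donations
    amount: float


@dataclass
class RsvpTransaction:
    """Option 1: an RSVP may need far less than a donation."""
    given_name: str
    family_name: str
    email_address: str


@dataclass
class GenericTransaction:
    """Option 2: a single resource with a type attribute."""
    type: str
    fields: dict = field(default_factory=dict)


rsvp = RsvpTransaction("Jason", "Rosenbaum", "seminal@theseminal.com")
generic = GenericTransaction(type="rsvp", fields={"given_name": "Jason"})
```

With option 1, a missing address on a donation fails at construction time; with option 2, the consumer has to inspect `type` and then check `fields` itself.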

BrianVallelunga commented 10 years ago

I hope more people can comment on this thread to gain some additional perspectives. While new transaction resources, along with inlining for PUT and POST methods would work, it seems needlessly complex at this point.

Before heading down the path of new resource types, I'd like to understand what the downsides are to inlining people. As Mark points out, the proposal is meant to simplify the models and allow flexibility for providers. You still get the HAL relationships, along with model independence and server-matching. I don't see any major downsides to the solution and am genuinely seeking input from people who feel this isn't a good solution.

joshco commented 10 years ago

Can you clarify when you are talking about inlining if you mean in the read or write case?

Your first paragraph seems to argue against inlining (for write) due to complexity:

"...along with inlining for PUT and POST methods would work, it seems needlessly complex at this point."

However, the second paragraph seems to argue for inlining:

"I'd like to understand what the downsides are to inlining people."

This seems like a conflict. Confused...

j-ro commented 10 years ago

@joshco, I think Brian is saying that my proposal above for separate transaction and people resources on GET is complex.

I think inlining for POST continues to make the most sense. It's simple and easily understandable on all sides. On the sending side, the sending server is simply POSTing whatever data it has about the action that was taken as the sending system understands it. In, say, the sync scenario, even with our system that doesn't save transaction level data, we'd still be POSTing essentially transaction level data because we update when new information comes in. So our server would be sending the same stuff as someone else's, which is the point. On the receiving side, the receiving server gets to decide what to do with the data and how to match it, which is also good, for data integrity reasons.

For GET, my concern -- and it's a small-ish one, but a concern -- is that the model originally proposed makes a liar out of my server in a way. The idea is that the inlined GET people data would represent transactional level data exactly as, say, the user filled out an online form, and the person links (however they are named) would represent the person resource that the server thinks matches, along with whatever info updates the server did with that and other data that came in for this person. Since we won't have transaction-level data, we'd be synthetically creating it based on the data at the person links. Unless you had special knowledge about our system to understand that the inlined data was synthetic, you'd assume that, like other systems, this was transaction level data when it wasn't. Hence the different resources to clearly mark each type of data, and to easily allow our system to simply not include the stuff we don't have as opposed to present data as one thing when it is actually another.

That said, I suppose our system could also just not include the transaction level data that's inlined in the first proposal, and only have links to the people we think are associated. This way we don't present information (that we don't have) as something that we have. So I'd be good with that too, and it is less complex than a new resource. So, an example based on the first proposal:

GET /api/events/1
{
    "identifiers": ["trilogy:1"],
    "created_at": "2013-12-12 05:00:00",
    "modified_at": "2013-12-12 05:00:00",
    "title": "Sample Event",
    "_links": {
        "self": "http://osdi.trilogyinteractive.com/api/events/1",
        "creator_reference": "http://osdi.trilogyinteractive.com/api/people/1",
        "organizer_reference": "http://osdi.trilogyinteractive.com/api/people/2"
    }
}

BrianVallelunga commented 10 years ago

Thanks for clarifying Jason. There seems to be little disagreement that inlining for writes and letting the server do the matching makes a great deal of sense.

Jason, I'm not sure that leaving out the transaction-level data is a good idea. Given your implementation, I wouldn't say that the model makes a liar out of your server. What I would say is that your server will take all of the data sent to it, do some server-side processing, and then return what it can when asked for that data in the future.

It's also worth considering this in a real-world scenario. If you're hosting your own event system, then whatever data you provide the client is, by definition, correct. If you're reading in another system's event data for your own purposes, then it's unlikely that yet another third party is going to need to get that data from you in full fidelity. They could always go directly to the source if that's what they need. Finally, you can always upgrade your implementation to store all of the data at some future point without impacting any clients.

j-ro commented 10 years ago

From a technical standpoint, sure, we can always serve data in this format. But I think it's problematic that the data labeled transactional level on GET might or might not be actually transactional level, depending on the server. Changing definitions for the same objects between systems seems inherently against what OSDI is about -- a way to share common data among systems in a common way.

The real world scenario here with most impact I think would be donations -- since we don't support transactional level data, when another system reads out donation data for compliance purposes, they'll think they're getting the actual data entered at time of transaction when they're actually not. We have other ways to get that data (through our third party payment processor) but from an API standpoint, the user connecting our system with that other one would have no idea that's the case by reading the OSDI documentation, because the docs would say that this is indeed transactional level data, as entered by the user at the time of donation. Or, I guess, the docs could say that this isn't necessarily transactional level data because some systems don't support it, but that kind of makes the feature useless.

So, it seems to me that the easiest solution would be just to omit it if it doesn't exist -- that's the approach we're taking elsewhere when we don't support certain features (say, an end time on events -- our system only supports start times). That way there's no confusion possible. I guess another way could be to have a flag on the transaction data to say whether it's truly transactional or not, but that leaves the door open to many potential options on that flag besides just a straight true or false, and seems confusing.

Edited to add: This approach would also allow us to change how our system works in the future if we choose. We'd just then report back the inlined transaction data as opposed to omitting it.

tobowers commented 10 years ago

I like the idea of keeping "what was sent to the server" as part of the API. Inlining it seems to be the easiest to parse from the client's standpoint, but also confusing since it's a partial person. What I wouldn't do is modify that transaction data in any way and send it back "different" to the client. That's what I'd use links and embedded for...

Basically, if I only sent email up to the person on an event then the person attribute (or, I'd like person_transaction better personally) coming down should be what was sent to the server. However, the links and possibly an embedded person coming back should be the normal HAL stuff (if the server supports that).

joshco commented 10 years ago

On the Read side

I agree that keeping the meaning of the data consistent is important. It should always be clear if I am getting transaction data or relational data.

Proposal

Transactional information gets populated into donation_transaction records, which is its own collection.

When reading the donations collection, you get the relational data (via usual HAL), but you also get links/embedded for the transactional data resource.

A client can choose what collection it wants to read from, either donation_transactions or donations.

If I was generating a compliance report, I would read from donation_transactions. If I was generating a call list or email blast list, I would read from donations.

Both collections would be cross-linked so I could always navigate from one to the other if I wanted to.
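
A minimal sketch of how a client might act on this split, in Python. The collection paths, the link names `donation_transaction` and `donation`, and the idea that both records share one id are assumptions made for illustration; none of them come from the spec.

```python
# Hypothetical collection paths, named after the two collections in the proposal.
DONATIONS = "/api/donations"
DONATION_TRANSACTIONS = "/api/donation_transactions"


def collection_for(purpose: str) -> str:
    """Return the collection whose meaning matches what the client needs."""
    # Compliance reporting wants the data as entered at the time of donation;
    # everything else wants the freshest relational view of the person.
    if purpose == "compliance_report":
        return DONATION_TRANSACTIONS
    return DONATIONS


def cross_linked_pair(record_id: int):
    """Build a donation and its transaction record, each linking to the other.

    Assumes (for illustration only) that both records share the same id.
    """
    donation_url = f"{DONATIONS}/{record_id}"
    transaction_url = f"{DONATION_TRANSACTIONS}/{record_id}"
    donation = {
        "_links": {"self": donation_url, "donation_transaction": transaction_url}
    }
    transaction = {
        "_links": {"self": transaction_url, "donation": donation_url}
    }
    return donation, transaction
```

A call-list job would call `collection_for("call_list")` and read relational data, while a compliance job would read from the other collection and could still follow the `donation` link back if it needed to.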

j-ro commented 10 years ago

@joshco Sure, that's essentially what's here: https://github.com/wufm/osdi-docs/issues/93#issuecomment-35287903

With some additional embedding for convenience.

This would go across all action types -- so events would have creator and creator_transaction, attendance and attendance_transactions, etc...

joshco commented 10 years ago

@j-ro Yup, looks like we're saying the same thing on the resource collections.

Re embedding, our spec currently gives the $expand parameter to ask the server for specific embeddings if the default embedding doesn't give them what they want.

BrianVallelunga commented 10 years ago

I'm still not seeing the real-world issues arising here. If an implementer can't store something like a transactional donation, then why would he even have an OSDI donations endpoint?

Topper said "if I only sent email up to the person on an event then the person attribute ... coming down should be what was sent to the server." I have two questions about this:

Where is this true anywhere else in our API? Data on a server can be changed by any number of processes and clients, not just the original creator.
In what scenario are you going to send event data to another OSDI system where you would want that system to be the authoritative source of data?

The objections raised here mostly seem valid in a scenario that doesn't seem to exist. It's not as if anyone is being forced to build a fully-featured, turn-key implementation on top of a back-end that can't fundamentally support the API.

For Jason's donation-specific example, I think he's right to leave off the inline data that he doesn't have. I'd probably leave it off and then go build in that feature to the underlying system to meet the spec. For all of the other models, I think it matters much less whether the data is transactional or not.

When conceiving of the donation model and of the inlining proposal, I'm trying to start with an API-first approach. That is to say, regardless of back-end, what's the ideal API we could conceive of? I'm afraid we're letting today's database schemas limit an API that will hopefully outlast all of our initial implementations.

joshco commented 10 years ago

"I'm afraid we're letting today's database schemas limit an API that will hopefully outlast all of our initial implementations."

OSDI needs to be able to work with systems the way they work today. If implementing OSDI means that we need to rethink how we do things, then I don't think OSDI will be very successful.

Most systems I've worked with don't really deal with transactional data from an API perspective. It's more relational. Frankly, this is more useful to me than transactional data. The only case where I see transactional data being important is the donation reporting/compliance scenario. Even on that, there are different interpretations of what is really required. In other situations (like eventRsvp, personSignup) what I care about is having the most complete and fresh data for a person so I can contact them, turf-cut them, map them, recruit them, fundraise them etc. I wouldn't prioritize work so that I can see the actual data submitted for an eventRsvp.

Seems like the way Trilogy currently does things is unusual vs the other existing products. Nothing is wrong with that; OSDI needs to be able to accommodate that kind of behavior.

"I'm afraid we're letting today's database schemas limit an API that will hopefully outlast all of our initial implementations."

Actually, I think it is the reverse. We're letting this transactional need which is mainly for donations complicate what might otherwise be a simpler approach for all other resources.

j-ro commented 10 years ago

To answer the question on why we'd have donations when we don't have transactional level data -- we do have some transactional level data, like who donated, how much, to whom, when, etc... What we don't have is transactional-level person data.

So, if you donate in August and enter one address as your billing, and then donate again in September and use another address, when I go and pull a record for you, I'll see both donations coming from the September address, because we have one unified person model. You'll still see transactional level data about the August and September donations with respect to amounts, recipients, etc..., but the person level data is unified.

This, of course, is bad for compliance, which is why for people doing compliance we tell them they need to pull transactional level records from WePay, which processes our credit cards. Our data is for other organizing purposes.
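
To make the August/September example concrete, here is a small Python sketch of a unified person model next to an optional per-donation snapshot. The class names, the `address_at_donation` field, and the sample addresses and amounts are all made up for illustration and do not describe any real system's schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    name: str
    address: str  # a single current value; earlier values are overwritten


@dataclass
class Donation:
    month: str
    amount: float
    donor: Person  # relational: every donation points at the same person record
    # Transactional snapshot of the billing address. A system with only a
    # unified person model would always leave this as None.
    address_at_donation: Optional[str] = None


def donate(person: Person, month: str, amount: float,
           billing_address: str, keep_snapshot: bool = False) -> Donation:
    """Record a donation and update the shared person record."""
    person.address = billing_address
    snapshot = billing_address if keep_snapshot else None
    return Donation(month, amount, person, snapshot)


donor = Person("Pat Example", "")
august = donate(donor, "August", 25.0, "1 Main St")
september = donate(donor, "September", 50.0, "2 Oak Ave")
```

After both calls, `august.donor.address` and `september.donor.address` are the same September address, while the amounts stay distinct per donation. Passing `keep_snapshot=True` is what a system with transaction-level person data would effectively be doing.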

j-ro commented 10 years ago

And, I agree with @joshco that the relational data is more useful most of the time. That said, I'm happy to support the use case for transactional as long as it's clearly marked and optional for systems that don't have it. I think we can accomplish that with the original inline proposal or with my and @joshco's additional resource proposal -- both clearly differentiate between the different types of data, and both can be omitted for systems that don't support it.

BrianVallelunga commented 10 years ago

Thanks for the further explanations. If Jason's system is typical of the implementations out there, then I see only one way forward right now, which is to just leave transaction-level data out of the spec. If we move forward with inlining data for writes (and, for many people, preventing writes directly to the people collection itself), we will at least have guarded against the worst data corruption issues.

I've never argued that the relationships aren't valuable, it's just that I value the transaction-level data equally. I see them as fundamentally different things and worry we're going to run into the situation of needing both.

Let's discuss more on the call tomorrow and see if we get any additional viewpoints.

j-ro commented 10 years ago

Yeah, let's discuss, though I've actually come around to the view that having it as optional wouldn't be too hard. And I even think your original proposal, with some naming changes, might make the most sense.

Basically, since we're inlining transactional data on POST, then we should do the same on GET, if it's available.

So, on GET, if you have transactional data, you inline it, maybe under the heading creator_transaction or something. You can still link and/or embed the people resource (relational) data as normal, but this inlined data can pass along transactional stuff if you have it.

joshco commented 10 years ago

Which original? The actions or fuzzy linking?

From: Jason Rosenbaum Sent: Wednesday, February 19, 2014 9:00 AM To: wufm/osdi-docs Cc: Josh

j-ro commented 10 years ago

I think what I'm talking about is closest to your comment here: https://github.com/wufm/osdi-docs/issues/93#issuecomment-35232186

So maybe the format becomes:

GET /api/events/1
{
    "identifiers": ["trilogy:1"],
    "created_at": "2013-12-12 05:00:00",
    "modified_at": "2013-12-12 05:00:00",
    "title": "Sample Event",
    "creator_transaction": {
        "given_name": "John",
        "family_name": "Smith",
        "email_addresses": [ { "address": "john_smith@trilogyinteractive.com" } ]
    },
    "organizer_transaction": {
        "given_name": "Fred",
        "family_name": "Smith",
        "email_addresses": [ { "address": "fred_smith@trilogyinteractive.com" } ]
    },
    "_links": {
        "self": "http://osdi.trilogyinteractive.com/api/events/1",
        "creator": "http://osdi.trilogyinteractive.com/api/people/1",
        "organizer": "http://osdi.trilogyinteractive.com/api/people/2",
        "attendance": "http://osdi.trilogyinteractive.com/api/events/1/attendance"
    }
}

Then, in /attendance, you have the same thing -- each attendance would have attendance_transaction data if it exists, and the normal links to the people resource it refers to for each (with that people resource optionally _embedded too).
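
As a rough sketch of what "optional" could mean on the server side, the Python below builds the GET body and simply leaves out any `*_transaction` key the system has no data for. The input dict shape (`id`, `creator_id`, `organizer_id`) and the function name are assumptions for illustration; only the output field names come from the example above.

```python
BASE_URL = "http://osdi.trilogyinteractive.com/api"
ROLES = ("creator", "organizer")


def event_representation(event: dict) -> dict:
    """Build a GET body for an event, inlining transaction data only if present."""
    body = {"title": event["title"]}

    for role in ROLES:
        # Transactional snapshot: included as stored, omitted when the
        # system never captured it. Nothing is synthesized.
        snapshot = event.get(f"{role}_transaction")
        if snapshot is not None:
            body[f"{role}_transaction"] = snapshot

    links = {
        "self": f"{BASE_URL}/events/{event['id']}",
        "attendance": f"{BASE_URL}/events/{event['id']}/attendance",
    }
    for role in ROLES:
        # Relational link: points at the person resource the server matched.
        person_id = event.get(f"{role}_id")
        if person_id is not None:
            links[role] = f"{BASE_URL}/people/{person_id}"
    body["_links"] = links
    return body
```

A system with only relational data would return just the `_links`, and a system that stores what was submitted would return both, without either one having to present one kind of data as the other.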

tobowers commented 10 years ago

I'm starting to like the way that looks, @j-ro, and having the transaction data be optional.

BrianVallelunga commented 10 years ago

I've updated my pull request for the donation example, which I think covers what we discussed on the call. You can find it here:

https://github.com/BrianVallelunga/osdi-docs/blob/6b43549f133df267bf52e4d1171653de43bba012/donations.md

Basically, we have an optional donor field with an inline person that holds the donation data at the time of the transaction. Then we additionally have a donor_resource relationship that provides the relational person.

Additionally, I've added an employment hash to the donation that has the employer name, occupation and employment address. Per the call, I have not modified the person resource schema at all. We can discuss that with the group at large.

joshco commented 10 years ago

This is related to the point I made on the call. This pull request shows model information. Can we see what the read and write requests and the responses look like?

BrianVallelunga commented 10 years ago

Here's an updated doc with POST examples. GET requests to come, but they won't surprise anyone:

https://github.com/BrianVallelunga/osdi-docs/blob/master/donations.md

j-ro commented 10 years ago

Might want to add a version with transactional only data, no donor resource at all. But yeah, this looks like what I expected, and very nice as well!

BrianVallelunga commented 10 years ago

@j-ro Done!

itsdrewmiller commented 10 years ago

Chiming in - at NGP VAN we have pretty limited transactional history for individual resources - it's there only for some properties or only for some types of transactions depending on which application and resource type you're talking about.

I think the public value of transactional data like this is the exception rather than the rule - I don't really care what someone's phone number was 5 years ago but I may want to know what their title was when they presented before my organization. Maybe it's worth exploring where the value is in having the transactional data and see if there are specific cases where we would want to affirmatively add it to the spec?

j-ro commented 10 years ago

@itsdrewmiller We're in the same boat, and I think this proposal addresses it -- you post essentially what you have, and the receiving server handles it. So if you only have relational data, that's fine, post that. Otherwise, post more.

I think the one big use case we've identified is for donations, where it's important to have the donation data as entered by the person at the time of the donation for reporting purposes. But there are probably other use cases for other action types, so seems ok to me to add it as a general rule to the spec, making sure that it stays optional and flexible like this.

itsdrewmiller commented 10 years ago

@j-ro Even in that scenario it isn't a strong requirement - it is fine to use a person's latest information for compliance reporting (I don't think the FEC software even supports transactional person data). The only place where it can get weird is for lobbyists, but that's kind of a specific issue rather than a general transactional one.

Aside from the relational/transactional question this sounds pretty good. This doesn't replace a person endpoint completely, right? It just means posting actions (in the generic sense of the word) is always done with the action as the resource and people as properties thereof.

j-ro commented 10 years ago

@itsdrewmiller Right, this isn't a replacement for people, this is for posting and getting action records.

So, on POST, you could potentially post an action record (like a donation, with amounts, recipients, etc...) with both inline person transactional data (the info the person entered about themselves at the time of transaction) and a relational person resource (the most updated info the posting system has on the person), or one or the other. For us, it would be relational data, but some systems might want to do both, or just transactional, or whatever. The receiving server would then decide how to handle the two versions of person coming in by whatever rules the server set up.

On GET, a system could potentially show both the transactional person data and relational person data when viewing an individual action record, or one or the other, depending on how the getting system works.

But, for systems like yours and mine, we'd just be posting and getting relational data, which should all work fine.
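
Here is one way the receiving side could look, sketched in Python. The payload keys `donor` (inline transactional person) and `donor_resource` (relational person) follow the naming in Brian's donation pull request described earlier in the thread, but treating `donor_resource` as an inline dict on POST, matching on the first email address, and the rule that only relational data overwrites an existing person are all assumptions standing in for "whatever rules the server set up".

```python
from typing import Optional


def first_email(person: Optional[dict]) -> Optional[str]:
    """Return the first email address of a person dict, lowercased, if any."""
    if not person:
        return None
    addresses = person.get("email_addresses") or []
    return addresses[0]["address"].lower() if addresses else None


def receive_donation(payload: dict, people: dict, donations: list) -> dict:
    """Store a posted donation; the server, not the client, decides matching."""
    snapshot = payload.get("donor")             # as entered at transaction time
    relational = payload.get("donor_resource")  # sender's most current view

    # Assumed matching rule: the first email address identifies the person.
    email = first_email(relational) or first_email(snapshot)
    if email is not None:
        person = people.setdefault(email, {})
        if relational:
            person.update(relational)   # relational data refreshes the record
        elif not person:
            person.update(snapshot)     # a snapshot only seeds a new record

    record = {"amount": payload["amount"], "person_key": email}
    if snapshot is not None:
        record["donor"] = dict(snapshot)  # kept unmodified on the action
    donations.append(record)
    return record
```

A sender with only relational data posts `donor_resource` alone and gets a donation record with no `donor` key, which matches the "post essentially what you have" idea.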

BrianVallelunga commented 10 years ago

@itsdrewmiller I've always been told by compliance people and several donation systems that they need the data at the time of transaction for compliance purposes. I'm very surprised to hear otherwise.

itsdrewmiller commented 10 years ago

@BrianVallelunga Yeah me too, I triple checked it here because I didn't want to publicize if we were doing it wrong. :-)
