ropensci / openalexR

Getting bibliographic records from OpenAlex
https://docs.ropensci.org/openalexR/

Add Attribute to objects returned from OA search #252

Closed rkrug closed 3 weeks ago

rkrug commented 1 month ago

OpenAlex is continuously growing due to the ingestion of new data, and it also changes occasionally due to changes in workflows and structure (which will become less frequent over the years). In addition, new data sources are added for ingestion.

It is therefore essential to know when searches on OA were made and what exactly they looked like.

I would therefore suggest this backward-compatible change to the results returned from calls to OpenAlex (oa_request, propagating through all other functions, e.g. oa_fetch and oa_snowball, and also the oa2df-type functions):

Additional attributes for:

  1. the timestamp when the API request was made, and
  2. the API call which was made (for easily re-running the call at a later stage)

This would make it possible to trace when a certain call was executed, and also to re-run it by passing the API call attribute of the object to an oa_request call.

This relatively simple change, which has no impact on the user, would go a long way towards the reproducibility, repeatability, and transparency of searches done with openalexR.
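A minimal sketch of the idea, as a hypothetical wrapper (not openalexR code; the attribute names `timestamp` and `api_call` are illustrative, and it is assumed that `oa_request()` takes the query URL as its first argument):

```r
# Hypothetical wrapper: attach the proposed metadata as attributes.
oa_request_meta <- function(query_url, ...) {
  res <- openalexR::oa_request(query_url, ...)
  attr(res, "timestamp") <- Sys.time()  # when the request was made
  attr(res, "api_call")  <- query_url   # the call, for re-running later
  res
}
```

Because the data itself is untouched, existing code that ignores attributes would keep working unchanged.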

Cheers,

Rainer

yjunechoe commented 4 weeks ago

Just curious - what kinds of output do you have in mind? Do you think you can share a reprex that modifies the current return object of e.g., oa_fetch() to include the information you mentioned?

I'm asking because AFAIK I don't think there's precedent/examples for this in other API packages (@trangdata are you aware of any?).

Also, I agree that having a record of the "reproducibility profile" of calls would be nice, but I'm not fully understanding why attributes attached to the returned object from oa_fetch() et al. should be the place where this is documented. I may be wrong in thinking this, but if a user is calling oa_fetch() in a controlled environment like rmarkdown, can't they specify the context through the date on the document, sessionInfo(), a code block echoing the exact call, verbose = TRUE, etc.?

One other concern I have is whether such metadata (timestamp, the R code, the constructed query) can actually be ingested at a later point to reproduce/repeat the exact behavior of the original call. As I mentioned elsewhere, complete reproducibility in output is not feasible because we're a third party to the external OpenAlex API service, and they reserve the right to change the internals (like the behavior of certain fields) or records in the database themselves (like what paper is associated with which ID). Not to say that such metadata are not important, but I worry about giving off the allure of reproducibility when {openalexR} alone cannot guarantee it.

My general opinion is that I'd rather err towards the side of cautioning users against assuming the usual style of reproducibility (e.g., by locking the package version), and stress the fact that this is an inevitable reality of third-party API packages. Whatever we end up doing on this point though, I agree that it should be easier for users to reason about these things - be it changes to the code or just more documentation (maybe a FAQ page?)

rkrug commented 3 weeks ago

> Just curious - what kinds of output do you have in mind? Do you think you can share a reprex that modifies the current return object of e.g., oa_fetch() to include the information you mentioned?

I am at the moment at a conference and will come back to that later this week - but I have some code I can show you.

> I agree that having a record of the "reproducibility profile" of calls would be nice, but I'm not fully understanding why attributes attached to the returned object from oa_fetch() et al. should be the place where this is documented.

My reasoning is:

  1. The actual data does not change, i.e. these changes are backward compatible.
  2. These are metadata to the data, and attributes are, imho, the natural place to store metadata, as they do not interfere with the data and at the same time can be read easily.

> I may be wrong in thinking this, but if a user is calling oa_fetch() and doing it in a controlled environment like rmarkdown, can't they specify the context through the date on the document, sessionInfo(), a code block echoing the exact call, verbose = TRUE, etc?

Sure - but I am only talking about two additional pieces of metadata to be added as attributes here, and the environment in which R is running and the package versions are another can of worms, which is definitely not part of openalexR. The additional attributes are:

  1. Timestamp of the API call (info about OA data, API, etc.)
  2. The API call which returned the results (to make transparent what was actually searched for, as well as making it possible to re-run the search easily - obviously within the borders provided by OA)
  3. Now thinking about it, the openalexR version should be included as well.

I can't think of anything else essential at the moment.

> One other concern I have is whether such metadata (timestamp, the R code, the constructed query) can actually be ingested at a later point to reproduce/repeat the exact behavior of the original call. As I mentioned elsewhere, complete reproducibility in output is not feasible because we're a third party to the external OpenAlex API service, and they reserve the right to change the internals (like the behavior of certain fields) or records in the database themselves (like what paper a Works ID points to). Not to say that such metadata are not important, but I worry about giving off the allure of reproducibility when {openalexR} alone cannot guarantee it.

Let's call it transparency. Reproducibility (in the sense of being able to re-run the code) on the openalexR side can be promoted by adding the openalexR version, while the actual OA call WILL return different results due to the developing and changing OA - it might even not run at all. But the actual transparency is essential, and by using the information, one can always re-build the call.

> My general opinion is that I'd rather err towards the side of cautioning users against assuming the usual style of reproducibility (e.g., by locking the package version), and stress the fact that this is an inevitable reality of third-party API packages. Whatever we end up doing on this point though, I agree that it should be easier for users to reason about these things - be it changes to the code or just more documentation (maybe a FAQ page?)

But this does not speak against the huge added transparency one would get when these attributes are added.

yjunechoe commented 3 weeks ago

Thanks for these clarifications (and for writing this while you're conferencing!)

If I understand correctly, the purpose of the metadata is for transparency and records-keeping (and not necessarily reproducibility), and the content of this can just be timestamp and the R code / API call.

For the purposes of moving this discussion into something more concrete, I want to push back a little on the R attributes aspect of all this. Three thoughts here:

  1. This is more of a question - how are users currently storing the results of the calls to e.g., oa_fetch()? Suppose that we add metadata to the results as attributes - will those survive upon writing out the results to some file?

  2. I'm a little puzzled by the point about transparency because, almost definitionally, R object attributes are non-transparent - they're metadata about the object that are meant to be hidden from users and kept only for internal uses (unless you write a getter/setter interface for them, which is overengineering the problem). Attributes are notoriously finicky to work with, and the average R user does not encounter this interface in daily life, so I wonder whether this is getting into "power-user" territory.

  3. When/how will users consume the attributes? And I mean this in the specific sense of: in what context do you imagine you'd write code like attr(oa_fetch_output, "timestamp")? Note that attributes in R are fragile - they can get lost through data wrangling and, of course, writing to flat files. So in order to actually keep a record of the metadata, they must be taken back out of the returned object almost immediately once you have it, and then be stored in a different way if you want them to persist.

Also, just to put the obvious back on the table, is the problem insufficiently solved by e.g., an option to message() the current time when verbose = TRUE? Or simply encouraging users to run the code in a notebook like the familiar rmarkdown if they really care about documenting their work, since that's specifically made for transparency and includes the relevant metadata by design (time and code that was run), thus trivializing these issues?
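The fragility mentioned above is easy to demonstrate in base R (self-contained sketch with a plain data frame, not actual openalexR output):

```r
# Custom attributes on a data frame are easily lost.
df <- data.frame(id = 1:3, title = c("a", "b", "c"))
attr(df, "timestamp") <- Sys.time()

attr(df, "timestamp")               # still present on the original object
attr(df[df$id > 1, ], "timestamp")  # NULL: subsetting drops custom attributes
# write.csv(df, "out.csv")          # flat files store no attributes at all
```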

rkrug commented 3 weeks ago

> Thanks for these clarifications (and for writing this while you're conferencing!)

Conferences are for thinking…

> If I understand correctly, the purpose of the metadata is for transparency and records-keeping (and not necessarily reproducibility), and the content of this can just be timestamp and API call.

Correct. And I would suggest including the version of openalexR.

> For the purposes of moving this discussion into something more concrete, I want to push back a little on the R attributes aspect of all this. Three thoughts here:

> This is more of a question - how are users currently storing the results of the calls to e.g., oa_fetch()? Suppose that we add metadata to the results as attributes - will those survive upon writing out the results to some file?

I see this as the responsibility of the user. Also, adding a class to the results would provide an incentive to store them as rds.

> I'm a little puzzled by the point about transparency because, almost definitionally, R object attributes are non-transparent - they're metadata about the object that are meant to be hidden from users, and kept only for internal uses. Attributes are notoriously finicky to work with and the average R user does not encounter this interface in daily life, so I wonder whether this is getting into "power-user" territory.

True. But they contain the information. Setters are definitely not needed (actually counterproductive), as these should only be set when making the call. Getters: if the results have a class, a print method that also prints the metadata would be an option; otherwise, information in the help. Anyway - a "normal" user is unlikely to need or ask for the metadata. So I think that additional effort on the user side would be fine, but it could at a later stage be remedied by some simple additional function (maybe in an additional package?).

> When/how will users consume the attributes? Note that attributes in R are fragile - they can get lost through data wrangling and, of course, writing to flat files. So in order to actually keep a record of the metadata, they must be taken back out of the returned object almost immediately once you have it, and then be stored in a different way if you want them to persist.

I am writing reports, and at the moment I am always saving a list with one object being the timestamp and the second one the actual data (historical reasons). Attributes, which are carried through by the openalexR conversion functions and other openalexR functions, would make this much easier and safer. So I can always see if a result file is older, in case of conflicts decide which one to use, or create time series of changes.

> Also, just to put the obvious back on the table, is the problem insufficiently solved by e.g., an option to message() the current time when verbose = TRUE? Or simply encouraging users to run the code in a notebook like the familiar rmarkdown if they really care about documenting their work, since that's specifically designed for transparency and thus trivializes these issues?

No. Transparency (and short-term reproducibility) needs to put no additional burden on the user - otherwise they do not use it, work is duplicated in implementing it, it is not comparable between different projects, etc. Also, in a next step, a helper function which takes the original result object and executes the API call again could be implemented, so that repeating the same search could be done from the results object. If I remember correctly, the package EcoSim implements the same for simulation models.

yjunechoe commented 3 weeks ago

I see - I'm getting the sense that this goes beyond the simple issue of attaching attributes, namely:

1) oa_fetch() et al. should return custom classed objects
2) We should encourage users to write the output to .rds (to preserve such classed objects)

I agree that these changes would address some of the issues that attributes raise, but I worry that in doing so this design introduces greater complexity elsewhere, since it breaks the simplicity of working with functions returning data frames and tibbles. If the issue of attributes necessarily involves the broader question of whether we should have a custom S3/S4 class for objects returned by {openalexR}, I think the answer is less clear cut - it depends more on the capability and design principles of the maintainers (as for myself, I typically avoid OOP if possible, as do many other lightweight API wrapper packages). In any case, I think we need to first resolve the prior question of whether implementing and maintaining an S3/S4 style system for outputs is feasible.

As an aside, (2) is surprising to me - that we'd want to encourage users to write results to .rds. I think this would represent a move away from the practices of a typical user of {openalexR}, who'd prefer to write the results in a more general format not tied to R (e.g., .csv).

Lastly, when you say:

> at the moment I am always saving a list with one object being the timestamp, and the second one the actual data (historical reasons).

I imagine this to be a simple wrapper, like:

oa_fetch2 <- \(...) list(time = Sys.time(), data = oa_fetch(...))

Would you rather have this kind of output (metadata in list or attr or otherwise) be accessible via oa_fetch(output = ...) (no need for OOP)?

rkrug commented 3 weeks ago

I see - I'm getting the sense that this goes beyond the simple issue of attaching attributes, namely:

> oa_fetch() et al. should return custom classed objects

An S3 class is simply another attribute set by class(), and it does not need to be used in the package itself, but it can form the basis of additional packages building on openalexR. So I do not see any problem here.

> We should encourage users to write the output to .rds

This is up to the user - they have to know "there is additional info in the attributes, and only rds conserves it". But it is the same with all other objects. Csv is not 1-to-1. So I see no problem here.

> I agree that these changes would address some of the issues that attributes raise, but I worry that in doing so this design introduces greater complexity elsewhere since it breaks the simplicity of working with functions returning data frames and tibbles.

Well - if I want the simplicity, I still have it, as these are additional attributes which are simply stripped when saving to csv. On the internal side, I will send you some ideas on how this can be implemented, e.g. in the conversion functions: save the attributes at the beginning of the transformation, do the transformation as usual, and before returning the result, write the attributes back to the result (obviously under a different class). These two additional steps can be wrapped into one or two helper functions.

> If the issue of attributes necessarily involves the broader question of whether we should have a custom S3/S4 class for objects returned by {openalexR}, I think the answer is less clear cut - it depends more on the capability and design principles of the maintainers (as for myself, I typically avoid OOP if possible, as do many other lightweight API wrapper packages). In any case, I think we need to first resolve the prior question of whether implementing and maintaining an S3/S4 style system for outputs is feasible.

As I said - the class (S3 is by far the easiest - just an additional attribute) does not introduce any big additional work to the package, as it does not need to be used internally; only the class should be set before returning the object. OOP can flow into the package at a later stage, but it doesn't need to (this is the beauty of S3 classes). And this promise that an object is of a certain openalexR class makes it much easier to write add-on packages.
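The save-and-restore pattern described above could look roughly like this (hypothetical helper names, not existing openalexR functions; attribute names are illustrative):

```r
# Hypothetical helpers to carry metadata attributes through a transformation.
oa_meta <- function(x) {
  list(timestamp = attr(x, "timestamp"), api_call = attr(x, "api_call"))
}

restore_oa_meta <- function(result, meta) {
  attr(result, "timestamp") <- meta$timestamp
  attr(result, "api_call")  <- meta$api_call
  result
}

# Inside a conversion function, the pattern would be:
#   meta <- oa_meta(raw)
#   out  <- some_transformation(raw)   # transformation as usual
#   restore_oa_meta(out, meta)         # write the attributes back
```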

> As an aside, (2) is surprising to me - that we'd want to encourage users to write results to .rds. I think this will represent a move away from the practices of a typical user of {openalexR}, who'd prefer to write the results in a more general format not tied to R (e.g., .csv).

Again: a workflow which uses csv files is not affected by the changes - they are completely backward compatible - but other workflows with additional standardised metadata concerning transparency will be possible, as the metadata are built in.

> Lastly, when you say:

> > at the moment I am always saving a list with one object being the timestamp, and the second one the actual data (historical reasons).

> I imagine this to be a simple wrapper, like:

> oa_fetch2 <- \(...) list(time = Sys.time(), data = oa_fetch(...))
>
> Would you rather have this kind of output (metadata in list or attr or otherwise) be accessible via oa_fetch(output = ...) (no need for OOP)?

No - because this is not as user-friendly: I would always have to write result$data to get the data (which is also not backward compatible).

Hope this clarifies why I consider additional attributes (including an S3 class, which is simply another attribute) to be the cleanest and easiest solution (as casual users do not even realise it is there).

yjunechoe commented 3 weeks ago

Sorry, one last attempt at the path of least resistance:

How about this behavior for output = "with_meta" (name tentative):

oa_fetch2 <- \(...) structure(oa_fetch(...), call = Sys.call(), timestamp = Sys.time())

Now you do not need result$data - you can saveRDS(result) as-is. Users, should they wish for reproducibility, can simply use output = "with_meta" and save to .rds. IMO this should not be too much to ask for the power users who care about object attributes being stored in the .rds files that they choose to write to. We can make sure that this output format is documented accordingly and, if further needed, add an example in the readme/vignette showing how it works.
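For illustration, consuming such metadata could look like this (sketch only; the `call`/`timestamp` attribute names follow the hypothetical wrapper above and are not an existing openalexR feature):

```r
# Build an object shaped like the hypothetical "with_meta" output.
result <- structure(data.frame(id = 1:2),
                    call = quote(oa_fetch(entity = "works")),
                    timestamp = Sys.time())

f <- tempfile(fileext = ".rds")
saveRDS(result, f)           # .rds preserves attributes
restored <- readRDS(f)

attr(restored, "timestamp")      # when the original call was made
deparse(attr(restored, "call"))  # the code, available for re-running later
```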

Others can chime in, but personally I'm not fully convinced by the scope of having an S3 class as benefiting enough users - most users will still continue to write to flat files like .csv and, if they care about reproducibility/transparency, they will default to the more familiar tools for these (like running the code in rmarkdown files, producing a document that records the timestamp and the R code involved). I think your point is that this is net-zero for casual users and net-positive for power users, but my decision calculus is that this is at best net-zero (with a risk of a negative) for casual users, a small net-positive for power users (who, in their diversity, may not even agree with the kind of metadata we decide to attach, requiring them to write their own custom behaviors anyways), and a non-trivial net-negative for the developers who have to implement and maintain this change in the internals for the time to come.

I'd also like to again raise the point that other API wrappers do not do this, and I think this is for good reason: it creates friction for users (users may be reluctant to trust a custom classed object to behave exactly like a data frame) and for developers (having an S3 class just to store two flat attributes feels like somewhat of an overkill - typically OOP is a strategy reserved for stateful objects or domain-specific objects that extend generics with custom methods).

I'm somewhat ok with the middle ground (introducing a new output type) but I think my mind is made up on introducing S3 to the package - so I'll now wait for others to share their own thoughts. I agree that you raise some important points about transparency, but I'm sorry that this has come down to an issue of design principle (at least for me).

rkrug commented 3 weeks ago

> Sorry, one last attempt at the path of least resistance:

> How about this behavior for output = "with_meta" (name tentative):

> oa_fetch2 <- \(...) structure(oa_fetch(...), call = Sys.call(), timestamp = Sys.time())
>
> Now you do not need result$data - you can saveRDS(result) as-is. Users, should they wish for reproducibility, can simply use output = "with_meta" and save to .rds. IMO this should not be too much to ask for the power users who care about object attributes being stored in the .rds files that they choose to write to. We can make sure that this output format is documented accordingly and, if further needed, add an example in the readme/vignette showing how it works.

I understand your resistance to introducing this. But I see it as an investment which opens many possibilities and keeps the simplicity of the current usage pattern.

Transparency in a package should not be something one has to opt into; it should be included automatically. Especially if no usage pattern needs to change and it does not incur disproportionate costs for the user or the maintainer.

> Others can chime in, but personally I'm not fully convinced by the scope of having an S3 class as benefiting the casual users - they will still continue to write to flat files like .csv and, if they care about reproducibility/transparency, they will default to the more familiar tools for these (like running the code in rmarkdown files, producing a document that records the timestamp and the R code involved).

I completely agree - and this is the beauty of it. No impact on the casual user, and if they "graduate" to an advanced user, their usage patterns (except saving to rds instead of csv) also do not have to change. I see that as a HUGE advantage.

> Crucially, given that the attributes are invisible, users will simply not be thinking about things like the timestamp data attached to a dataframe in ordinary usage - it'd be as if they never existed (unless you'd want a print method that shows these attributes, but that'd be like having a permanent verbose = TRUE for the output).

Well - the tibble print method also prints additional info and is quite verbose. But this is a "nice to have" and not relevant in this discussion.

> I'd also like to again raise the point that other API wrappers do not do this, and I think this is for good reason: it creates friction for users (users may be reluctant to trust a custom classed object to behave exactly like a data frame)

Well - would they? The object would have two classes, and if no method for the first class exists, the method for the second one would be chosen (inheritance).

> and for developers (having an S3 class just to store two flat attributes feels like somewhat of an overkill

It is an easy investment into expandability. One example is the print methods; one could write summary methods in new packages, etc.

> typically OOP is a strategy reserved for stateful objects or domain-specific objects that extend generics with custom methods)

Typically yes - but it can also be the glue which keeps multiple packages together.

> I'm somewhat ok with the middle ground (introducing a new output type) but I think my mind is made up on introducing S3 to the package - so I'll now wait for others to share their own thoughts. I agree that you raise some important points about transparency, but I'm sorry that this has come down to an issue of design principle (at least for me).

Let's continue the discussion. I will look into drafting a pull request which clarifies my ideas and which would implement additional attributes in the output objects (including S3 class attributes). As I said, no further changes are needed to include these; I do not see a significant cost, but many possibilities for growth of the package - which I think will grow in the same way that OpenAlex is growing.

trangdata commented 3 weeks ago

I am in 100% agreement with @yjunechoe here, especially this point:

my decision calculus is that this is at best net-zero (with a risk of a negative) for casual users, a small net-positive for power users (who, in their diversity, may not even agree with the kind of metadata we decide attach, requiring them to write their own custom behaviors anyways), and a non-trivial net-negative for the developers who have to implement and maintain this change in the internals for the time to come.

I'm closing this issue and keeping openalexR as is so we can keep maintaining its ease of use, reducing the learning curve for new users, and ensuring better maintainability and stability over time. Thank you both for this discussion.

rkrug commented 3 weeks ago

Dear Trang,

I do not want to continue the discussion and will just express my opinion that this small change would have opened the door to further improvements at, imho, no cost for casual users and minimal cost to the maintainers. It is a pity that the inclusion of metadata to increase the transparency (and, to a certain extent, reproducibility) of a package which uses an open resource with the explicit goal of being transparent is regarded as a burden to the user and the maintainers. This leads to inconsistent implementations of the metadata, customizations of code, and transparency and reproducibility problems.

If you change your mind and would like to follow this suggestion and open this package to power users without losing casual users, please feel free to contact me, as I would be willing to help on that front.

Cheers, and I will continue using your package,

Rainer