thoth-pub / thoth

Metadata management and dissemination system for Open Access books
https://thoth.pub
Apache License 2.0
Reduce list of work_status, require publication date for Active Works in Thoth #595

Open brendan-oconnell opened 2 months ago

brendan-oconnell commented 2 months ago

If a Work in Thoth has a work_status of ACTIVE, require publication_date.

ja573 commented 2 months ago

RFC @brendan-oconnell @rhigman

Reduce the list of work_status to ACTIVE and INACTIVE (both for post-publication), and FORTHCOMING, CANCELLED and POSTPONED_INDEFINITELY (for pre-publication)

Setting the following constraints on dates

publication_date

Must have (<= today)

Can have (>= today)

Must not have

withdrawn_date

Must have (<= today)

Must not have

Deprecate

Get rid of the following codes completely as they're either redundant or not relevant. When set, replace them with INACTIVE

brendan-oconnell commented 2 months ago

@ja573 I agree with deprecating the codes you mention. The impact would be minimal; currently in Thoth, there are 2 Works that are WITHDRAWN_FROM_SALE, 13 that are OUT_OF_PRINT, and the rest of the codes have no Works associated with them.

I still wonder about requiring publication_date for FORTHCOMING works... I really don't know enough about how publishers use Thoth to know if this would pose a problem for them. I suppose it would always be possible to set a dummy publication_date, then the publisher hopefully updates it if necessary when they change the work_status to ACTIVE.

rhigman commented 2 months ago

Might interact with #585.

Not sure about publication_date being mandatory for FORTHCOMING. From the point of view of Thoth as a metadata management system i.e. somewhere where users can "draft" records for potential book projects from an early stage, FORTHCOMING is the most sensible status for such drafts. Forcing users to set a publication date as soon as they create a record would introduce friction, and increase the issues we already see with "fake" dates causing confusion.

(ETA @brendan-oconnell haha, snap - I spent too long composing my draft!)

brendan-oconnell commented 2 months ago

One more piece of data: Currently 178 FORTHCOMING Works in Thoth with no publication_date vs. 12 that have a publication_date

rhigman commented 2 months ago

Good point - setting the default value in the migration would be a can of worms.

ja573 commented 2 months ago

Yeah, I'm also divided about FORTHCOMING... The reason for proposing it was that technically, a date is required by ONIX, which is understood as the expected publication date. But yes – we should avoid dummy dates

rhigman commented 2 months ago

Yes, and it would definitely become an issue if we did start regularly disseminating ONIX records prior to publication. (Although, in that case, the individual exports should be preventing the creation of ONIX files where the platform requires a publication date and the record doesn't have it.)

brendan-oconnell commented 2 months ago

Do we currently disseminate ONIX records prior to publication, and if not, is that something that's important to Thoth users?

brendan-oconnell commented 2 months ago

Might interact with #585.

Not sure about publication_date being mandatory for FORTHCOMING. From the point of view of Thoth as a metadata management system i.e. somewhere where users can "draft" records for potential book projects from an early stage, FORTHCOMING is the most sensible status for such drafts. Forcing users to set a publication date as soon as they create a record would introduce friction, and increase the issues we already see with "fake" dates causing confusion.

(ETA @brendan-oconnell haha, snap - I spent too long composing my draft!)

early May? I'm out next week, but will continue working on this when I get back.

ja573 commented 2 months ago

No, at least for now we'll only distribute post-publication. But (a) we might do so in the future (e.g. if we integrate platforms like Lightning Source) and (b) we don't know how people might use the records we output (we may not be pushing them, but people might be harvesting them).

ja573 commented 2 months ago

I think at present the only ONIX output you can generate pre-publication is Thoth's – and since it's meant to be the full implementation of ONIX, we should enforce having a publication_date set for forthcoming books

rhigman commented 2 months ago

Might interact with #585. Not sure about publication_date being mandatory for FORTHCOMING. From the point of view of Thoth as a metadata management system i.e. somewhere where users can "draft" records for potential book projects from an early stage, FORTHCOMING is the most sensible status for such drafts. Forcing users to set a publication date as soon as they create a record would introduce friction, and increase the issues we already see with "fake" dates causing confusion. (ETA @brendan-oconnell haha, snap - I spent too long composing my draft!)

early May? I'm out next week, but will continue working on this when I get back.

Sorry, I don't follow...

brendan-oconnell commented 2 months ago

It seems like it's me who didn't follow what ETA meant in this context... I thought you meant "estimated time of arrival" :)

rhigman commented 2 months ago

Ah, my fault! I was using it as "edited to add" - just to acknowledge that you made a very similar point in the time it took me to post mine :smile: - should have avoided that ambiguity!

rhigman commented 2 months ago

I think at present the only ONIX output you can generate pre-publication is Thoth's – and since it's meant to be the full implementation of ONIX, we should enforce having a publication_date set for forthcoming books

Hmm, actually, the current implementation of onix::thoth is very permissive in terms of still letting you output something even if the record is incomplete. During development, I'd been thinking of it more as a way to get one's entire record "out" of Thoth in a familiar/standard format. (All of the high-level mandatory ONIX fields are already mandatory within Thoth, so that's not a concern, but identifying this kind of interaction between fields would have required a lot of close-reading.) Not that we can't change it.

In practice, I think you can output all the other ONIX flavours pre-publication except for Google Books and Overdrive - those are the only ones which explicitly mandate a publication date.
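The check described above, where an individual export refuses to produce a file when the platform requires a publication date and the record lacks one, could look roughly like this. This is only a sketch: the format labels `google_books` and `overdrive` and the function name `can_export` are hypothetical placeholders, not Thoth's actual identifiers.

```python
from datetime import date
from typing import Optional

# Hypothetical labels for the two outputs said above to mandate a publication date.
FORMATS_REQUIRING_PUBLICATION_DATE = {"google_books", "overdrive"}


def can_export(output_format: str, publication_date: Optional[date]) -> bool:
    """Return False only when the format needs a publication date and none is set."""
    if output_format in FORMATS_REQUIRING_PUBLICATION_DATE:
        return publication_date is not None
    # All other formats are treated as permissive in this sketch.
    return True
```

With this shape, a FORTHCOMING record without a date could still be exported in the permissive formats, while the two stricter outputs would be blocked until a date is entered.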

rhigman commented 2 months ago

One more piece of data: Currently 178 FORTHCOMING Works in Thoth with no publication_date vs. 12 that have a publication_date

Odd - I make it 79 vs 21.

And 14 of those 21 dates are in the past!

brendan-oconnell commented 2 months ago

My fault... I looked at my (out-of-date) development data dump, instead of the production database! It seems like the proportion of FORTHCOMING works with a publication_date vs. none from my figures still roughly holds, though.

I also did notice a lot of Forthcoming dates in the past...

brendan-oconnell commented 2 months ago

OK, so to sum up, this kind of gets to a tension between Thoth as a metadata management system vs. Thoth as a dissemination system.

As @rhigman notes above, publishers using Thoth as a metadata management system seem to want a kind of 'draft' record state, and they also seem to be currently using FORTHCOMING for this, as indicated by the relatively large number of FORTHCOMING works with no publication_date. If we create a catchall INACTIVE status as @ja573 has proposed, they could use this for 'drafts' of any kind, although the term "Inactive" has a different definition in ONIX Codes for Publishing Status: "The product was active, but is now permanently or indefinitely inactive in the sense that the publisher will not accept orders for it, though stock may still be available elsewhere in the supply chain." I'm not sure how important/well-known those ONIX Codes are to publishers?

So this would seem to be an argument for making publication_date optional for FORTHCOMING.

On the other hand, we want Thoth as a dissemination system to be able to disseminate successfully as much as possible, and not requiring FORTHCOMING works to have a publication_date would prevent some ONIX outputs, some of the time.

What's the best way to proceed with this decision? I have the least experience and domain-specific knowledge of anyone on this project, so I don't want to make the decision myself :) Do we need to discuss at a future Thoth meeting? In any case, it seems like this question of how publishers are creating 'drafts' in Thoth is worth digging into further...

ja573 commented 2 months ago

Based on those five statuses, the ideal usage would be that books start as FORTHCOMING and then follow:

graph TD;
    FORTHCOMING -->|Postponed Indefinitely| POSTPONED_INDEFINITELY;
    POSTPONED_INDEFINITELY -.->|Resumed| FORTHCOMING;
    FORTHCOMING -->|Cancelled| CANCELLED;
    FORTHCOMING -->|Published| ACTIVE;
    ACTIVE -.->|Withdrawn| INACTIVE;
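
The flow in the graph above can also be written as a lookup table. This is a minimal sketch assuming the five proposed statuses; the thread describes this as ideal usage and does not say the transitions would be enforced, so `ALLOWED_TRANSITIONS` and `can_transition` are illustrative names only.

```python
# Each status maps to the statuses it may move to, following the graph above.
ALLOWED_TRANSITIONS = {
    "FORTHCOMING": {"POSTPONED_INDEFINITELY", "CANCELLED", "ACTIVE"},
    "POSTPONED_INDEFINITELY": {"FORTHCOMING"},
    "ACTIVE": {"INACTIVE"},
    "CANCELLED": set(),   # no outgoing edge in the graph
    "INACTIVE": set(),    # no outgoing edge in the graph
}


def can_transition(current: str, new: str) -> bool:
    """Return True if the graph has an edge from `current` to `new`."""
    return new in ALLOWED_TRANSITIONS.get(current, set())
```

For example, `can_transition("FORTHCOMING", "ACTIVE")` is `True`, while `can_transition("ACTIVE", "FORTHCOMING")` is `False`.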

ja573 commented 2 months ago

Then, if we agree on reducing the status to just those 5, we need to look at what constraints ONIX has between these statuses and other fields and implement them accordingly

amandasramalho commented 2 months ago

My view on this topic: since 2020 at SciELO Books we have started working with books that will be released. This means that the publication date is usually in the future and ONIX is sent in advance to Kobo, Amazon and Google so that the book is available as a ‘pre-release’. As a result, the book is listed in the catalogues, but the files are only released on the day specified as the publication date. This also means that the entire set of metadata is prepared beforehand, but without the release date. The date is entered when it is set by the publisher and then the metadata is exported in ONIX.

rupertgatti commented 2 months ago

Publishers using Thoth as a metadata management tool will have statuses FORTHCOMING (publication date NOT known) and FORTHCOMING (publication date known), which are still not resolved in that flow @ja573, and so the basic issue remains! If a publication date is 'required' for FORTHCOMING then publishers will be forced to input a made-up date - and if ONIX is then successfully distributed, 'false' data is entered into various distribution systems (as well as Thoth). In addition, it is unlikely that publishers will check whether the inputted date has passed, again causing issues if distributed. So - if publication date is enforced for FORTHCOMING, then I think we need a different name for a status where the publication date has not been determined. And if FORTHCOMING does not require a publication date, we need something which flags that the ONIX is not well formatted (as it is missing data) and/or prevents distribution of ONIX files to platforms that require a publication date. Presumably we will need to have a flag/hold when trying to distribute a Forthcoming work with a past publication date in any case - so I guess I would prefer to add a check for the existence of a publication date at the same point rather than create a new work status.

ja573 commented 2 months ago

The original idea was to require the publication date, but after the discussion it was clear that we should not be doing that, and should just leave it to the ONIX output to complain about it not being set.

Those who choose to enter a publication date for forthcoming titles (which is already possible) would need to check that the date is to some extent accurate, as we don't currently have any mechanisms to check the veracity of data that's input. But because this date is meant to be an estimate anyway, I don't think it'll be a problem if it's not completely accurate.

At some point we could write notifications to publishers informing them of forthcoming books with dates in the past, though
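As a rough illustration of that notification idea, the works to flag could be selected like this. The row shape `(work_id, work_status, publication_date)` and the function name `stale_forthcoming` are assumptions made for the example, not Thoth's schema or API.

```python
from datetime import date
from typing import Iterable, List, Optional, Tuple

# Hypothetical row shape: (work_id, work_status, publication_date)
WorkRow = Tuple[str, str, Optional[date]]


def stale_forthcoming(works: Iterable[WorkRow], today: date) -> List[str]:
    """Return IDs of FORTHCOMING works whose publication_date is before `today`."""
    return [
        work_id
        for work_id, status, publication_date in works
        if status == "FORTHCOMING"
        and publication_date is not None
        and publication_date < today
    ]
```

Works with no publication_date are deliberately left out, since an unset date is allowed for FORTHCOMING and is not in itself stale.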

tosteiner commented 2 months ago

Apologies, this may be slightly adjacent to the core discussion here - if I understood things correctly, we are also considering making our metadata Crossmark-compliant (see also #582) ... Now, with regard to updates to Work Status, Crossmark's categorisation of 12 different changes to a given Work Status might be relevant here as well (if we were to implement those): https://www.crossref.org/documentation/crossmark/participating-in-crossmark/#00279

ja573 commented 1 month ago

graph TD;
    FORTHCOMING -->|Postponed Indefinitely| POSTPONED_INDEFINITELY;
    POSTPONED_INDEFINITELY -.->|Resumed| FORTHCOMING;
    FORTHCOMING -->|Cancelled| CANCELLED;
    FORTHCOMING -->|Published| ACTIVE;
    ACTIVE -.->|Require removal| WITHDRAWN;
    ACTIVE -.->|New edition| SUPERSEDED;

brendan-oconnell commented 1 month ago

@ja573 Do you think publication_date should be required for WITHDRAWN and SUPERSEDED works? On the one hand, withdrawn_date will be required for these work_status, and we only need one date for Crossmark (the date of the update, whether it be a withdrawal, new edition, etc.). So for Crossmark purposes, it's not essential.

On the other hand, these are works that, according to the workflow you outline in your diagram, should have passed through an ACTIVE state and have been published at some point, which would mean they would need to have a publication_date when they're ACTIVE. This would support requiring publication_date, because it should (theoretically) always be present.

On the other, other hand though, if publishers are adding back catalog titles to Thoth, and they want to add works that have already been withdrawn or superseded in their catalog, perhaps they might not know the publication date... which would support not requiring it, to avoid them introducing false metadata into Thoth. And I know the general philosophy has been to keep required fields to a minimum.

What do you think?

brendan-oconnell commented 1 month ago

This was discussed in a team meeting, and we decided to make publication_date required for WITHDRAWN and SUPERSEDED works
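
Putting the outcomes of this thread together, the date rules could be sketched as below. It assumes publication_date is required for ACTIVE, WITHDRAWN and SUPERSEDED, withdrawn_date is required for WITHDRAWN and SUPERSEDED, and FORTHCOMING may omit both; the `Work` class and `validate_work` function are illustrative stand-ins, not Thoth's actual model or validation code.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Assumed from the discussion above; not taken from Thoth's source.
REQUIRES_PUBLICATION_DATE = {"ACTIVE", "WITHDRAWN", "SUPERSEDED"}
REQUIRES_WITHDRAWN_DATE = {"WITHDRAWN", "SUPERSEDED"}


@dataclass
class Work:
    work_status: str
    publication_date: Optional[date] = None
    withdrawn_date: Optional[date] = None


def validate_work(work: Work) -> List[str]:
    """Return a list of problems; an empty list means the work passes."""
    errors = []
    if work.work_status in REQUIRES_PUBLICATION_DATE and work.publication_date is None:
        errors.append(f"publication_date is required when work_status is {work.work_status}")
    if work.work_status in REQUIRES_WITHDRAWN_DATE and work.withdrawn_date is None:
        errors.append(f"withdrawn_date is required when work_status is {work.work_status}")
    return errors
```

Under these assumptions a FORTHCOMING work with no dates passes, an ACTIVE work with no publication_date yields one error, and a WITHDRAWN work with neither date yields two.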