open-contracting / data-registry

BSD 3-Clause "New" or "Revised" License

Include the publication policy link on the dataset site #192

Open yolile opened 2 years ago

yolile commented 2 years ago

Some publication policies are good and contain valuable information for data users. It would be helpful to include a "Publication Policy" field on the dataset site. We can auto-populate the field from the Pelican data, or set it manually if the publisher does not publish the publication policy link correctly in their JSON.

@sabahfromlondon what do you think?

jpmckinney commented 2 years ago

@yolile Can you share an example with Sabah, in case she hasn't seen a good one before?

yolile commented 2 years ago

I have some examples in Spanish, but not many in English. The UK's is probably a good one: https://www.gov.uk/government/publications/open-contracting and so is Zambia's: https://www.zppa.org.zm/ocds-publication-policy

In Spanish some examples are:

yolile commented 2 years ago

Another one in English http://dppib-crsgov.org/publicationpolicy.html

yolile commented 2 years ago

All of INAI's publications have a good publication policy. More examples: http://ceaipsinaloa.ddns.net:4000/contratacionesabiertas/politicadepublicacion https://dashboard.infocdmx.org.mx/contratacionesabiertas/politicadepublicacion etc

And this is another example from Honduras https://portalunico.iaip.gob.hn/datosabierto/docs/Pol%C3%ADtica%20de%20publicaci%C3%B3n%20-%20IAIP%20Datos%20Abiertos%20OCDS.pdf

jpmckinney commented 2 years ago

+1 from Sabah via Slack

jpmckinney commented 1 year ago

We'd need to update update_collection_metadata to either:

  1. Set a new publication_policy field. Manually set the publication policy for existing collections.
  2. Store the entire response from Pelican in a new JSONField. Manually set the publication policy for existing collections (e.g. by constructing the JSON field's values from the license, etc.).

(2) means that, if at a later date we decide to add another metadata field from Pelican, we won't need to do any manual work.
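
A minimal sketch of option (2), written in plain Python rather than Django so that it runs standalone. The `Collection` class, its attribute names and both helper functions are hypothetical stand-ins, not the registry's actual code. The JSON keys are assumed to match the names Pelican uses in its response.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Collection:
    """Hypothetical stand-in for the registry's collection model."""

    ocid_prefix: str = ""
    license: str = ""
    # Option (2): keep the whole Pelican response (a JSONField in Django).
    extracted_metadata: dict[str, Any] = field(default_factory=dict)


def store_pelican_response(collection: Collection, response: dict[str, Any]) -> None:
    """Save the full response, so that new Pelican fields need no schema change."""
    collection.extracted_metadata = dict(response)


def backfill_extracted_metadata(
    collection: Collection, publication_policy: Optional[str] = None
) -> None:
    """Build the JSON value for an existing collection from its current columns."""
    collection.extracted_metadata = {
        "ocid_prefix": collection.ocid_prefix,
        "data_license": collection.license,
        # Set manually, since existing collections never stored it.
        "publication_policy": publication_policy,
    }
```

With this shape, a metadata field added to Pelican later is already present in `extracted_metadata` the next time the response is stored, which is the advantage of (2) over (1).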

yolile commented 1 year ago

Let's do (2) then. How should we name this new field? metadata, pelican_response, pelican_metadata, other?

For the record, this is a sample of what Pelican currently returns:

{
  "url": "The URL where the data can be downloaded isn't presently available.",
  "publisher": "Instituto Duranguense de Acceso a la Información Pública y de Protección de Datos Personales",
  "extensions": [],
  "ocid_prefix": "ocds-ywf11i",
  "data_license": "https://datos.gob.mx/libreusomx",
  "published_to": "2021-11-03 19.51.47",
  "published_from": "2021-08-04 18.06.41",
  "publication_policy": null
}

yolile commented 1 year ago

Note that the only fields from Pelican that we are not storing yet are extensions and publication_policy, but I don't know whether we will add more fields to Pelican itself in the future. I'm happy with (2), but I'm not sure if we should remove all the other existing columns (ocid_prefix, date_from, date_to, license) and use the new JSON column instead, to be consistent.

yolile commented 1 year ago

And, we want the publication_policy to be editable, right? From https://github.com/open-contracting/data-registry/issues/256:

I think this field is currently a property of the job (which makes sense). For the override, we would put it on the publication itself (and we'll need to remember to change the end date periodically).

Should we also put this field (metadata or publication_policy) in the publication/collection form?

jpmckinney commented 1 year ago

Let's call it extracted_metadata.

I'm not sure if we should remove all the other existing columns (ocid_prefix, date_from, date_to, license)

In #256 we want date_to/date_from to be overridable. So we can keep those. I think OCID prefix will always be correct, so we can remove that one.

Similarly, some publishers don't put a license in the package metadata, but they do include one in their docs, so we can leave license_custom to be overridden. (We can rename it to license and set db_column="license_custom", so that at least within the code the name of the field is consistent.)

Should we also put this field (metadata or publication_policy) in the publication/collection form?

Let's save extracted_metadata for what's automatically extracted (read only), and then we can add publication_policy for the override. That way, the editable field will render automatically in the Django admin (not sure if there are good packages for editing structured JSON fields). We can display the extracted_metadata similar to how the Job context is rendered, so that admins can see whether they want to override something.

We can add a method to the model that returns the original metadata with any overrides applied, as a new dict. That way, view code can just call instance.metadata.license and render the license, without worrying about whether it is the original value or not.
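
A minimal sketch of that method, in plain Python so that it runs without Django. The `Collection` class here is a hypothetical stand-in for the model, and the `data_license` key and the "empty string means no override" rule are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Collection:
    """Hypothetical stand-in for the Django model discussed in this thread."""

    # Read-only: whatever was extracted automatically.
    extracted_metadata: dict[str, Any] = field(default_factory=dict)
    # Editable overrides; an empty string means "no override" (assumption).
    license: str = ""
    publication_policy: str = ""

    @property
    def metadata(self) -> dict[str, Any]:
        """Return a new dict: the extracted values with any overrides applied."""
        merged = dict(self.extracted_metadata)  # copy, so the original is untouched
        overrides = {
            "data_license": self.license,
            "publication_policy": self.publication_policy,
        }
        for key, value in overrides.items():
            if value:
                merged[key] = value
        return merged
```

Because the result is a dict, view code would read `instance.metadata["data_license"]`. Attribute access such as `instance.metadata.license` would need a different return type, for example a small dataclass.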

yolile commented 1 year ago

In https://github.com/open-contracting/data-registry/issues/256 we want date_to/date_from to be overridable. So we can keep those

Hmm, actually, I think we may want the publication policy to be overridable too. I can't remember a specific case, but I'm pretty sure there are cases where a publication policy exists but is not referenced in the package metadata.

jpmckinney commented 1 year ago

Yes, we also want publication_policy to be overridable.

and then we can add publication_policy for the override

yolile commented 1 year ago

Oh, true, sorry, I commented before I finished reading 😸 All good then, I will implement https://github.com/open-contracting/data-registry/issues/192#issuecomment-1326694504

yolile commented 1 month ago

Should we implement:

https://github.com/open-contracting/data-registry/issues/291#issuecomment-2045960744

instead of still using Pelican's metadata? Or should we mark this as blocked until #291 is done?

jpmckinney commented 1 month ago

We can do this without doing #291. Doing this is very easy. Doing #291 is much more work.

Edit: Ah, you mean doing the following to resolve this issue:

From Pelican we get field counts and also some collection metadata. We can get the latter via an HTTP request to Kingfisher Process in the Process task's get_status method (once is_last_completed is true): https://github.com/open-contracting/kingfisher-process/issues/421

Sure, we can get the data from Process instead.
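
A rough sketch of that flow, under stated assumptions: the function name, the `status` dict shape, the injected `fetch_json` callable and the `metadata_url` argument are all hypothetical. The actual Kingfisher Process endpoint is the subject of the linked issue and is not assumed here.

```python
from typing import Any, Callable, Optional


def get_collection_metadata(
    status: dict[str, Any],
    fetch_json: Callable[[str], dict[str, Any]],
    metadata_url: str,
) -> Optional[dict[str, Any]]:
    """Fetch collection metadata only once the last collection has completed."""
    if not status.get("is_last_completed"):
        return None  # still processing; try again on the next status check
    # fetch_json stands in for the HTTP request to Kingfisher Process.
    return fetch_json(metadata_url)
```

Injecting `fetch_json` keeps the sketch independent of any particular HTTP client and makes the "only after completion" rule easy to test.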