open-contracting / kingfisher-collect

Downloads OCDS data and stores it on disk
https://kingfisher-collect.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Establish norm for spider names #797

Closed jpmckinney closed 2 years ago

jpmckinney commented 3 years ago

Especially once the registry is deployed, it will be difficult to change things.

jpmckinney commented 3 years ago

From #655

Noting that we could rename all the digiwhist spiders as part of this.

jpmckinney commented 3 years ago

@yolile My proposed guidance:

Lowercase and join the components below with underscores. Replace any spaces with underscores.

For a jurisdiction-specific publication:

-  Country name. Do not use acronyms, like "uk". If in doubt, follow `ISO 3166-1 <https://en.wikipedia.org/wiki/ISO_3166-1>`__. For example: Kyrgyzstan, not Kyrgyz Republic. For a non-country like the European Union, use the relevant geography, like "europe".
-  Subdivision name. Do not use acronyms, like "nsw". Omit the subdivision type, like "state", unless it is typically included, like in Nigeria. If in doubt, follow `ISO 3166-2 <https://en.wikipedia.org/wiki/ISO_3166-2>`__.
-  System name, if needed. Acronyms are allowed, like "agetic".
-  Publisher name, if needed. Required if the publisher is not a government.
-  Disambiguator, if needed. For example: "historical".
-  Access method, if needed: "bulk" or "api".
-  OCDS format, if needed: "releases", "records", "release packages" or "record packages".

For a multi-jurisdiction publication:

-  Organization name
-  Disambiguator

If you create a new base class, omit the components that are not shared, and add "base" to the end. For example, the ``afghanistan_packages_base.py`` file contains the base class for the ``afghanistan_record_packages`` and ``afghanistan_release_packages`` spiders.
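
As an illustration of the guidance above, here is a minimal Python sketch of how the components could be combined into a spider name. The `spider_name` helper is hypothetical (it is not part of Kingfisher Collect), and it assumes the caller passes the components in the order listed above, leaving out or passing an empty value for any that are not needed.

```python
import re


def spider_name(*components):
    """Join the non-empty components into a lowercase, underscore-separated name."""
    parts = []
    for component in components:
        if not component:
            continue  # "if needed" components may be omitted
        # Lowercase, then replace each run of whitespace with one underscore.
        parts.append(re.sub(r"\s+", "_", component.strip().lower()))
    return "_".join(parts)


print(spider_name("Australia", "New South Wales"))
print(spider_name("United Kingdom", None, "Contracts Finder"))
print(spider_name("Afghanistan", "packages", "base"))
```

The last call mirrors the base class rule: the components that are not shared are left out, and "base" is added at the end.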

Based on the output of scrapy list, this means we'd need to change:

|  | Before | After |
|---|---|---|
| [X] | `australia_nsw` | `australia_new_south_wales` |
| [X] | `colombia` | `colombia_api` |
| [X] | `digiwhist_*` | `*_digiwhist` |
| [X] | `dominican_republic` | `dominican_republic_bulk` |
| [X] | `nigeria_cross_river_base` | `nigeria_cross_river_state_base` |
| [X] | `nigeria_cross_river_releases` | `nigeria_cross_river_state_releases` |
| [X] | `nigeria_cross_river_records` | `nigeria_cross_river_state_records` |
| [X] | `nigeria_kaduna_state_base` | `nigeria_kaduna_state_budeshi_base` |
| [X] | `nigeria_kaduna_state_records` | `nigeria_kaduna_state_budeshi_records` |
| [X] | `nigeria_kaduna_state_releases` | `nigeria_kaduna_state_budeshi_releases` |
| [X] | `portugal` | `portugal_bulk` |
| [X] | `uk_contracts_finder` | `united_kingdom_contracts_finder` |
| [X] | `uk_fts` | `united_kingdom_fts` |
| [X] | `uk_fts_test` | `united_kingdom_fts_test` |
| [X] | `mexico_infoem` | `mexico_mexico_infoem` |
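
A rename list like this one can be captured as a lookup plus one prefix rule for the digiwhist spiders. The following is only a sketch: `RENAMES` holds a few of the pairs from the list, `new_name` is a hypothetical helper, and `digiwhist_austria` is used purely as an illustrative input for the wildcard rule.

```python
# A subset of the before/after pairs listed above.
RENAMES = {
    "australia_nsw": "australia_new_south_wales",
    "colombia": "colombia_api",
    "portugal": "portugal_bulk",
    "uk_fts": "united_kingdom_fts",
}

DIGIWHIST_PREFIX = "digiwhist_"


def new_name(old):
    """Return the new spider name, or the old name if no rename applies."""
    if old in RENAMES:
        return RENAMES[old]
    if old.startswith(DIGIWHIST_PREFIX):
        # digiwhist_* becomes *_digiwhist
        return old[len(DIGIWHIST_PREFIX):] + "_digiwhist"
    return old


print(new_name("uk_fts"))
print(new_name("digiwhist_austria"))
print(new_name("openopps"))
```

Names with no entry are returned unchanged, so the helper can be applied to every spider name without special-casing.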

Do you agree? If so, I can make the change now and update the publications in the registry.

We then need to inform the helpdesk, and CDS for when they create new spiders.

yolile commented 3 years ago

Sounds good and consistent to me. Could you also update the documentation to include this convention as part of your changes (e.g. at https://kingfisher-collect.readthedocs.io/en/latest/contributing/index.html#write-a-spider)? We will also need to run the updatedocs command and add united_kingdom here: https://github.com/open-contracting/kingfisher-collect/blob/2e3161e86ad24aa30ebf19edcda56f8cf457972c/kingfisher_scrapy/commands/updatedocs.py#L20

yolile commented 3 years ago

Should georgia_opendata be renamed to georgia_bulk? And georgia_records and georgia_releases to georgia_api_records and georgia_api_releases? And, similarly, honduras_portal_records and honduras_portal_releases to honduras_portal_api_records and honduras_portal_api_releases?

And nepal_portal to nepal_ppip? And nigeria_portal to nigeria_nocopo? And openopps to ? And I guess we should rename chile_compras_ to just chile_, and similarly for Peru, from peru_compras to peru or to peru_peru_compras. And pakistan_ppra_releases to pakistan_ppra_api. And uganda_releases to uganda.

And I guess moldova_old is an exception.

jpmckinney commented 3 years ago

Yes, the above is RST so I can paste it in easily :) I've prepared updatedocs locally as well. I'll make a PR.

georgia_opendata is a different website/data source than georgia_records and georgia_releases, so I think they are fine as-is.

There is typically only one bulk format, even if there are two API formats. If we do this, we'd need to also change chile_compra and pakistan_ppra. I guess it is more consistent, and it's just 4 more spiders. What do you think?

jpmckinney commented 3 years ago

Noting that I should check whether any logic depends on the spider name in the data registry (maybe the wiper or exporter?).

Update:

Related: https://github.com/open-contracting/data-registry/issues/154

yolile commented 3 years ago

@jpmckinney Oops, I was updating my comment. See the updated list of possible changes now.

> georgia_opendata is a different website/data source than georgia_records and georgia_releases, so I think they are fine as-is.

That is the case for Portugal too (one website for the bulk download and another for the API).

jpmckinney commented 3 years ago

| Suggestion | Comment |
|---|---|
| georgia_opendata to georgia_bulk, georgia_records to georgia_api_records, georgia_releases to georgia_api_releases | https://odapi.spa.ge and http://opendata.spa.ge are distinct implementations per CRM-7092. They aren't access methods to the same implementation. georgia_opendata isn't used in the registry and might be deleted eventually. With Portugal, they seem to be the same implementation, even if the websites are different. |
| honduras_portal_records to honduras_portal_api_records, honduras_portal_releases to honduras_portal_api_releases | OK, and same for chile_compra_releases and chile_compra_records |
| nepal_portal to nepal_ppip | Wouldn't it be nepal_ppmo? Or we can go with nepal if we're renaming it either way. |
| nigeria_portal to nigeria_nocopo | No change. "portal" is already in "Nigeria Open Contracting Portal". |
| openopps to ? | No change. Follows the rule for "multi-jurisdiction publication". |
| chilecompra to chile_, peru_compras to peru or peru_peru_compras | "compra" and "compras" aren't required for disambiguation, but we don't need to be minimal. I think the repetition of "peru" is too weird. I'll add "If a component repeats another, you can omit or abbreviate the component." |
| pakistan_ppra_releases to pakistan_ppra_api | OK |
| uganda_releases to uganda | "releases" is not required for disambiguation, but we don't need to be minimal. Records endpoints are documented, but they don't work. |
| moldova_old is an exception | "old" is a disambiguator. That said, we can maybe rename moldova to moldova_mtender. In any case, moldova_old isn't used in the registry and might be deleted eventually. |

yolile commented 3 years ago

> Wouldn't it be nepal_ppmo?

The site's domain (http://ppip.gov.np/) is ppip for "Public Procurement Transparency Initiative in Nepal", although the publisher is PPMO, so either way is fine with me.

> chilecompra to chile_, peru_compras to peru or peru_peru_compras

Thinking again, maybe it is better to just leave them as they are, as I know that another national-level publisher is thinking of implementing OCDS in Peru, and in Chile too.

jpmckinney commented 3 years ago

PPTIN is used on the website. I don’t know why PPIP is in the URL. It’s not defined anywhere.

jpmckinney commented 3 years ago

This probably won't happen before the launch of the registry. Ideally, what we should do before launch is resolve the related data registry issue (https://github.com/open-contracting/data-registry/issues/154). That way, when we change spider names later, it will not break data URLs for users. We will just need to update all publications to use the new spider names. (We can freeze the publications before deploying Kingfisher Collect, so that none of them tries to collect data from a non-existent spider.)
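
The check described above (no publication should point at a non-existent spider after a rename) could look roughly like the Python sketch below. Everything here is assumed for illustration: the registry's actual data model is not described in this thread, so the `title` and `spider` keys, the `stale_publications` helper, and the sample data are hypothetical.

```python
def stale_publications(publications, available_spiders, renames):
    """Return (title, suggested spider name) pairs for publications
    whose spider is not among the available spiders."""
    stale = []
    for publication in publications:
        name = publication["spider"]
        if name not in available_spiders:
            # The suggestion is None if the rename mapping has no entry.
            stale.append((publication["title"], renames.get(name)))
    return stale


# Hypothetical sample data.
publications = [
    {"title": "UK FTS", "spider": "uk_fts"},
    {"title": "Open Opps", "spider": "openopps"},
]
available_spiders = {"united_kingdom_fts", "openopps"}
renames = {"uk_fts": "united_kingdom_fts"}

for title, suggestion in stale_publications(publications, available_spiders, renames):
    print(f"{title}: update spider to {suggestion}")
```

Running a check like this while the publications are frozen would list the ones that still need the new spider name before collection resumes.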