open-contracting / standard

Documentation of the Open Contracting Data Standard (OCDS)
http://standard.open-contracting.org/

tender.title: Recommend a length limit #1104

Closed tdavis9 closed 2 years ago

tdavis9 commented 3 years ago

We found when trying to use UK OCDS data that some titles were so long (mostly because they listed information such as organizations in the title) that charting and exporting them essentially broke our system, and we had to set our own limits and truncate the titles at a certain point. Specifying how long a tender title (and other similar fields) can be would help make the data more usable.

jpmckinney commented 3 years ago

Thanks, @tdavis9. What would you recommend as an appropriate maximum length?

For 1.2, we can't add a validation rule (which would be backwards-incompatible), but we can add a recommendation in the field's description for an appropriate length.

tdavis9 commented 3 years ago

@jpmckinney We would recommend a maximum of 150 to 200 characters. Understood that this would need to be a recommendation. Here is an example of one of the problematic titles:

The framework agreement may be used by INTRAN members namely, Age UK Norwich, Big C, Benjamin Foundation,Breckland District Council, Broadland District Council, Broadland Housing Association, Cambridgeshire CRC, EACH, East Coast Community Healthcare, Equal Lives, Essex County Council, Essex CRC, Flagship Housing, Forest Heath District Council, St Edmundsbury Borough Council, Freebridge Community Housing, Great Yarmouth Borough Council, Healthwatch NorflkHrtfordshire County Council, James Page University Hospitals NHS Foundation Trust, King's Lynn and West Norfolk Borough Council, Kings Lynn Area Resettlement Support (KLARS), Leeway Domestic Abuse Services, Lighthouse Women's Aid, Magdalen Group, Mancroft Advice Project (MAP), Marie Stopes, Matthew Project, Mid-Norfolk CAB, Mistura Informatics — Choice and Medication, Mundesley Hospital, NHS England, NHS Great Yarmouth and Waveney CCG, NHS North Norfolk CCG, NHS Norwich CCG, NHS South Norfolk CCG, NHS West Norfolk CCG, Norfolk & Norwich University Hospital, Norfolk and Suffolk NHS Foundation Trust, Norfolk and Suffolk Probation CRC, Norfolk CAB, Norfolk Community Health and Care, Norfolk Community Law Service, Norfolk Constabulary, North-Norfolk District Council, Norwich & Central Mind, Norwich City Council, Norwich Charitable Trusts, Norwich Consolidated Charities, Ormiston Victory Academy, Right for Success Academy, Saffron Housing Trust, South Norfolk Council, Sue Lambert Trust, Suffolk Constabulary, Suffolk County Council, City College Norwich, City Academy Norwich, Wayland Academy, Fakenham Academy, Attleborough Academy, The Queen Elizabeth Hospital Kings Lynn NHS Trust, Together for Well Being, Victory Housing Trust, West Suffolk Hospital NHS Trust, Wherry Housing Association, Norfolk

tdavis9 commented 3 years ago

For fields that are likely to be used in an index, 800 characters is the longest we've seen work decently.

jpmckinney commented 3 years ago

Indeed – putting all the buyers within a framework agreement in the tender title is not helpful :)

jpmckinney commented 3 years ago

Can we check a few collection summaries to get the 90th percentile title length? e.g.

SELECT PERCENTILE_CONT(0.9) WITHIN GROUP(ORDER BY LENGTH(tender_title)) FROM view_data_bi_tool_portugal.tender_summary;

Note that some publishers don't have tender titles.
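For readers without access to the database, the query above can be approximated in Python. This is a minimal sketch, assuming the titles are already loaded as a list in which missing titles are `None`; the function names are hypothetical, and the interpolation follows the usual definition of `PERCENTILE_CONT` (linear interpolation between neighbouring ranks).

```python
import math
from typing import Iterable, Optional


def percentile_cont(values: Iterable[float], fraction: float) -> float:
    """Continuous percentile: interpolate linearly between neighbouring ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values to summarise")
    # Zero-based fractional position of the requested percentile.
    position = fraction * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight


def title_length_percentile(titles: Iterable[Optional[str]], fraction: float = 0.9) -> float:
    """Percentile of title length, ignoring missing titles as SQL ignores NULLs."""
    lengths = [len(title) for title in titles if title is not None]
    return percentile_cont(lengths, fraction)
```

A dataset with no titles at all raises `ValueError` here, whereas the SQL aggregate would return NULL.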

Update: There's probably duplication of data sources among these schemas, but the range is 15-301 characters for the 90th percentile. Among these numbers, the average is 98, the median is 92, and the 90th percentile is 151.

If we choose a limit of 150, then about 90% of datasets will have 90-100% of their titles with fewer than 150 characters. And 10% of datasets will have 10% or more of their titles with more than 150 characters.
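The reasoning above can be checked per dataset. This is a minimal sketch, assuming each dataset is a list of titles keyed by a schema name; the function names and the `min_share` parameter are hypothetical, and a title of exactly `limit` characters is counted as within the limit.

```python
from typing import Dict, List, Optional


def share_within_limit(titles: List[Optional[str]], limit: int = 150) -> float:
    """Fraction of non-missing titles whose length is at most `limit`."""
    lengths = [len(title) for title in titles if title is not None]
    if not lengths:
        raise ValueError("dataset has no tender titles")
    return sum(1 for n in lengths if n <= limit) / len(lengths)


def datasets_mostly_within_limit(
    datasets: Dict[str, List[Optional[str]]],
    limit: int = 150,
    min_share: float = 0.9,
) -> List[str]:
    """Names of datasets where at least `min_share` of titles fit the limit."""
    return [
        name
        for name, titles in datasets.items()
        # Datasets without any titles are skipped rather than counted.
        if any(title is not None for title in titles)
        and share_within_limit(titles, limit) >= min_share
    ]
```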

| 90th percentile | schema |
| - | - |
| 15 | view_data_bi_tool_moldova |
| 15 | view_data_collection_1497_1498_1499 |
| 15 | view_data_collection_1509_1510_1511 |
| 15 | view_data_collection_1551_1552_1553 |
| 15 | view_data_collection_1563_1564_1565 |
| 15 | view_data_collection_1593_1594_1595 |
| 15 | view_data_collection_1596_1597_1598 |
| 15 | view_data_collection_1611_1612_1613 |
| 23 | view_data_collection_1325 |
| 23 | view_data_latam_blog_honduras_oncae_sefin |
| 24 | view_data_colombia |
| 24 | view_data_latam_blog_colombia |
| 27 | view_data_collection_1349 |
| 28 | view_data_latam_blog_uruguay |
| 39 | view_data_taiwan_learning_man |
| 42 | view_data_collection_1722_1724 |
| 56.6 | view_data_nigeria_portal_MEL_1538 |
| 60 | view_data_colombia_2020_202103 |
| 75 | view_data_collection_1602 |
| 75 | view_data_collection_1604 |
| 75 | view_data_collection_1697 |
| 77 | view_data_latam_blog_chile |
| 80 | view_data_collection_2013_2021 |
| 80 | view_data_latam_blog_mexico_administracion_publica_federal |
| 80 | view_data_mexico_apf |
| 82 | view_data_collection_1712 |
| 85 | view_data_2020_10_scotland_review |
| 85 | view_data_collection_1493 |
| 87 | view_data_collection_1685 |
| 91 | view_data_collection_1620_1622 |
| 91 | view_data_collection_1643_1645 |
| 91 | view_data_collection_1730_1779 |
| 91 | view_data_collection_1883_1885 |
| 91 | view_data_collection_731 |
| 92 | view_data_collection_1601 |
| 92 | view_data_collection_1899 |
| 92 | view_data_collection_1914 |
| 92 | view_data_uk_cf_fts |
| 93 | view_data_collection_1782 |
| 93 | view_data_collection_1920_1922 |
| 93 | view_data_latam_blog_buenos_aires |
| 94 | view_data_collection_1728_1730 |
| 99 | view_data_icw_2011_2012_mel1 |
| 100 | view_data_view_1_2_research |
| 105 | view_data_latam_blog_paraguay_dncp |
| 114 | view_data_collection_2009_2010 |
| 115 | view_data_budeshi_update |
| 116.2 | view_data_collection_1720_1721 |
| 121 | view_data_latam_blog_paraguay_hacienda |
| 123.9 | view_data_collection_1520 |
| 124 | view_data_collection_1452_1454 |
| 125 | view_data_collection_1226 |
| 125 | view_data_collection_1718 |
| 126 | view_data_collection_1518_1519_1520 |
| 126 | view_data_latam_blog_dominican_republic |
| 136 | view_data_costa_rica |
| 136 | view_data_latam_blog_costa_rica_poder_judicial |
| 137.5 | view_data_collection_1992_1993 |
| 138 | view_data_bi_tool_portugal |
| 138 | view_data_collection_1530_1531_1532 |
| 138 | view_data_collection_1532 |
| 138 | view_data_poder_judicial_costa_rica |
| 142.2 | view_data_latam_blog_mexico_nuevo_leon |
| 144 | view_data_latam_blog_bolivia_agetic |
| 145 | view_data_collection_1636_1638_1639 |
| 145 | view_data_collection_1770_1772_1773 |
| 145 | view_data_collection_1942_1943_1944 |
| 146 | view_data_collection_1962 |
| 149.1 | view_data_collection_1617_1618_1619 |
| 151 | view_data_collection_1659_1660_1661 |
| 158 | view_data_zaragoza |
| 210 | view_data_collection_1473_1474_1475 |
| 210 | view_data_collection_2026_2027 |
| 219 | view_data_collection_1560_1561_1562 |
| 219 | view_data_ondata |
| 236.2 | view_data_latam_blog_mexico_inai |
| 301 | view_data_latam_blog_argentina_vialidad |

Ravf95 commented 2 years ago

With this update, the 90th percentile is 148.1. We have 33 publishers with tender titles and 86 publishers where the value is empty or missing.

I think we can choose the limit between 160 and 200 characters.

Details about other collections:

| 90th percentile | schema |
| -- | -- |
| 12.0 | view_data_collection_2058_2059 |
| 27.0 | view_data_collection_2135_2136 |
| 29.0 | view_data_uruguay_historical |
| 31.0 | view_data_collection_2139_2140 |
| 60.0 | view_data_collection_2033 |
| 77.8 | view_data_mexico_idaip |
| 80.0 | view_data_mexico_administracion_publica_federal |
| 81.0 | view_data_mexico_yucatan_inaip |
| 86.0 | view_data_africa_checks |
| 86.0 | view_data_collection_2144_2145 |
| 89.0 | view_data_collection_2211_2212 |
| 90.5 | view_data_collection_2050_2051 |
| 91.0 | view_data_collection_2099_2100 |
| 92.0 | view_data_collection_1599_1600_1601 |
| 92.0 | view_data_collection_1600_1601 |
| 92.0 | view_data_collection_2062_2063 |
| 93.0 | view_data_collection_2097_2098 |
| 95.0 | view_data_collection_2188_2189_2190_2191 |
| 98.0 | view_data_collection_2115_2116 |
| 99.0 | view_data_collection_2052_2053 |
| 100.0 | view_data_nigeria_2182_3 |
| 109.0 | view_data_colombia_2214 |
| 115.0 | view_data_collection_2146_2147_2148 |
| 122.0 | view_data_collection_2075_2077 |
| 123.0 | view_data_collection_2201_2202 |
| 129.0 | view_data_collection_2150_2151_2152 |
| 133.0 | view_data_collection_2108_2109 |
| 140.6 | view_data_paraguay_covid |
| 145.0 | view_data_collection_2086_2087 |
| 148.1 | view_data_collection_2125_2126 |
| 151.0 | view_data_collection_2160_2161 |
| 158.0 | view_data_collection_2158_2159 |
| 158.0 | view_data_zaragoza_review |

Ravf95 commented 2 years ago

Another recommendation we can add, which would be very helpful for publishers, is that information about the organizations should go in parties rather than being listed in tender.title.

Ravf95 commented 2 years ago

@jpmckinney @yolile What do you think of the suggestions?

jpmckinney commented 2 years ago

Let's go with 150.
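Since the limit is a recommendation and not a validation rule, a data user might surface it as a warning. This is a minimal sketch, assuming a release is already parsed into a Python dict with the title at `tender.title`; the constant and function name are hypothetical and not part of any OCDS tooling.

```python
from typing import Optional

# The figure settled on in this thread; a recommendation, not a schema constraint.
RECOMMENDED_MAX_TITLE_LENGTH = 150


def long_title_warning(
    release: dict, limit: int = RECOMMENDED_MAX_TITLE_LENGTH
) -> Optional[str]:
    """Return a warning message if tender.title exceeds the limit, else None."""
    title = (release.get("tender") or {}).get("title")
    # Missing or non-string titles are not this check's concern.
    if not isinstance(title, str) or len(title) <= limit:
        return None
    return f"tender.title is {len(title)} characters; recommended maximum is {limit}"
```

Returning a message instead of raising keeps long titles usable while still flagging them to the publisher.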