odia / buenosairescompras

GNU Affero General Public License v3.0

Use scraped data instead of CSV data? #3

Closed seppo0010 closed 2 years ago

seppo0010 commented 2 years ago

Sometimes, quite often actually, the data in the downloaded CSV and the data on the page are inconsistent. Take this row:

(Columns not listed are empty in this row.)

ocid: ocds-bulbcf-425-0784-CDI18
id: 425-0784-CDI18-2021-07-27
date: 2018-07-20T09:00:00-03:00
initiationType: tender
tag: tender
tender/id: 419-2526-CME18
tender/title: Higiene urbana-
tender/description: SERVICIO DE RECOLECCIÓN RESIDUOS PELIGROSOS
tender/status: complete
tender/procuringEntity/id: CABA-UE-425
tender/value/currency: ARS
tender/value/amount: 75600.0
tender/procuringEntity/name: HTAL. JOSE M. PENNA
tender/procurementMethod: direct
tender/procurementMethodDetails: CONTRATACION DIRECTA
tender/mainProcurementCategory: Salud
tender/tenderPeriod/startDate: 2018-07-20T10:00:00-03:00
tender/tenderPeriod/endDate: 2018-07-30T10:00:00-03:00
tender/tenderPeriod/durationInDays: 10.0
tender/enquiryPeriod/startDate: 2018-07-20T10:00:00-03:00
tender/enquiryPeriod/endDate: 2018-07-25T10:00:00-03:00
tender/enquiryPeriod/durationInDays: 5.0
language: es
tender/items/0/id: 33.14.002.0005.578-0
tender/items/0/description: SERVICIO DE REPARACION INTEGRAL DE TRANSDUCTOR PARA MONITOR DE LATIDOS FETALES Marca: Bistos Modelo: Marca BISTOS, modelo UC.PROBE, para monitor BISTOS, modelo BT-300, el servicio incluye mano de obra e insumos para su correcto funcionamiento Variedad: Transductor para monitor de latidos fetales
tender/items/0/quantity: 8.0
tender/items/0/unit/name: UNIDADES
tender/items/0/unit/scheme: x_unidades_medida_bac
tender/items/0/classification/scheme: x_catalogo_bienes_servicios_bac
tender/items/0/classification/id: 33.14.002.0005.578
tender/items/0/unit/value/amount: 1500.0
tender/items/0/unit/value/currency: ARS
tender/documents/0/id: BQoBkoMoEhx991Pc/RCO|L60g0OlCCrrEhBwHiyf1NQXpR/xmckEvMFErRJ06JhO
tender/documents/0/documentType: tenderNotice
tender/documents/0/url: https://www.buenosairescompras.gob.ar//PLIEGO/VistaPreviaPliegoCiudadano.aspx?qs=BQoBkoMoEhx991Pc/RCO%7CL60g0OlCCrrEhBwHiyf1NQXpR/xmckEvMFErRJ06JhO
tender/documents/0/datePublished: 2018-07-20T09:00:00-03:00
tender/documents/0/language: es

The CSV description is "SERVICIO DE RECOLECCIÓN RESIDUOS PELIGROSOS". However, the page at that URL shows the description "SERVICIO DE REPARACIÓN INTEGRAL DE TRANSDUCTOR PARA MONITOR DE LATIDOS FETALES", which is completely unrelated. The amount is also completely wrong. It seems the URL belongs to a totally different process. Maybe this one?

This should not happen, but it does, and we need to decide how to handle it. I believe the scraped data should take precedence over the CSV data, as that would give the user a consistent experience.
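
To make the proposal concrete, here is a minimal sketch of what "scraped data takes precedence" could look like. It assumes both sources are already available as flat dicts keyed by the CSV column names; `merge_record` and the `scraped` dict are hypothetical and not part of this repo. Falling back to the CSV value when the page did not expose a field is one possible choice, not a settled one.

```python
from typing import Mapping, Optional


def merge_record(csv_row: Mapping[str, str],
                 scraped: Mapping[str, Optional[str]]) -> dict:
    """Return one record in which scraped values override CSV values.

    A scraped value of None or "" is treated as "the page did not expose
    this field", so the CSV value is kept as a fallback.
    """
    merged = dict(csv_row)
    for field, value in scraped.items():
        if value is not None and value.strip() != "":
            merged[field] = value.strip()
    return merged


# Values taken from the row above.
csv_row = {
    "tender/description": "SERVICIO DE RECOLECCIÓN RESIDUOS PELIGROSOS",
    "tender/value/amount": "75600.0",
    "tender/procuringEntity/name": "HTAL. JOSE M. PENNA",
}
# The description is the one shown on the page; the None entry is a
# hypothetical stand-in for a field the scraper could not parse.
scraped = {
    "tender/description": " SERVICIO DE REPARACIÓN INTEGRAL DE TRANSDUCTOR"
                          " PARA MONITOR DE LATIDOS FETALES ",
    "tender/value/amount": None,
}

merged = merge_record(csv_row, scraped)
```

With this choice the user always sees the page's description, while fields the scraper could not read still come from the CSV.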

xaiki commented 2 years ago

i guess we should write a scraper and document the differences. i'm not convinced the truth will be uniformly exposed. i'd look into kingfisher for this.
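
a rough sketch of what "document the differences" could mean in practice, assuming the CSV row and the scraped page are both available as flat dicts with the same keys. `normalize` and `diff_fields` are hypothetical helpers, not kingfisher functions. accents, case and whitespace are ignored so that only real mismatches get reported.

```python
import unicodedata
from typing import Dict, Mapping, Tuple


def normalize(text: str) -> str:
    """Strip accents, casefold and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(no_accents.casefold().split())


def diff_fields(csv_row: Mapping[str, str],
                scraped: Mapping[str, str]) -> Dict[str, Tuple[str, str]]:
    """Map each shared field that really differs to (csv_value, scraped_value)."""
    diffs = {}
    for field in csv_row.keys() & scraped.keys():
        if normalize(csv_row[field]) != normalize(scraped[field]):
            diffs[field] = (csv_row[field], scraped[field])
    return diffs


csv_row = {
    "tender/description": "SERVICIO DE RECOLECCIÓN RESIDUOS PELIGROSOS",
    "tender/procuringEntity/name": "HTAL. JOSE M. PENNA",
}
scraped = {
    # description as shown on the page, per the report above
    "tender/description": " SERVICIO DE REPARACIÓN INTEGRAL DE TRANSDUCTOR"
                          " PARA MONITOR DE LATIDOS FETALES ",
    # hypothetical value: differs from the CSV only in case and accents
    "tender/procuringEntity/name": "Htal. José M. Penna",
}

diffs = diff_fields(csv_row, scraped)
for field, (csv_value, page_value) in sorted(diffs.items()):
    print(f"{field}: CSV={csv_value!r} page={page_value!r}")
```

only `tender/description` is reported here; the entity name differs cosmetically and is skipped.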

xaiki commented 2 years ago

https://github.com/open-contracting/kingfisher-collect/blob/main/kingfisher_scrapy/spiders/argentina_buenos_aires.py

for further ref