opensupplyhub / open-apparel-registry

An application for searching, matching, uploading factories.
MIT License

Run merge moderation script as part of the production release of OS Hub #1927

Closed vrwOAR closed 1 year ago

vrwOAR commented 2 years ago

Execute a bulk merge via scripts in preparation for Beta launch.

Target merge facilities and source facility(ies) to be merged into the target will be identified and cluster IDs will be added to each merge grouping.

File containing merge information will be added to the ticket prior to submission for sprint planning.

vrwOAR commented 2 years ago

@mariel-oar This ticket is updated for sprint planning. Merge cluster listing

mariel-oar commented 2 years ago

This ticket should be sequenced so that it happens on or before the OAR database migration to OS Hub (said another way: this ticket is for merging duplicates that exist in the current OAR database and that we want resolved in the version of the database that gets copied to OS Hub).

jwalgran commented 2 years ago

I labeled this client action needed because the OS Hub Team will need to provide the merge list just prior to launch.

mariel-oar commented 2 years ago

@jwalgran, might this be a good issue for @KlausGPaul to be involved in? The idea being that we could run these more often if it's something our team can control.

KlausGPaul commented 2 years ago

We'd have datasets for Indian companies (some 1.2 million), and a synthetic dataset of Turkish addresses (the addresses seem to be real, but names are synthetic), 100K or 1000K. I will make a plot of the spatial distributions, then we see if this buys us something.

vrwOAR commented 2 years ago

Merge moderation listing

mariel-oar commented 2 years ago

@obrienad, we have been working with the understanding that this ticket would be picked up for next sprint. Based on that, @vrwOAR is working to finalize the list by Sept 29 (Thursday). Is this information sufficient to remove the CAN label?

obrienad commented 2 years ago

Great, thanks! Let's keep it on and @vrwOAR can reply in thread when it's good to go and I'll remove the label then!

vrwOAR commented 2 years ago

Hi all - @obrienad this ticket is ready to be picked up. Thanks

TaiWilkin commented 2 years ago

@vrwOAR The merge script has been run successfully. Partway through the first run, an API Block was applied; after assigning myself a grace limit, I was able to complete the script successfully. Note for the future that we should ensure the token used for this script has unlimited API requests assigned. On the first attempt, the following errors occurred:

('Status:', 400, 'Content:', '{"target":["Facility IN2021271XS1AKT does not exist."]}', 'Target ID:', 'IN2021271XS1AKT', 'Merge ID:', 'IN2019181A16JXD')
('Status:', 400, 'Content:', '{"merge":["Facility TR2019083ETMJ3E does not exist."]}', 'Target ID:', 'TR20193173JXFWT', 'Merge ID:', 'TR2019083ETMJ3E')
('Status:', 400, 'Content:', '{"merge":["Facility IN20211597CNK4A does not exist."]}', 'Target ID:', 'IN202218625HBK2', 'Merge ID:', 'IN20211597CNK4A')
('Status:', 400, 'Content:', '{"merge":["Facility IN2019181Q2G965 does not exist."]}', 'Target ID:', 'IN2021271G3BRNH', 'Merge ID:', 'IN2019181Q2G965')
('Status:', 400, 'Content:', '{"merge":["Facility TR2022027WJTDER does not exist."]}', 'Target ID:', 'TR2019083QKH765', 'Merge ID:', 'TR2022027WJTDER')
('Status:', 400, 'Content:', '{"merge":["Facility TR202030139YZF6 does not exist."]}', 'Target ID:', 'TR2019083QKH765', 'Merge ID:', 'TR202030139YZF6')
('Status:', 400, 'Content:', '{"merge":["Facility TR20203512KKF8J does not exist."]}', 'Target ID:', 'TR2019083T3GQ78', 'Merge ID:', 'TR20203512KKF8J')
('Status:', 400, 'Content:', '{"merge":["Facility TR20200983VJ4Q6 does not exist."]}', 'Target ID:', 'TR2019083T3GQ78', 'Merge ID:', 'TR20200983VJ4Q6')
('Status:', 400, 'Content:', '{"merge":["Facility TR2019178KXQERF does not exist."]}', 'Target ID:', 'TR2019083XMDKT8', 'Merge ID:', 'TR2019178KXQERF')
('Status:', 400, 'Content:', '{"merge":["Facility TR2021159TEKK2A does not exist."]}', 'Target ID:', 'TR2019083XMDKT8', 'Merge ID:', 'TR2021159TEKK2A')
('Status:', 400, 'Content:', '{"target":["Facility IN2022007MDS525 does not exist."]}', 'Target ID:', 'IN2022007MDS525', 'Merge ID:', 'IN2019181FFP3JP')
('Status:', 400, 'Content:', '{"target":["Facility IN2022007MDS525 does not exist."],"merge":["Facility IN2022063T4Y9MZ does not exist."]}', 'Target ID:', 'IN2022007MDS525', 'Merge ID:', 'IN2022063T4Y9MZ')
('Status:', 400, 'Content:', '{"merge":["Facility IN2022007F2WBB7 does not exist."]}', 'Target ID:', 'IN2021342TFV77Q', 'Merge ID:', 'IN2022007F2WBB7')
('Status:', 400, 'Content:', '{"merge":["Facility IN2022012MZMV8A does not exist."]}', 'Target ID:', 'IN2021342TFV77Q', 'Merge ID:', 'IN2022012MZMV8A')
('Status:', 400, 'Content:', '{"merge":["Facility TR2020070NV9G82 does not exist."]}', 'Target ID:', 'TR20203513VKQ72', 'Merge ID:', 'TR2020070NV9G82')
('Status:', 400, 'Content:', '{"merge":["Facility TR2021182X2Z952 does not exist."]}', 'Target ID:', 'TR2020204WNSQZG', 'Merge ID:', 'TR2021182X2Z952')
('Status:', 400, 'Content:', '{"target":["Facility TR2019259DCC209 does not exist."]}', 'Target ID:', 'TR2019259DCC209', 'Merge ID:', 'TR2019172ZYZNYB')
('Status:', 400, 'Content:', '{"target":["Facility TR2019259TQN4M0 does not exist."]}', 'Target ID:', 'TR2019259TQN4M0', 'Merge ID:', 'TR20210861BXSJD')
('Status:', 400, 'Content:', '{"target":["Facility TR2019259TQN4M0 does not exist."]}', 'Target ID:', 'TR2019259TQN4M0', 'Merge ID:', 'TR2019083G5XWS4')
('Status:', 400, 'Content:', '{"merge":["Facility IN2020085EHVSJD does not exist."]}', 'Target ID:', 'IN2021162RH4X6N', 'Merge ID:', 'IN2020085EHVSJD')
('Status:', 400, 'Content:', '{"merge":["Facility TR2020191QKY8MS does not exist."]}', 'Target ID:', 'TR2019172ZVAW3Z', 'Merge ID:', 'TR2020191QKY8MS')
('Status:', 400, 'Content:', '{"merge":["Facility TR2019259QK9CTV does not exist."]}', 'Target ID:', 'TR2020053K07BF6', 'Merge ID:', 'TR2019259QK9CTV')
('Status:', 400, 'Content:', '{"merge":["Facility IN2020239FZYMHT does not exist."]}', 'Target ID:', 'IN2021336X8JTXP', 'Merge ID:', 'IN2020239FZYMHT')
('Status:', 400, 'Content:', '{"target":["Facility IN2020053842XZD does not exist."]}', 'Target ID:', 'IN2020053842XZD', 'Merge ID:', 'IN2021340TRKHJB')
('Status:', 400, 'Content:', '{"merge":["Facility BD2020021177SB0 does not exist."]}', 'Target ID:', 'BD20211599W2XNK', 'Merge ID:', 'BD2020021177SB0')
('Status:', 400, 'Content:', '{"target":["Facility TR20191725RKFC7 does not exist."]}', 'Target ID:', 'TR20191725RKFC7', 'Merge ID:', 'TR2020330841RSE')
('Status:', 400, 'Content:', '{"target":["Facility TR20191723D9751 does not exist."]}', 'Target ID:', 'TR20191723D9751', 'Merge ID:', 'TR2021336AFDR2Q')
('Status:', 400, 'Content:', '{"target":["Facility TR20191723D9751 does not exist."]}', 'Target ID:', 'TR20191723D9751', 'Merge ID:', 'TR20191761HSFA4')
('Status:', 400, 'Content:', '{"merge":["Facility IN20190833ATEHM does not exist."]}', 'Target ID:', 'IN2020053QN9B47', 'Merge ID:', 'IN20190833ATEHM')
('Status:', 400, 'Content:', '{"target":["Facility IN20202160SCSQS does not exist."]}', 'Target ID:', 'IN20202160SCSQS', 'Merge ID:', 'IN2022063J25YMH')
('Status:', 400, 'Content:', '{"merge":["Facility IN2019083R8ETYM does not exist."]}', 'Target ID:', 'IN202001054KF07', 'Merge ID:', 'IN2019083R8ETYM')
('Status:', 400, 'Content:', '{"target":["Facility IN2021102HB5PA0 does not exist."],"merge":["Facility IN20213154ZFVTC does not exist."]}', 'Target ID:', 'IN2021102HB5PA0', 'Merge ID:', 'IN20213154ZFVTC')
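
For triage, the failures above could be tallied by which side of the pair the API rejected. This is only a minimal sketch, assuming each log line is a printed Python tuple in exactly the shape shown above; the helper name `summarize` and the three-line `sample` are illustrative and not part of the merge script.

```python
import ast
import json
from collections import Counter


def summarize(lines):
    """Count failures by the field ('target' or 'merge') named in the error body."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumed layout: ('Status:', code, 'Content:', '<json>', 'Target ID:', id, 'Merge ID:', id)
        record = ast.literal_eval(line)
        content = json.loads(record[3])
        for field in content:
            counts[field] += 1
    return counts


# Three lines copied from the log above, used only as sample input.
sample = r"""
('Status:', 400, 'Content:', '{"target":["Facility IN2021271XS1AKT does not exist."]}', 'Target ID:', 'IN2021271XS1AKT', 'Merge ID:', 'IN2019181A16JXD')
('Status:', 400, 'Content:', '{"merge":["Facility TR2019083ETMJ3E does not exist."]}', 'Target ID:', 'TR20193173JXFWT', 'Merge ID:', 'TR2019083ETMJ3E')
('Status:', 400, 'Content:', '{"target":["Facility IN2022007MDS525 does not exist."],"merge":["Facility IN2022063T4Y9MZ does not exist."]}', 'Target ID:', 'IN2022007MDS525', 'Merge ID:', 'IN2022063T4Y9MZ')
""".splitlines()

print(summarize(sample))
```

A row whose error body names both `target` and `merge` is counted once under each field, so the totals can exceed the number of failed rows.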

On the second attempt, the facilities which had been merged successfully all returned 'does not exist' errors (as expected), so the error response was mostly too noisy to be useful. These are the errors that were thrown in the second attempt that were not 'does not exist' errors:

('Status:', 400, 'Content:', '{"target":["Cannot be the same as merge."],"merge":["Cannot be the same as target."]}', 'Target ID:', 'CN20190830X6JJF', 'Merge ID:', 'CN20190830X6JJF')
('Status:', 400, 'Content:', '{"target":["Cannot be the same as merge."],"merge":["Cannot be the same as target."]}', 'Target ID:', 'IN2019172EWGPVR', 'Merge ID:', 'IN2019176HE1SCP')
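
Since a rerun reports 'does not exist' for every pair that was already merged, the output could be filtered down to the errors that still need attention. The following is a rough sketch under the same assumption about the log format (one printed Python tuple per line); `unexpected_errors` is a hypothetical helper, not something the merge script provides.

```python
import ast
import json


def unexpected_errors(lines):
    """Return log records whose messages are not all 'does not exist' errors."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumed layout: ('Status:', code, 'Content:', '<json>', 'Target ID:', id, 'Merge ID:', id)
        record = ast.literal_eval(line)
        content = json.loads(record[3])
        messages = [msg for msgs in content.values() for msg in msgs]
        # On a rerun, 'does not exist' is expected for pairs merged earlier, so skip those.
        if all(msg.endswith("does not exist.") for msg in messages):
            continue
        kept.append(record)
    return kept


# Two lines copied from the logs above, used only as sample input.
rerun_log = r"""
('Status:', 400, 'Content:', '{"merge":["Facility TR2019083ETMJ3E does not exist."]}', 'Target ID:', 'TR20193173JXFWT', 'Merge ID:', 'TR2019083ETMJ3E')
('Status:', 400, 'Content:', '{"target":["Cannot be the same as merge."],"merge":["Cannot be the same as target."]}', 'Target ID:', 'CN20190830X6JJF', 'Merge ID:', 'CN20190830X6JJF')
""".splitlines()

for record in unexpected_errors(rerun_log):
    print(record[5], record[7], record[3])
```

This keeps any record with at least one message other than 'does not exist', so a mixed error body would still be surfaced for review.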