tosdr / edit.tosdr.org

👍👎 A new web app to rate services
https://edit.tosdr.org
GNU Affero General Public License v3.0

Duplicate documents #914

Closed michielbdejong closed 3 years ago

michielbdejong commented 4 years ago

To resolve the duplicate services problem, it makes sense to tackle the duplicate documents problem first, because we don't want to merge two services and then end up with the merged service's documents duplicated.


phoenix_development=# select distinct a.url from documents a inner join documents b on  a.url = b.url where a.id != b.id;
                                                     url                                                      
--------------------------------------------------------------------------------------------------------------
 https://www.flickr.com/help/terms
 https://www.grabcraft.com/terms-of-use/
 https://tradestatistics.io/site/data/
 https://ello.co/wtf/policies/privacy/
 https://www.atlassian.com/legal/cloud-terms-of-service
 https://iziit.org/politique-de-confidentialite
 https://mullvad.net/en/guides/no-logging-data-policy/
 https://www.apple.com/legal/internet-services/itunes/us/terms.html
 https://www.epicgames.com/site/en-US/tos
 https://www.pathofexile.com/legal/terms-of-use-and-privacy-policy
 http://qc7ilonwpv77qibm.onion/
 https://www.airbnb.com/terms
 https://looparo.com/integrity-and-privacypolicy/
 https://discordapp.com/terms
 https://signal.org/legal/
 https://www.funimation.com/terms-of-use/
 https://www.shopify.com/legal/terms-payments-us
 https://www.britishairways.com/en-ch/information/legal/privacy-policy
 https://www.spotify.com/us/legal/end-user-agreement/
 https://dutchie.com/terms
 https://tutanota.com/fr/privacy
 https://stackoverflow.com/legal/cookie-policy
 https://www.apple.com/legal/privacy/en-ww/
 https://www.drugstore99.com/pages/privacy-policy
 https://www.jagex.com/terms/privacy
 https://privacy-hub.sainsburys.co.uk/privacy-policy/
 https://www.jagex.com/terms
 https://www.storemorestore.com/terms_privacy.asp
 https://www.prosper.com/account/common/agreement_view.aspx?agreement_type_id=1005
 https://factsmgt.com/privacy-policy/
 https://dnd5e.info/privacy-statement/
 https://matrix.org/legal/terms-and-conditions/
 https://www.furaffinity.net/tos
 https://www.sophos.com/en-us/legal/sophos-end-user-license-agreement.aspx
 https://chequered.ink/privacy-policy/
 https://wiki.roll20.net/Terms_of_Service_and_Privacy_Policy#English.2C_not_Legalese
 https://factsmgt.com/terms-of-use
 https://www.cbsinteractive.com/legal/cbsi/privacy-policy
 https://www.ycombinator.com/legal/
 https://leonardohobby.ru/offer/
 https://www.voxmedia.com/legal/privacy-policy
 https://www.privateinternetaccess.com/pages/privacy-policy/
 https://mastodon.social/terms
 https://citizen.com/privacy/lawenforcement
 https://www.flickr.com/help/cookies
 https://www.mrhomebody.com/privacy/
 https://vocalise.me.uk/privacy-policy
 https://homeschool-class.com/index.php/cookie-policy-us/
 https://www.epicgames.com/site/en-US/privacypolicy
 https://alirezahayati.com/copyleft
 https://blendermarket.com/policies/privacy-policy
 https://www.minds.com/p/terms
 https://spotterlead.net/Home/Privacy
 https://www.barstoolsports.com/privacy-policy
 https://politicsandwar.com/terms/
 https://scratch.mit.edu/privacy_policy
 https://www.whatsapp.com
 https://help.nytimes.com/hc/en-us/articles/115014893428-Terms-of-service
 https://io.adafruit.com/terms
 https://www.creditkarma.com/about/terms-20180122
 https://www.flickr.com/help/privacy
 https://www.rabb.it/tos
 https://www.coinbase.com/legal/user_agreement
 https://stackoverflow.com/legal/terms-of-service/public
 https://www.moddb.com/privacy-policy
 https://vk.com/privacy
 https://www.microsoft.com/en-us/legal/intellectualproperty/copyright/default.aspx
 https://account.samsung.com/membership/terms/termscontents
 https://trello.com/privacy
 https://vk.com/terms
 https://www.startpage.com/en/search/privacy-policy.html
 https://www.microsoft.com/en-us/servicesagreement/
 https://www.apple.com/legal/internet-services/icloud/en/terms.html
 https://www.tucowsdomains.com/help/legal-policies/
 https://www.rte.ie/about/en/policies-and-reports/policies-guidelines/2012/0417/317440-rte-privacy-statement/
(75 rows)
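
For reference, a GROUP BY ... HAVING query should return the same URLs as the self-join above, and it also reports how many copies of each URL exist. This is only a minimal sketch in Python: it assumes an in-memory SQLite table as a stand-in for the PostgreSQL `documents` table, it uses only the columns seen in this thread (id, service_id, url), and the function names and sample rows are made up for illustration.

```python
import sqlite3


def duplicate_urls(conn):
    """Return (url, copies) for every URL stored in more than one documents row."""
    query = """
        SELECT url, COUNT(*) AS copies
        FROM documents
        WHERE url IS NOT NULL      -- the self-join never matches NULL urls either
        GROUP BY url
        HAVING COUNT(*) > 1
        ORDER BY url
    """
    return conn.execute(query).fetchall()


def demo_connection():
    """Build a throwaway database with invented rows (not real tosdr data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents (id INTEGER PRIMARY KEY, service_id INTEGER, url TEXT)"
    )
    conn.executemany(
        "INSERT INTO documents (id, service_id, url) VALUES (?, ?, ?)",
        [
            (1, 10, "https://example.com/terms"),
            (2, 10, "https://example.com/terms"),
            (3, 11, "https://example.com/terms"),
            (4, 12, "https://example.org/privacy"),
        ],
    )
    return conn


if __name__ == "__main__":
    for url, copies in duplicate_urls(demo_connection()):
        print(url, copies)
```

The SQL string itself should also run unchanged in psql, since it only uses standard aggregate syntax.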

michielbdejong commented 4 years ago

The output of

select id, service_id, url from documents where url in (select distinct a.url from documents a inner join documents b on a.url = b.url where a.id != b.id) order by url

suggests that many of the duplicate documents occur within the same service. So let me try to clean those up first.
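
To put a number on "within the same service", the duplicated URLs can be split by how many distinct services they belong to. The following is a rough sketch, not the cleanup that was actually run: it assumes an in-memory SQLite stand-in for the PostgreSQL database, only the id, service_id and url columns mentioned above, and hypothetical function names and sample rows.

```python
import sqlite3


def classify_duplicates(conn):
    """Split duplicated URLs into same-service and cross-service lists.

    A URL counts as cross-service as soon as its copies belong to more than
    one service, even if one of those services also holds it twice.
    """
    query = """
        SELECT url,
               COUNT(*) AS copies,
               COUNT(DISTINCT service_id) AS services
        FROM documents
        WHERE url IS NOT NULL
        GROUP BY url
        HAVING COUNT(*) > 1
        ORDER BY url
    """
    same_service, cross_service = [], []
    for url, _copies, services in conn.execute(query):
        if services == 1:
            same_service.append(url)   # safe to clean up before merging services
        else:
            cross_service.append(url)  # needs the services to be looked at first
    return same_service, cross_service


def make_demo_db():
    """Create a throwaway database with invented rows (not real tosdr data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents (id INTEGER PRIMARY KEY, service_id INTEGER, url TEXT)"
    )
    conn.executemany(
        "INSERT INTO documents (id, service_id, url) VALUES (?, ?, ?)",
        [
            (1, 10, "https://example.com/terms"),    # same service, twice
            (2, 10, "https://example.com/terms"),
            (3, 11, "https://example.org/privacy"),  # two different services
            (4, 12, "https://example.org/privacy"),
            (5, 13, "https://example.net/cookies"),  # not duplicated
        ],
    )
    return conn


if __name__ == "__main__":
    same, cross = classify_duplicates(make_demo_db())
    print("same service:", same)
    print("across services:", cross)
```

URLs in the first list are the ones that can be deduplicated without touching the duplicate services question at all.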