tosdr / edit.tosdr.org

👍👎 A new web app to rate services
https://edit.tosdr.org
GNU Affero General Public License v3.0

Use ambanum/CGUs for crawling #928

Closed michielbdejong closed 3 years ago

michielbdejong commented 4 years ago

Or maybe we should stop putting crawled text in postgres and just let edit.tosdr.org load it from GitHub? At least we need a script that checks if point quotes need to be updated.
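
A minimal sketch of such a check, assuming each point stores its quote text and a character offset into the crawled text. The `Point` class and its field names are hypothetical here, not the actual edit.tosdr.org model:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Point:
    # Hypothetical shape; the real edit.tosdr.org model may differ.
    id: int
    quote_text: str
    quote_start: Optional[int]  # character offset of the quote in the stored text


def find_stale_points(points: List[Point], new_text: str) -> Dict[str, List[int]]:
    """Sort point ids into 'unchanged', 'moved' and 'not_found' against new_text."""
    result: Dict[str, List[int]] = {"unchanged": [], "moved": [], "not_found": []}
    for point in points:
        if not point.quote_text:
            # An empty quote cannot be located, so flag it for review.
            result["not_found"].append(point.id)
        elif point.quote_start is not None and new_text.startswith(
            point.quote_text, point.quote_start
        ):
            # The quote still sits at the stored offset.
            result["unchanged"].append(point.id)
        elif point.quote_text in new_text:
            # The quote exists, but the stored offset needs updating.
            result["moved"].append(point.id)
        else:
            result["not_found"].append(point.id)
    return result
```

Points in `moved` only need a new offset; points in `not_found` need a human to look at them.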

Maybe 'Crawl Document' should just fetch&save the markdown, like https://raw.githubusercontent.com/ambanum/CGUs-versions/master/Facebook/Commercial%20Terms.md
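
As a rough sketch of what "fetch&save the markdown" could look like, assuming the `<Service>/<Document type>.md` layout seen in the URL above holds for every document (the function names are made up for illustration):

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base of the raw view of ambanum/CGUs-versions, taken from the example URL.
RAW_BASE = "https://raw.githubusercontent.com/ambanum/CGUs-versions/master"


def version_url(service: str, document_type: str) -> str:
    """Build the raw GitHub URL for one document, percent-encoding each path part."""
    return f"{RAW_BASE}/{quote(service, safe='')}/{quote(document_type, safe='')}.md"


def fetch_version(service: str, document_type: str, timeout: float = 10.0) -> str:
    """Download the current markdown version of a document as text."""
    with urlopen(version_url(service, document_type), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

`urlopen` raises `urllib.error.HTTPError` on a 404, which would be one way to notice that a document is not on ambanum/CGUs.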

What to do if a document is not on ambanum/CGUs? Run a shadow instance?

michielbdejong commented 4 years ago

I should first get https://github.com/ambanum/CGUs/pull/71 merged with the limited number of docs there. Then I should create a branch which has all the docs from both tosback and ETO (also ones without doc type). Then I should set up an instance on Heroku that runs my branch. Should it export to git? I can fork ambanum/CGUs-versions and ambanum/CGUs-snapshots so that we can use those as the data source for both tosback and ETO.

MattiSG commented 4 years ago

> What to do if a document is not on ambanum/CGUs? Run a shadow instance?

I guess it depends how easily you want contributors to add documents vs how much these documents should be reviewed. You could very well have a separate instance of CGUs with your own service declarations, and regular upstream imports into the “main” CGUs instance that could then publish to CGUs-versions 🙂

I'm curious what ETO is!

michielbdejong commented 4 years ago

Ah, ETO is just "edit.tosdr.org" :)

michielbdejong commented 4 years ago

I guess we should add all rules that pass validation to ambanum/CGUs. https://github.com/ambanum/CGUs/pull/88#issuecomment-674035419 already helps a lot. I'll see how far I can get. If a document is not admitted in ambanum/CGUs then maybe it should also not be admitted in edit.tosdr.org.

michielbdejong commented 4 years ago

  1. import all historical snapshots from tosback into ambanum/CGUs, so that we don't have to look at tosback for crawl data anymore
  2. once that's done, make sure that for each URL we imported, if it still exists, we keep crawling it
  3. switch over tosback.org to use ambanum's crawler
  4. integration with edit.tosdr.org. this will be the trickier part:
    • introduce point state 'quote-not-found' and improve workflow around that
    • script that ports versions git -> postgres?
    • switch document model from .xpath to .select
    • script that ports rules postgres -> git?
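
For the "versions git -> postgres" script in step 4, a first step could be to walk a local clone of the versions repository and yield one row per document. This sketch assumes the `<Service>/<Document type>.md` layout and leaves the actual database write out, since the target schema is not described here. `iter_versions` is a hypothetical name:

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_versions(checkout: Path) -> Iterator[Tuple[str, str, str]]:
    """Yield (service, document_type, markdown_text) for each document in a clone."""
    for md_file in sorted(checkout.glob("*/*.md")):
        # pathlib's glob also matches hidden directories such as .git, so skip them.
        if md_file.parent.name.startswith("."):
            continue
        yield md_file.parent.name, md_file.stem, md_file.read_text(encoding="utf-8")
```

Each yielded tuple could then be compared with the stored document text and written to postgres only when it differs.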

michielbdejong commented 3 years ago

I switched over https://tosback.org today. Will edit the edit.tosdr.org code to:

Then:

michielbdejong commented 3 years ago

Working on https://github.com/tosdr/edit.tosdr.org/pull/956 first as it's related and something I also want to add.