huridocs / uwazi

Uwazi is a web-based, open-source solution for building and sharing document collections
http://www.uwazi.io
MIT License

Scraping web articles into private Uwazi instance #6613

Open pddocs opened 3 months ago

pddocs commented 3 months ago

A partner is working with a new scraping tool, Zyte. They would like to use it to set up a scraping workflow that brings web articles into their existing templates inside a specific Uwazi instance. Could you please support our partner @thebutcher00 with this setup? Thank you.

Additional context: A scraping setup had been in place using a script written by their ex-colleague, thanks to our dev team's support over GitHub. See the threads here (#1; #2; #3990; #3989). This setup now needs to be updated a) to include more websites to scrape from and b) to replace the existing system, as no one is there to maintain the script anymore.

@natasha-todi @RafaPolit

RafaPolit commented 1 month ago

@pddocs how do you want us to support them? Should we just be on-the-listen for any requests they have, or do they have current requirements that need support?

cc @gabriel-piles

pddocs commented 1 month ago

Thank you for the follow-up, @RafaPolit.

Please stay in on-the-listen mode until @thebutcher00 comments here for your concrete instructions/advice.

Do note, this is not an active project.