privacy-tech-lab / gpc-web-crawler

Web crawler for detecting websites' compliance with GPC privacy preference signals at scale
https://privacytechlab.org/
MIT License

Run in a Docker container #98

Open dmarti opened 7 months ago

dmarti commented 7 months ago

For regular testing, it would be useful to run the crawler in a container with one command. I have the REST API and extension build steps largely working, and am working out how to do the actual crawl. Work in progress...

https://github.com/privacy-tech-lab/gpc-web-crawler/compare/main...dmarti:gpc-web-crawler:dockerize?expand=1

Not ready to discuss or merge yet, just wanted to see if there is interest. In the long run I'd like to be able to run the crawler as a service that doesn't need much attention, just sends reports if a site being watched has broken GPC.

SebastianZimmeck commented 7 months ago

It is a good point, @dmarti! We discussed the Dockerization and also think it is a good idea to explore. @sophieeng will take the lead on our end, with @katehausladen, @Mattm27, and @franciscawijaya helping out.

dmarti commented 7 months ago

Thank you -- right now I think in order to get the crawl working I need to modify my Dockerfile to get the right versions of Firefox Nightly and geckodriver installed.

What versions are you running, and what source are you using for geckodriver? (I haven't used Selenium in a while and it seems like things have moved around. I just want to go to the right place.)

katehausladen commented 7 months ago

Currently we're just using whatever Firefox version is on the computer locally (.setBinary('/Applications/Firefox\ Nightly.app/Contents/MacOS/firefox')), and Selenium uses the geckodriver from the local Firefox Nightly. So, this means we're always using the most recent version of both. In terms of Docker, I think as long as you use something relatively recent, it should be fine.
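A minimal sketch of how the binary path could be chosen per platform, so the same crawler code could run on macOS and inside a Linux container. The `firefoxBinaryPath` helper, the `FIREFOX_BIN` environment variable, and the Linux path `/opt/firefox/firefox` are assumptions for illustration and are not part of the crawler; only the macOS path comes from the `.setBinary(...)` call quoted above.

```javascript
// Hypothetical helper: pick a Firefox Nightly binary path per platform.
// Only the macOS path is taken from the thread; the Linux path and the
// FIREFOX_BIN override are assumptions made for this sketch.
function firefoxBinaryPath(platform = process.platform, env = process.env) {
  if (env.FIREFOX_BIN) {
    // Explicit override, e.g. set with ENV in a Dockerfile.
    return env.FIREFOX_BIN;
  }
  const defaults = {
    darwin: '/Applications/Firefox Nightly.app/Contents/MacOS/firefox',
    linux: '/opt/firefox/firefox', // assumed unpack location inside the image
  };
  const path = defaults[platform];
  if (!path) {
    throw new Error(`No default Firefox path for platform: ${platform}`);
  }
  return path;
}

// Possible use with selenium-webdriver (not executed here):
//   const firefox = require('selenium-webdriver/firefox');
//   const options = new firefox.Options().setBinary(firefoxBinaryPath());
```

Keeping the path behind an override would let a container image decide where Firefox lives without changing the crawler code.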

sophieeng commented 6 months ago

Hey @dmarti! Just wanted to check in on your progress with running the crawler in Docker. Do you need any support? Let us know if we can help with anything.

dmarti commented 5 months ago

Hi @sophieeng, I got a little stuck figuring out the right source code and/or Linux packages for Selenium. It didn't look like geckodriver was packaged with the Firefox Nightly for Linux download that I was using.

dmarti commented 5 months ago

If I can get sources for known-good Firefox and Selenium downloads that work together, that would help. (I don't have a Mac to test on.)
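For reference, geckodriver is published separately from Firefox, as release archives on the mozilla/geckodriver GitHub repository. The sketch below builds the download URL for such an archive. The `geckodriverUrl` helper and the asset table are hypothetical, the version is whatever the caller passes in, and nothing here checks that a given version exists or pairs well with a particular Firefox Nightly build.

```javascript
// Hypothetical helper: build the download URL for a geckodriver release
// archive on GitHub. The caller supplies the version; this sketch does not
// verify that the release exists or that it matches the installed Firefox.
const GECKODRIVER_ASSETS = {
  'linux-x64': 'linux64.tar.gz',
  'linux-arm64': 'linux-aarch64.tar.gz',
  'darwin-x64': 'macos.tar.gz',
  'darwin-arm64': 'macos-aarch64.tar.gz',
};

function geckodriverUrl(version, platform = process.platform, arch = process.arch) {
  const suffix = GECKODRIVER_ASSETS[`${platform}-${arch}`];
  if (!suffix) {
    throw new Error(`No known geckodriver asset for ${platform}-${arch}`);
  }
  // Release tags carry a leading "v"; accept the version with or without it.
  const tag = version.startsWith('v') ? version : `v${version}`;
  return `https://github.com/mozilla/geckodriver/releases/download/${tag}/geckodriver-${tag}-${suffix}`;
}

// Example call (the version number is only an example):
//   geckodriverUrl('0.34.0', 'linux', 'x64')
```

Pinning an explicit version in a Dockerfile, rather than always fetching the latest, would make a container build reproducible, at the cost of having to bump the pin as Firefox Nightly moves on.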

katehausladen commented 5 months ago

@franciscawijaya and @Mattm27 will work on this over the next couple of weeks.

SebastianZimmeck commented 3 months ago

@dmarti is using Linux and not macOS. So, @dmarti's question is which Selenium and Firefox versions work on Linux.

Sokvy77 commented 3 months ago

Still coming


Mattm27 commented 1 week ago

Hey @dmarti! We have resumed our efforts on the dockerization of our web crawler, and I’ve been reviewing the progress you made in the spring as a foundation for our work.

However, I noticed that the myextension.xpi file has been removed from the codebase, and I'm a bit unclear about the reasoning behind this change. Could you please provide some clarification on why it was deleted? Thanks!

Mattm27 commented 4 days ago

I've been dealing with two main problems: the container closing immediately after starting (exit code 255) and conflicts between MySQL and MariaDB installations. The container issue seems related to how systemd is set up, while the MySQL vs. MariaDB problem is likely due to package conflicts.

In terms of next steps, the plan is to create a new Dockerfile from scratch, focusing on getting the container to stay running first, and then adding Apache, Geckodriver, either MySQL or MariaDB, and so on, one step at a time to avoid further conflicts. Once these issues are resolved, the Dockerization should be much closer to completion, and the existing .sh scripts should work correctly with a stable setup.

Mattm27 commented 4 days ago

The Docker image itself is building successfully, meaning all the required dependencies and configurations are being included properly. However, the issue arises when creating a container from that image—systemd is failing to initialize, causing the container to immediately exit. This distinction is important because it suggests the problem is not with the build process, but rather with how the container is running or managing processes once started.

SebastianZimmeck commented 4 days ago

As discussed in our meeting, @Mattm27 will start a fresh Docker implementation.