privacy-tech-lab / gpc-web-crawler

Web crawler for detecting websites' compliance with GPC privacy preference signals at scale
https://privacytechlab.org/
MIT License

Run in a Docker container #98

Open dmarti opened 8 months ago

dmarti commented 8 months ago

For regular testing, it would be useful to run the crawler in a container with a single command. I have the REST API and extension build steps largely working and am now working out how to run the actual crawl. Work in progress...

https://github.com/privacy-tech-lab/gpc-web-crawler/compare/main...dmarti:gpc-web-crawler:dockerize?expand=1

Not ready to discuss or merge yet; I just wanted to see if there is interest. In the long run, I'd like to be able to run the crawler as a service that doesn't need much attention and just sends a report if a watched site has broken GPC.

SebastianZimmeck commented 8 months ago

It is a good point, @dmarti! We discussed the Dockerization and also think it is a good idea to explore. @sophieeng will take the lead on our end, with @katehausladen, @Mattm27, and @franciscawijaya helping out.

dmarti commented 8 months ago

Thank you -- right now I think in order to get the crawl working I need to modify my Dockerfile to get the right versions of Firefox Nightly and geckodriver installed.

What versions are you running, and what source are you using for geckodriver? (I haven't used Selenium in a while and it seems like things have moved around; I just want to go to the right place.)

katehausladen commented 8 months ago

Currently we're just using whatever Firefox version is on the computer locally (.setBinary('/Applications/Firefox\ Nightly.app/Contents/MacOS/firefox')), and Selenium uses the geckodriver from the local Firefox Nightly. So, this means we're always using the most recent version of both. In terms of Docker, I think as long as you use something relatively recent, it should be fine.
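A minimal sketch of how the binary path could be chosen per platform, so that the same setup works on macOS and inside a Linux container. The macOS path is the one quoted above. The Linux path `/opt/firefox-nightly/firefox`, the `FIREFOX_BIN` override, and the function name `firefox_nightly_binary` are assumptions made for illustration and are not taken from the repository.

```shell
#!/bin/sh
# Print the Firefox Nightly binary path for an OS name as reported by `uname -s`.
# With no argument, the current OS is used.
firefox_nightly_binary() {
    os="${1:-$(uname -s)}"
    case "$os" in
        # Path quoted in the comment above.
        Darwin) echo "/Applications/Firefox Nightly.app/Contents/MacOS/firefox" ;;
        # Hypothetical install location; FIREFOX_BIN (also hypothetical) overrides it.
        Linux)  echo "${FIREFOX_BIN:-/opt/firefox-nightly/firefox}" ;;
        *)      echo "unsupported OS: $os" >&2; return 1 ;;
    esac
}

# Example: show the path that would be used in a Linux container.
firefox_nightly_binary Linux
```

The printed path would then be passed to `.setBinary(...)`. geckodriver is published separately from Firefox, so on Linux it may need to be installed in its own step, which matches what @dmarti reports later in this thread.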

sophieeng commented 7 months ago

Hey @dmarti! Just wanted to check in on your progress with running the crawler in Docker. Do you need any support? Let us know if we can help with anything.

dmarti commented 7 months ago

Hi @sophieeng, I got a little stuck figuring out the right source code and/or Linux packages for Selenium. It didn't look like geckodriver was packaged with the Firefox Nightly for Linux download that I was using.

dmarti commented 7 months ago

If I could get a source for known-good Firefox and Selenium downloads that work together, that would help (I don't have a Mac to test on).

katehausladen commented 6 months ago

@franciscawijaya and @Mattm27 will work on this over the next couple of weeks.

SebastianZimmeck commented 5 months ago

@dmarti is using Linux and not macOS. So, @dmarti's question is which Selenium and Firefox versions work on Linux.

Sokvy77 commented 5 months ago

Still coming

On Fri, Jun 14, 2024 at 7:32 AM, Sebastian Zimmeck wrote:

@dmarti is using Linux and not macOS. So, @dmarti's question (https://github.com/privacy-tech-lab/gpc-web-crawler/issues/98#issuecomment-2050367872) is which Selenium and Firefox versions work on Linux.

Mattm27 commented 1 month ago

Hey @dmarti! We have resumed our efforts on the dockerization of our web crawler, and I’ve been reviewing the progress you made in the spring as a foundation for our work.

However, I noticed that the myextension.xpi file has been removed from the codebase, and I'm a bit unclear about the reasoning behind this change. Could you please provide some clarification on why it was deleted? Thanks!

Mattm27 commented 1 month ago

I've been dealing with two main problems: the container closing immediately after starting (exit code 255) and conflicts between MySQL and MariaDB installations. The container issue seems related to how systemd is set up, while the MySQL vs. MariaDB problem is likely due to package conflicts.

In terms of next steps, the plan is to create a new Dockerfile from scratch, focusing first on getting the container to stay running, and then adding Apache, geckodriver, either MySQL or MariaDB, and so on, one step at a time to avoid further conflicts. Once these issues are resolved, the Dockerization should be much closer to completion, and the existing .sh scripts should work correctly with a stable setup.

Mattm27 commented 1 month ago

The Docker image itself is building successfully, meaning all the required dependencies and configurations are being included properly. However, the issue arises when creating a container from that image—systemd is failing to initialize, causing the container to immediately exit. This distinction is important because it suggests the problem is not with the build process, but rather with how the container is running or managing processes once started.

SebastianZimmeck commented 1 month ago

As discussed in our meeting, @Mattm27 will start a fresh Docker implementation.

Mattm27 commented 1 month ago

I was successfully able to build the Docker image, and the container now runs continuously without stopping unexpectedly. This was achieved by using CMD ["sleep", "infinity"] for testing purposes to keep the container alive.

Successfully installed and verified the following components within the container:

These installations are all functioning correctly, and the container is stable during testing.

I'm still having trouble installing MySQL. The container currently cannot locate the mysql-server package, most likely because of a repository issue. However, as discussed in the meeting earlier this week, I was able to install MariaDB as an alternative. Since MariaDB is a drop-in replacement for MySQL, we can explore using it if we cannot resolve the MySQL installation directly.
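The fallback described above could be sketched roughly as follows, assuming a Debian- or Ubuntu-based image where `apt-cache` is available. The function names `pkg_available` and `pick_db_package` are hypothetical, and the thread does not say which base image is in use.

```shell
#!/bin/sh
# Succeed if apt knows a real (non-virtual) package with this name.
# Real packages produce a "Package:" record on stdout; unknown ones do not.
pkg_available() {
    apt-cache show "$1" 2>/dev/null | grep -q '^Package:'
}

# Print the first installable database server package, preferring MySQL
# and falling back to MariaDB.
pick_db_package() {
    for pkg in mysql-server mariadb-server; do
        if pkg_available "$pkg"; then
            echo "$pkg"
            return 0
        fi
    done
    echo "no database server package found" >&2
    return 1
}

# Possible use in an image build step (shown as a comment, not executed here):
#   apt-get update && apt-get install -y "$(pick_db_package)"
```

Installing only one of the two packages also avoids the MySQL and MariaDB conflict mentioned earlier in the thread.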

SebastianZimmeck commented 1 month ago

Good progress, @Mattm27!

Mattm27 commented 1 month ago

I've made updates to the Dockerization process as outlined in the code above. The container is now being built correctly using the updated image, and I am in the process of testing individual crawler components within the container. The rest-api.sh script is functioning as expected and building the database, but I am currently troubleshooting issues with the build-extension.sh script to ensure the extension is properly built and integrated.

Mattm27 commented 1 month ago

I managed to work around the issue where the myextension.xpi file was being repacked every time the software was run in the Docker container. This repacking process was causing errors. Instead of repacking the extension, I used the prepacked myextension.xpi that is already present in the codebase. This allowed me to bypass the issues with corrupt extensions.

After resolving the extension issue, I’ve run into a new problem with Firefox when attempting to run the crawl. The Firefox browser does not seem to be functioning properly in the container, which prevents the crawl from executing as expected. I'm currently troubleshooting the setup for Firefox Nightly, which is required for the extension, to ensure it's correctly installed and configured for headless mode, but I suspect I may need some extra support on this part of the process.

SebastianZimmeck commented 1 month ago

As we discussed, @eakubilo will also help with the dockerization.

@dmarti, we are currently having an issue getting Firefox Nightly to run in Docker. As @Mattm27 is saying:

The Firefox browser does not seem to be functioning properly in the container, ...

Do you have any thoughts on this?

dmarti commented 1 month ago

So the browser is starting in headless mode inside the container?

Does debugging protocol work? Can you expose the debugging protocol from Firefox inside the container?

https://firefox-source-docs.mozilla.org/devtools/backend/protocol.html

Mattm27 commented 1 month ago

Hey @dmarti! Yes, currently the browser is starting in headless mode inside the container, but we typically run Firefox in headful mode for the crawl. I haven’t yet explored exposing the debugging protocol from Firefox within the container, but it's something I can definitely look into.

I apologize for the oversight in the past comment. Our main priority right now is getting Firefox Nightly installed and running correctly, since the extension required for the crawl only functions in Nightly. While I was able to install standard Firefox inside the container, it wasn’t functioning properly, which is concerning, but ultimately, we need to focus on ensuring Firefox Nightly is installed and runs in headful mode for the crawler. I'm currently updating the Docker setup to address this. Do you have any suggestions for how we can go about doing this? It doesn't seem as straightforward when compared to installing other applications.

Thanks for the suggestion on the debugging protocol—I'll revisit that once we’ve got Nightly working as expected.

dmarti commented 4 weeks ago

@Mattm27 This is interesting -- in order to run non-headless, Firefox needs to be provided with a working GUI environment, which could mean connecting it out to the host. One option that people seem to be using is VNC inside the container -- if you commit your working Dockerfile I can try to add it, or you can see if one of these can be adapted to run Nightly...

https://github.com/ConSol/docker-headless-vnc-container
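As a rough illustration of the point above: a headful Firefox locates its GUI through the `DISPLAY` environment variable, which a VNC-enabled container typically sets to the X server it runs. The sketch below only builds the Firefox command line and does not launch anything. The function names are hypothetical, and `--headless` is Firefox's own command-line flag.

```shell
#!/bin/sh
# Print "headful" when an X display is advertised via DISPLAY, else "headless".
firefox_launch_mode() {
    if [ -n "${DISPLAY:-}" ]; then
        echo "headful"
    else
        echo "headless"
    fi
}

# Build (but do not run) the Firefox command line for the detected mode.
# $1 is the path to the Firefox binary.
firefox_command() {
    binary="$1"
    if [ "$(firefox_launch_mode)" = "headless" ]; then
        echo "$binary --headless"
    else
        echo "$binary"
    fi
}

# Example: what would be run for a binary simply called "firefox".
firefox_command firefox
```

A check like this makes it easier to tell a missing display apart from a broken Firefox install when debugging inside the container.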

Mattm27 commented 4 weeks ago

Thanks for the information, @dmarti! My updated Dockerfile is committed to branch issue-98. I will also check out the link to see if it is possible to adapt that code to run Nightly!

eakubilo commented 3 weeks ago

@dmarti We were able to leverage the docker-headless-vnc-container for our needs, thank you for the suggestion! The branch issue-98 should have the functionality described: the command sh scripts/test.sh will open a Docker container that performs the privacy crawl. We're working on PR #138, which will hopefully bring this functionality into main soon.

SebastianZimmeck commented 3 weeks ago

@eakubilo opened PR #138. @eakubilo explained that running the crawler with Docker works well on Intel Macs but throws an inscrutable error on Apple Silicon Macs. Thus, in addition to @Mattm27 and @eakubilo, @natelevinson10 and @franciscawijaya will try it out on their computers and the lab computer.

@eakubilo provided the following instructions on how to run the crawler on Docker:

Mattm27 commented 2 weeks ago

Now that we have merged the new functional Docker infrastructure, @eakubilo and I will work on updating the readme with proper installation instructions before closing this issue!

Mattm27 commented 2 weeks ago

Since the Docker image starts the crawler in debug mode by default, @eakubilo and I are working on letting users start the web crawler with or without the debugging table by running either sh scripts/webcrawler.sh or sh scripts/webcrawler.sh debug. The goal is to pass a variable from the command into the container using a flag, giving more control over the crawler's behavior at startup.
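One way such a launcher could be structured is sketched below. It is not the actual scripts/webcrawler.sh: the environment variable `DEBUG_MODE`, the image name `gpc-web-crawler`, and the function name `crawler_run_command` are assumptions for illustration. The sketch prints the `docker run` command instead of executing it, relying on `docker run -e NAME=value` to set an environment variable inside the container.

```shell
#!/bin/sh
# Translate the optional "debug" argument into a docker run command line.
crawler_run_command() {
    mode="${1:-}"
    case "$mode" in
        debug) debug=true ;;
        "")    debug=false ;;
        *)     echo "usage: webcrawler.sh [debug]" >&2; return 2 ;;
    esac
    # DEBUG_MODE and the image name are hypothetical placeholders.
    echo "docker run --rm -e DEBUG_MODE=$debug gpc-web-crawler"
}

# Print the command for the arguments this script was called with.
crawler_run_command "$@"
```

Inside the container, the startup script could then read `DEBUG_MODE` to decide whether to show the debugging table. An environment variable keeps the image itself unchanged between the two modes.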

SebastianZimmeck commented 2 weeks ago

@eakubilo and @Mattm27 will finish the Dockerization, including updating the readme and any other documentation, so that we can start the crawl (#16) on Monday of next week.

@franciscawijaya and @natelevinson10 will check whether they can follow the readme and install the Docker version on their own computers and the lab computer for next week's crawl.