dmarti opened this issue 7 months ago
It is a good point, @dmarti! We discussed the Dockerization and also think it is a good idea to explore. @sophieeng will take the lead on our end, with @katehausladen, @Mattm27, and @franciscawijaya helping out.
Thank you -- right now I think in order to get the crawl working I need to modify my Dockerfile to get the right versions of Firefox Nightly and geckodriver installed.
What versions are you running and what source are you using for geckodriver? (I haven't used Selenium in a while and it seems like things have moved around, I just want to go to the right place)
Currently we're just using whatever Firefox version is on the computer locally (.setBinary('/Applications/Firefox\ Nightly.app/Contents/MacOS/firefox')), and Selenium uses the geckodriver from the local Firefox Nightly. So, this means we're always using the most recent version of both. In terms of Docker, I think as long as you use something relatively recent, it should be fine.
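Since the local setup tracks whatever Firefox Nightly happens to be installed, a Docker build would need to pin versions explicitly instead. As a minimal sketch: geckodriver is published as per-platform tarballs on its GitHub releases page, so the download URL can be built from a pinned version string (the version number here is only an example, not a known-good pairing with any particular Firefox build):

```shell
#!/bin/sh
# Pin a geckodriver version rather than relying on whatever is installed
# locally. 0.34.0 is an example; pick one compatible with your Firefox build.
GECKODRIVER_VERSION="0.34.0"

geckodriver_url() {
  # geckodriver is released on GitHub as per-platform tarballs
  echo "https://github.com/mozilla/geckodriver/releases/download/v${1}/geckodriver-v${1}-linux64.tar.gz"
}

url="$(geckodriver_url "$GECKODRIVER_VERSION")"
echo "$url"
# In a Dockerfile RUN step you could then, e.g.:
#   curl -fsSL "$url" | tar -xz -C /usr/local/bin
```

Bumping the pinned version then becomes a one-line change instead of depending on whatever the host machine has.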
Hey @dmarti! Just wanted to check in on your progress with running the crawler in Docker. Do you need any support? Let us know if we can help with anything.
Hi @sophieeng, I got a little stuck figuring out the right source code and/or Linux packages for Selenium. It didn't look like geckodriver was packaged with the Firefox Nightly for Linux download that I was using.
If I can get source for known good Firefox and Selenium downloads that work together that would help (I don't have a Mac to test on)
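That matches how Mozilla distributes these on Linux: geckodriver ships separately from the Firefox tarball. A rough sketch of the install steps (e.g., as Dockerfile RUN commands) might look like the following; the exact download URLs, archive format, and the pinned geckodriver version are assumptions to verify, not known-good values:

```shell
#!/bin/sh
# Sketch only: URLs, archive format, and versions are assumptions to adapt.
set -e

# Firefox Nightly for Linux is a standalone tarball from Mozilla's download
# endpoint; it does NOT bundle geckodriver.
curl -fsSL -o /tmp/firefox.tar.xz \
  "https://download.mozilla.org/?product=firefox-nightly-latest-ssl&os=linux64&lang=en-US"
tar -xJf /tmp/firefox.tar.xz -C /opt        # unpacks to /opt/firefox/firefox
ln -s /opt/firefox/firefox /usr/local/bin/firefox

# geckodriver is a separate download from its own GitHub releases page
curl -fsSL "https://github.com/mozilla/geckodriver/releases/download/v0.34.0/geckodriver-v0.34.0-linux64.tar.gz" \
  | tar -xz -C /usr/local/bin
```

The crawler's `.setBinary(...)` call would then point at the unpacked Linux binary path instead of the macOS `.app` path.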
@franciscawijaya and @Mattm27 will work on this over the next couple of weeks.
@dmarti is using Linux and not macOS. So, @dmarti's question is which Selenium and Firefox version work for Linux.
Hey @dmarti! We have resumed our efforts on the dockerization of our web crawler, and I’ve been reviewing the progress you made in the spring as a foundation for our work.
However, I noticed that the myextension.xpi file has been removed from the codebase, and I'm a bit unclear about the reasoning behind this change. Could you please provide some clarification on why it was deleted? Thanks!
I've been dealing with two main problems: the container closing immediately after starting (exit code 255) and conflicts between MySQL and MariaDB installations. The container issue seems related to how systemd is set up, while the MySQL vs. MariaDB problem is likely due to package conflicts.
In terms of next steps, the plan is to create a new Dockerfile from scratch, focusing first on getting the container to stay running, and then adding Apache, geckodriver, either MySQL or MariaDB, and so on, one step at a time to avoid further conflicts. Once these issues are resolved, the Dockerization should be much closer to completion, and the existing .sh scripts should work correctly with a stable setup.
The Docker image itself is building successfully, meaning all the required dependencies and configurations are being included properly. However, the issue arises when creating a container from that image—systemd is failing to initialize, causing the container to immediately exit. This distinction is important because it suggests the problem is not with the build process, but rather with how the container is running or managing processes once started.
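A common cause of that immediate exit is that systemd expects to run as PID 1 with privileges and cgroup access that a default container does not grant. One frequently used workaround, sketched below with hypothetical service names and paths, is to skip systemd entirely and start each service from a plain entrypoint script, keeping one process in the foreground so the container stays alive:

```shell
#!/bin/sh
# Hypothetical container entrypoint: instead of booting systemd, start each
# background service directly, then exec one long-running foreground process
# so it becomes PID 1 and keeps the container alive.
set -e

service mariadb start      # or mysqld_safe & -- depends on the base image
apachectl start            # Apache detaches into the background on its own

# Hypothetical main process; path depends on how the crawler is laid out.
exec node rest_api/index.js
```

Whether `service`/`apachectl` exist depends on the base image, but the general shape (background services plus one `exec`'d foreground process) avoids the systemd-as-PID-1 problem entirely.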
As discussed in our meeting, @Mattm27 will start a fresh Docker implementation.
For regular testing, it would be useful to be able to run it in a container with one command. I have the REST API and extension build steps largely working and am working out how to do the actual crawl. Work in progress...
https://github.com/privacy-tech-lab/gpc-web-crawler/compare/main...dmarti:gpc-web-crawler:dockerize?expand=1
Not ready to discuss or merge yet, just wanted to see if there is interest. In the long run I'd like to be able to run the crawler as a service that doesn't need much attention, just sends reports if a site being watched has broken GPC.
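For concreteness, the "one command" goal above might eventually look something like this (the image name, port, and volume path are placeholders, not anything that exists in the repo yet):

```shell
# Hypothetical single-command invocation once the image is built; adjust the
# image name, published port, and output directory to the real setup.
docker run --rm \
  -p 8080:8080 \
  -v "$(pwd)/crawl_results:/app/crawl_results" \
  gpc-web-crawler:latest
```

The volume mount would let crawl results land on the host, which is also what a low-maintenance "report when a watched site breaks GPC" service would need.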