mediacloud / backend

Media Cloud is an open source, open data platform that allows researchers to answer quantitative questions about the content of online media.
http://www.mediacloud.org
GNU Affero General Public License v3.0

docker compose up not starting the media cloud in production as stated in the docs. #717

Closed esirK closed 4 years ago

esirK commented 4 years ago

Hello. Firstly, thank you for the amazing tool that you have created for us. My team and I are trying to run our own instance of Media Cloud and would appreciate some help with it. Following the documentation provided, we have been able to get to the following point:

[Screenshot 2020-06-24 at 09 35 30]

We went ahead and pulled the pre-built images and, as stated in the instructions for running Media Cloud in production, copied docker-compose.dist.yml and ran the docker-compose up command. This is the point where we are experiencing difficulties. Attached are the error logs: [Screenshot 2020-06-24 at 14 40 58] [Screenshot 2020-06-24 at 14 41 56]

Is there something we are currently missing that should make this work?

pypt commented 4 years ago

Hey @esirK, dunno, I think this is Docker's own issue or a misconfigured host OS. Try increasing the port range, changing the subnet used by the containers, and/or upgrading Docker or your host OS.
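The `write sysctl key net.ipv4.ip_local_port_range ... invalid argument` errors that appear later in this thread suggest the kernel is rejecting the value being written. As a rough sanity check (slightly stricter than the kernel's own rule, which also accepts low equal to high), a candidate range string can be validated like this; the function name is mine, not part of any tool:

```python
def check_port_range(value: str) -> bool:
    """Rough validity check for a net.ipv4.ip_local_port_range value.

    The kernel expects two integers "low high", both within 1..65535;
    reversed or malformed values are rejected with EINVAL ("invalid
    argument"), which is the error message seen in this thread.
    """
    parts = value.split()
    if len(parts) != 2:
        return False
    try:
        low, high = int(parts[0]), int(parts[1])
    except ValueError:
        return False
    return 1 <= low < high <= 65535

# The range Media Cloud's compose file sets via sysctls:
print(check_port_range("1024 65500"))
# A reversed range, which the kernel would refuse:
print(check_port_range("65535 1024"))
```

Note that later in this thread the error persisted for every value tried and the deployment only succeeded once the sysctl override was commented out of the compose file, so the failure may have been the container runtime refusing the write at all rather than a bad value.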

esirK commented 4 years ago

> Hey @esirK, dunno, I think this is Docker's own issue or misconfigured host OS. Try increasing port range, changing subnet used by containers, and / or upgrading Docker or your host OS.

Thanks @pypt for the response. I will try this out and get back to you.

esirK commented 4 years ago

Hello @pypt, the two approaches failed. I am currently using Docker version 19.03.4, build 9013bf583a. Which version are you currently using? I can use the exact same version and see if it works.

pypt commented 4 years ago

Could you provide some logs? Also, what is it that you're trying to achieve by running Media Cloud?

pypt commented 4 years ago

Thanks for the writeup, very interesting!

So, would you like to use Media Cloud, or run an instance of Media Cloud yourselves, or?..

esirK commented 4 years ago

@pypt For more context, I am part of the technologists from CodeForAfrica https://github.com/CodeForAfrica/ and we've been significantly expanding our support for misinfo projects in Africa over the past 4yrs (after seeding the continent's first fact-checking initiative, AfricaCheck, back in 2012, and more recently underwriting the establishment of newsroom-based fact-check desks such as https://pigafirimbi.africauncensored.online/). All of this external support is driven by our own inhouse misinfo 'lab', PesaCheck, which currently has full-time research staff in 15 African countries and that supports fact-check or investigative data journalism desks at 50+ partner newsrooms and watchdog NGOs. Separate from this, our disinfo data science team, the iLAB, helps incubate the DFRLabs in Africa (which produce this kind of research) and works closely with folk like https://disinformationindex.org/ to help 'harden' business systems in media to make them less susceptible to manipulation. This combined mis-/disinfo work has now reached a scale-of-economy that we're broadening the scope to tackle malign content designed to polarise African societies, ranging from hate speech to radicalization/militancy (around religion, ethnic and racial fault lines, etc). We've started building lexicons for detecting the ever-evolving trigger language, etc, and are also mapping both PEP/PIP entities who help drive this content.

We want to use the combined resources for real-time intel with actionable insights/data for newsrooms & civic watchdogs in the run-up to elections in Ethiopia, Kenya, Mali, Niger, CAR, etc, over the next year, while also launching full-scale monitoring for coordinated campaigns by both State-level actors (Saudi, Russia, China, India, Israel, Turkey, etc) and by organized crime (like our work that was cited in this NYT piece). And, as longer-tail outputs, we want to plug in our more traditional 'share-of-voice' media research folk across the network to drive substantive media analysis on everything from gender to topics such as climate/crime, etc, in Africa's media. Finally, we also want API integration to drive tools like our PromiseTrackers (which monitor promises by politicos, etc). We'll do this incrementally, to ensure coherence. But, even then, the core underlying tools are obviously going to be mission-critical. Basically, so far, we've been using a combo of two tools that we seed-funded some years back, Aleph (https://github.com/alephdata) and Dexter (https://github.com/Code4SA/mma-dexter/wiki), for a range of election monitoring, media monitoring/analysis and investigative entity mapping/analysis. They've been impactful, steering everything from our Panama Papers to mafia investigations. Dexter was a great proof-of-concept but always an inadequate stopgap measure that we intended to scale up. The code is, frankly, poorly conceived/executed, and trying to force Aleph to do stuff it wasn't designed for is becoming too cumbersome.

So, after scoping what's out there, we've figured out that MediaCloud seems to be the most robust foundation for us to start building on. The aim is to spin up an instance that replaces Dexter, and that integrates with Aleph for entity data, etc. In addition to monitoring African media sources, we'd use our instance of MediaCloud to also ingest parliamentary/Hansard transcripts and political partner communique, key blogs and UGC fora (which are pivotal in places like Tanzania), etc.

We've also been chatting with Rahul Bhargava and Ethan Zuckerman (over email), and they understand our project background. Thank you.

esirK commented 4 years ago

> Thanks for the writeup, very interesting!
>
> So, would you like to use Media Cloud, or run an instance of Media Cloud yourselves, or?..

Sorry I had deleted the writeup by mistake. We would like to run an instance of Media Cloud.

pypt commented 4 years ago

I still suspect that you're encountering Docker problems:

We currently run Media Cloud on 8 or so servers (via Docker Swarm), and the sample docker-compose.yml is somewhat tailored to that fact, so running everything on a single machine will require you to edit the Compose configuration quite a bit.
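One way to adapt the distributed Compose file for a single host is an override file. Everything below is a hypothetical sketch, not Media Cloud's actual configuration: the service name is taken from the error logs above, and the values mirror the reductions described later in this thread. Note also that plain docker-compose up ignores most `deploy:` keys, which are honoured only under Docker Swarm:

```yaml
# docker-compose.single-host.yml -- hypothetical override; merge it with:
#   docker-compose -f docker-compose.dist.yml -f docker-compose.single-host.yml up
version: "3.7"
services:
  webapp-api:
    deploy:
      replicas: 1
      placement:
        constraints: []   # drop node-label constraints that assume a multi-node Swarm
      resources:
        limits:
          cpus: "1"
          memory: 256M
```

The same pattern would be repeated for each service kept in the single-host deployment.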

esirK commented 4 years ago

Thank you @pypt for your response. Concerning which Linux distribution and kernel version I'm running: it is Ubuntu 18.04 in a virtual machine. I am also trying to run it directly on macOS.

```
ERROR: for apps_proxy-cron-certbot_1 Cannot create container for service proxy-cron-certbot: failed to mount local volume: mount /space/mediacloud/vol_proxy_ssl_certs:/var/lib/docker/volumes/apps_vol_proxy_ssl_certs/_data, flags: 0x1000: no such file or directory
Creating apps_postgresql-server_1 ... error
ERROR: for apps_rabbitmq-server_1 Cannot create container for service rabbitmq-server: failed to mount local volume: mount /space/mediacloud/vol_rabbitmq_data:/var/lib/docker/volumes/apps_vol_rabbitmq_data/_data, flags: 0x1000: no such file or directory
Creating apps_mail-opendkim-server_1 ... error
Creating apps_topics-snapshot_1 ... error
ERROR: for apps_mail-opendkim-server_1 Cannot create container for service mail-opendkim-server: failed to mount local volume: mount /space/mediacloud/vol_opendkim_config:/var/lib/docker/volumes/apps_vol_opendkim_config/_data, flags: 0x1000: no such file or directory
ERROR: for apps_topics-fetch-link_1 Cannot start service topics-fetch-link: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"write sysctl key net.ipv4.ip_local_port_range: write /proc/sys/net/ipv4/ip_local_port_range: invalid argument\"": unknown
Creating apps_sitemap-fetch-media-pages_1 ... error
ERROR: for apps_solr-zookeeper_1 Cannot start service solr-zookeeper: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"write sysctl key net.ipv4.ip_local_port_range: write /proc/sys/net/ipv4/ip_local_port_range: invalid argument\"": unknown
Creating apps_extract-article-from-page_1 ... error
ERROR: for apps_sitemap-fetch-media-pages_1 Cannot start service sitemap-fetch-media-pages: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"write sysctl key net.ipv4.ip_local_port_range:
Creating apps_webapp-api_1 ... error
ERROR: for apps_topics-snapshot_1 Cannot start service topics-snapshot: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"write sysctl key net.ipv4.ip_local_port_range: write /proc/sys/net/ipv4/ip_local_port_range: invalid argument\"": unknown
ERROR: for apps_extract-article-from-page_1 Cannot start service extract-article-from-page: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"write sysctl key net.ipv4.ip_local_port_range: write /proc/sys/net/ipv4/ip_local_port_range: invalid argument\"": unknown
ERROR: for apps_webapp-api_1 Cannot start service webapp-api: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"write sysctl key net.ipv4.ip_local_port_range: write /proc/sys/net/ipv4/ip_local_port_range: invalid argument\"": unknown
ERROR: for webapp-httpd Cannot create container for service webapp-httpd: failed to mount local volume: mount /space/mediacloud/vol_daily_rss_dumps:/var/lib/docker/volumes/apps_vol_daily_rss_dumps/_data, flags: 0x1000: no such file or directory
ERROR: for proxy-cron-certbot Cannot create container for service proxy-cron-certbot: failed to mount local volume: mount /space/mediacloud/vol_proxy_ssl_certs:/var/lib/docker/volumes/apps_vol_proxy_ssl_certs/_data, flags: 0x1000: no such file or directory
ERROR: for rabbitmq-server Cannot create container for service rabbitmq-server: failed to mount local volume: mount /space/mediacloud/vol_rabbitmq_data:/var/lib/docker/volumes/apps_vol_rabbitmq_data/_data, flags: 0x1000: no such file or directory
ERROR: for postgresql-server Cannot create container for service postgresql-server: failed to mount local volume: mount /space/mediacloud/vol_postgresql_data:/var/lib/docker/volumes/apps_vol_postgresql_data/_data, flags: 0x1000: no such file or directory
ERROR: for mail-opendkim-server Cannot create container for service mail-opendkim-server: failed to mount local volume: mount /space/mediacloud/vol_opendkim_config:/var/lib/docker/volumes/apps_vol_opendkim_config/_data, flags: 0x1000: no such file or directory
ERROR: for topics-fetch-link Cannot start service topics-fetch-link: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"write sysctl key net.ipv4.ip_local_port_range: write /proc/sys/net/ipv4/ip_local_port_range: invalid argument\"": unknown
ERROR: for solr-zookeeper Cannot start service solr-zookeeper: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"write sysctl key net.ipv4.ip_local_port_range: write /proc/sys/net/ipv4/ip_local_port_range: invalid argument\"": unknown
ERROR: for sitemap-fetch-media-pages Cannot start service sitemap-fetch-media-pages: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"write sysctl key net.ipv4.ip_local_port_range: write /proc/sys/net/ipv4/ip_local_port_range: invalid argument\"": unknown
ERROR: for topics-snapshot Cannot start service topics-snapshot: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"write sysctl key net.ipv4.ip_local_port_range: write /proc/sys/net/ipv4/ip_local_port_range: invalid argument\"": unknown
ERROR: for extract-article-from-page Cannot start service extract-article-from-page: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"write sysctl key net.ipv4.ip_local_port_range: write /proc/sys/net/ipv4/ip_local_port_range: invalid argument\"": unknown
ERROR: for webapp-api Cannot start service webapp-api: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"write sysctl key net.ipv4.ip_local_port_range: write /proc/sys/net/ipv4/ip_local_port_range: invalid argument\"": unknown
ERROR: Encountered errors while bringing up the project.
```


However, when I use `docker stack deploy -c apps/docker-compose.dist.yml mediacloud`, the swarm starts, but all services stay in a pending state with a `no suitable node` error, which makes me think it has something to do with the requirements (e.g. resources) of some of the containers.
I have set all replicas to `1` and reduced all CPUs to 1 and memory to 256M. I have also commented out all `nytlabels*`, `cliff-*`, and `podcast-*` services. Finally, I have reduced the number of Solr shards.
A couple of questions:
- What are the minimum system requirements for running the services on a single machine?
- Which are the core/required services that I must have in order to get the system up, and is it okay to comment out the rest of the services?

esirK commented 4 years ago

Hello, @pypt If I manage to get 8+ servers, will the compose file work?

pypt commented 4 years ago

Among other errors, you get a few cases where Docker is unable to mount a directory from the host into the container as a volume, e.g.:

```
ERROR: for apps_mail-opendkim-server_1  Cannot create container for service mail-opendkim-server: failed to mount local volume: mount /space/mediacloud/vol_opendkim_config:/var/lib/docker/volumes/apps_vol_opendkim_config/_data, flags: 0x1000: no such file or directory
```

so make sure those directories exist on the host, or edit docker-compose.yml accordingly.
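The mount errors above all point at missing directories under /space/mediacloud. A small script can create them in one go; the directory names below are taken from the error messages, and on the production host this would need to run with sufficient privileges (the base path is parameterised so the sketch can be tried anywhere):

```python
import os

# Bind-mount source directories named in the mount errors above.
VOLUME_DIRS = [
    "vol_proxy_ssl_certs",
    "vol_rabbitmq_data",
    "vol_opendkim_config",
    "vol_postgresql_data",
    "vol_daily_rss_dumps",
]

def ensure_volume_dirs(base="/space/mediacloud"):
    """Create each expected volume directory under `base`; return the paths."""
    paths = []
    for name in VOLUME_DIRS:
        path = os.path.join(base, name)
        os.makedirs(path, exist_ok=True)  # no-op if the directory already exists
        paths.append(path)
    return paths

# On the production host you would call ensure_volume_dirs() with the
# default base; adjust it if your compose file maps volumes elsewhere.
```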

You might be getting the no suitable node error because your Swarm node (basically, your computer) doesn't have the necessary labels. Those get assigned via the Ansible provisioning playbook available here:

https://github.com/berkmancenter/mediacloud/tree/master/provision

See inventory/hosts.sample.yml for an example of how hosts are configured, and try provisioning your own machine with this playbook (you might need some minor adjustments here and there, as the playbook is made to work on Ubuntu 16.04 and you're on 18.04).

As for the minimum system requirements, I don't know, TBH. We've made this thing to run it ourselves, so we haven't put much thought into making it easy for others to run it; e.g., we never worked out the smallest machine it could run on. My guess would be that a typical laptop could manage running Media Cloud fine with just a few media sources added to it and a bunch of extra services (e.g. the heavy CLIFF and NYTLabels) disabled altogether.

esirK commented 4 years ago

Hello @pypt. My kernel version is 4.15.0-112-generic. My default net.ipv4.ip_local_port_range is 32768 60999; I have changed this to different values, e.g. 40000 65535, 35536 60999, and even 1024 65535, but I still receive the same errors for all the services:

```
ERROR: for webapp-httpd  Cannot start service webapp-httpd: OCI runtime create failed: container_linux.go:346: starting container process caused "process_linux.go:449: container init caused \"write sysctl key net.ipv4.ip_local_port_range: write /proc/sys/net/ipv4/ip_local_port_range: invalid argument\"": unknown
```

I also upgraded Docker to the latest version, 19.03.12, and I have commented out most of the services. Here is what I currently have in the compose file: https://github.com/esirK/mediacloud/blob/82686bbf6350238e45985efba38b400b908cddcb/apps/docker-compose.dist.yml. I was able to fix the `no suitable node` error by adding the labels as you suggested. I have also tried changing the subnet used by the containers to `subnet: "10.0.0.0/8"`, which causes the services to remain in the New state without running. I am still trying to make this work, but if you get a chance to look at my compose file, I'd appreciate you letting me know if there is anything I can change or add. Thank you.

esirK commented 4 years ago

@pypt I was able to do the deployment, but I had to comment out the `- net.ipv4.ip_local_port_range="1024 65500"` entry inside `x-sysctl-defaults: &sysctl-defaults`, and also this `ipam` block:

```yaml
# ipam:
#     driver: default
#     config:
#         # Docker (Compose?) sometimes defaults to a subnet with only
#         # 255 available addresses
#         #
#         # If you change this subnet, make sure that you update it
#         # elsewhere too, e.g. in "mail-opendkim-server"'s TrustedHosts
#         # or "mail-postfix-server" Dockerfile
#         - subnet: "10.0.0.0/8"
```

Thank you very much for the help provided. I will reach out again in case I face another blocker.

pypt commented 4 years ago

Great to hear that @esirK! Let us know if you encounter any other issues.

esirK commented 4 years ago

Hello @pypt, a quick question: for the Explorer frontend app, is crawler-ap one of the required Swarm services? And if so, I will be required to contact the Associated Press for MC_CRAWLER_AP_API_KEY, right?

hroberts commented 4 years ago

You do not need that service. You will just not get stories directly from the AP API, but you can still download AP stories via their public RSS feeds like any other source. The public feeds had about a third of all AP stories last time we checked.


esirK commented 4 years ago

> You do not need that service. You will just not get stories directly from the AP API, but you can still download AP stories via their public RSS feeds like any other source. The public feeds had about a third of all AP stories last time we checked.

Thank you @hroberts

esirK commented 4 years ago

I was able to get the instance running, and using the source manager frontend app I added a couple of sources to the instance. However, no data is showing up in Solr, and therefore the Explorer app doesn't get any data. My question, therefore, is: which service regularly fetches RSS feeds from the media sources? This is to make sure I have it running. cc @pypt @hroberts

esirK commented 4 years ago


Also, to add to this: I have the apps_cron-generate-daily-rss-dumps_1 service running, which I think might be the one that fetches the RSS feeds, but there is still no data in Solr. cc @pypt @hroberts

pypt commented 4 years ago

Once you add a new media source, rescrape-media is supposed to scrape it, looking for RSS / Atom feeds to add to the feeds table.

Then, crawler-provider will periodically add jobs to a PostgreSQL table, and instances of crawler-fetcher will fetch those jobs looking for new stories (look up the pop_queued_download() PL/pgSQL function).

cron-generate-daily-rss-dumps doesn't have much to do with any of this so you can safely disable it.

Let us know if you have any more questions.