unitedstates / inspectors-general

Collecting reports from Inspectors General across the US federal government.
https://sunlightfoundation.com/blog/2014/11/07/opengov-voices-opening-up-government-reports-through-teamwork-and-open-data/
Creative Commons Zero v1.0 Universal

[education] Scraper is hanging #297

Closed divergentdave closed 7 years ago

divergentdave commented 7 years ago

Scrapers are intermittently hanging, and I don't yet know why. I noticed this yesterday, once enough hung processes had piled up to exhaust memory; it showed up on the dashboard when subprocesses could no longer fork.

I spun up a fresh machine last night, and right now there are two lingering scraper processes (see below). Over the same period, six cron jobs ran successfully. strace shows both processes reading from a socket, so presumably we are doing some network access without a timeout.

    1  1127  1127  1127 ?           -1 Ss       0   0:00 /usr/sbin/cron -f
 1127 28348  1127  1127 ?           -1 S        0   0:00  \_ /usr/sbin/CRON -f
28348 28349 28349 28349 ?           -1 Ss    1000   0:00  |   \_ /bin/sh -c bash -c '...
28349 28350 28349 28349 ?           -1 S     1000   0:00  |       \_ bash -c cd ...
28350 28527 28349 28349 ?           -1 S     1000   5:23  |           \_ python igs
 1127 31346  1127  1127 ?           -1 S        0   0:00  \_ /usr/sbin/CRON -f
31346 31347 31347 31347 ?           -1 Ss    1000   0:00      \_ /bin/sh -c bash -c '...
31347 31348 31347 31347 ?           -1 S     1000   0:00          \_ bash -c cd ...
31348 31525 31347 31347 ?           -1 S     1000   5:24              \_ python igs
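A read with no timeout blocks indefinitely if the remote end stops sending, which would match what strace shows. The following is a minimal sketch of the difference a timeout makes, using only the standard library. The helper names `read_with_timeout` and `demo_silent_server` are hypothetical, and the local "silent server" is only a stand-in for a stalled remote host. Which HTTP client the scrapers actually use is not shown in this thread.

```python
import socket


def read_with_timeout(host, port, timeout_s=30.0, nbytes=4096):
    """Connect and read once; raise socket.timeout instead of blocking forever."""
    # create_connection applies the timeout to connect() and to later recv() calls.
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        return sock.recv(nbytes)


def demo_silent_server(timeout_s=0.5):
    """Stand-in for a stalled host: the connection is accepted by the kernel,
    but no data is ever sent back."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    host, port = server.getsockname()
    try:
        read_with_timeout(host, port, timeout_s=timeout_s)
    except socket.timeout:
        return "timed out"
    finally:
        server.close()
    return "got data"


if __name__ == "__main__":
    print(demo_silent_server())
```

Without the `timeout=` argument, the `recv()` call in this demo would never return. If the scrapers go through `requests`, the equivalent would be passing `timeout=` on each call, but that is an assumption about the codebase.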

divergentdave commented 7 years ago

According to gdb, one process was reading from an SSL socket.

Here's what I got from /proc/<PID>/net/tcp:

  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
 269: F2391FAC:DBD0 2015E0A5:01BB 01 00000000:00000000 00:00000000 00000000  1000        0 1629135 1 ffff8800373f6900 708 4 26 2 -1

Edit: local_address is 172.31.57.242:56272 and remote_address is 165.224.21.32:443. (The address bytes are printed in reversed order, but the port is plain hex, so 01BB is 443, which fits with the SSL socket gdb reported.) The remote host is ed.gov, so I'll look into that more.
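For reference, here is a small sketch of decoding those address fields by hand. It assumes IPv4 and a little-endian host (for example x86), where the address bytes appear reversed while the port is ordinary hex. The function name `decode_proc_net_tcp_addr` is made up for illustration.

```python
import socket
import struct


def decode_proc_net_tcp_addr(field):
    """Decode an IPv4 'ADDR:PORT' field from /proc/<PID>/net/tcp."""
    addr_hex, port_hex = field.split(":")
    # Assumption: little-endian host, so the address bytes appear reversed.
    ip = socket.inet_ntoa(struct.pack("<I", int(addr_hex, 16)))
    # The port is printed in host order and needs no byte swapping.
    port = int(port_hex, 16)
    return ip, port


print(decode_proc_net_tcp_addr("F2391FAC:DBD0"))  # -> ('172.31.57.242', 56272)
print(decode_proc_net_tcp_addr("2015E0A5:01BB"))  # -> ('165.224.21.32', 443)
```

Port 443 on the remote side is consistent with an HTTPS connection to ed.gov.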

divergentdave commented 7 years ago

I killed one of the hung processes, and this was the tail end of the log.

https://www2.ed.gov/about/offices/list/oig/otheroigproducts.html
[other][2016-09-26][x21q0001]
        report: education/2016/x21q0001/report.pdf
        text: education/2016/x21q0001/report.txt
        data: education/2016/x21q0001/report.json
[other][2016-08-15][coveredsystems088152016]
        report: education/2016/coveredsystems088152016/report.pdf
        text: education/2016/coveredsystems088152016/report.txt
        data: education/2016/coveredsystems088152016/report.json
[other][2016-02-29][scrareport02292016]
        report: education/2016/scrareport02292016/report.pdf
        text: education/2016/scrareport02292016/report.txt
        data: education/2016/scrareport02292016/report.json
https://www2.ed.gov/about/offices/list/oig/specialreportstocongress.html
https://www2.ed.gov/about/offices/list/oig/testimon.html
https://www2.ed.gov/about/offices/list/oig/ireports.html
bash: line 1: 31525 Terminated              python igs
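Until the missing timeout is found, one stopgap would be a wall-clock limit on each scraper run so that hung processes cannot pile up. This is only a sketch of that idea, not something the project is known to do. `run_with_deadline` is a hypothetical helper, and the deadline value would need tuning against real scraper run times.

```python
import subprocess
import sys


def run_with_deadline(cmd, deadline_s):
    """Run cmd; return its exit code, or None if it was killed at the deadline."""
    try:
        return subprocess.run(cmd, timeout=deadline_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising.
        return None


if __name__ == "__main__":
    # A child that sleeps far longer than the deadline stands in for a hung scraper.
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_deadline(hung, 1.0))
```

Note that this only kills the direct child. If the command is a shell wrapper, as in the cron entries above, processes further down the tree may need separate handling.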