iridium-soda / metadata_crawler_docker

Deploy the crawler as a Docker image.
MIT License

Proxy IP addresses expire frequently and need to be re-fetched #1

ArcherSore closed this issue 6 months ago

ArcherSore commented 6 months ago

Issue Description: In testing so far, the problem occurs only when crawling all_images_b, not when crawling _-019ac. With all_images_b, the first 12 entries are processed normally, but from the 13th entry onward the proxy IP is repeatedly reported as expired and has to be re-fetched.

Steps to Reproduce: Ubuntu 22.04, with the base environment configured as described in the README.

The docker-compose.yaml file is as follows:

version: "3.8"
services:
  crawler:
    build: .
    image: crawler:latest
    container_name: metadata_b
    networks:
      - meta
    environment:
      - PREFIX=b
      - API_URL=http://api.proxy.ip2world.com/getProxyIp?lb=4&return_type=txt&protocol=http&num=1
      - MONGO_HOST=mongo
      - MONGO_PORT=27017
      - DB_NAME=metadata
    volumes:
      - ./../docker_images/data:/data
    deploy:
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
  mongo:
    image: mongo:latest
    container_name: mongo
    networks:
      - meta
    volumes:
      - ./../database:/data/db
      - ./data:/result
    #command: mongosh  --eval "use ${DB_NAME:-metadata};"
    restart: unless-stopped
networks:
  meta:
    driver: bridge

Run the command sudo docker-compose up -d.

Error Log Information:

{"log":"[WARNING] 2024-03-29 12:53:14 [Connection] Proxy 43.152.113.218:19348 has expired; ready to re-fetch\n","stream":"stdout","time":"2024-03-29T04:53:14.996838827Z"}
{"log":"[INFO] 2024-03-29 12:53:14 Proxy 43.152.113.218:19348 is expired. Fetching a new one\n","stream":"stdout","time":"2024-03-29T04:53:14.996942898Z"}
{"log":"[INFO] 2024-03-29 12:53:15 Got proxy 43.152.113.218:19130 with lifespan 5\n","stream":"stdout","time":"2024-03-29T04:53:15.091310189Z"}
{"log":"[WARNING] 2024-03-29 12:53:18 [Connection] Proxy 43.152.113.218:19130 has expired; ready to re-fetch\n","stream":"stdout","time":"2024-03-29T04:53:18.920339583Z"}
{"log":"[INFO] 2024-03-29 12:53:18 Proxy 43.152.113.218:19130 is expired. Fetching a new one\n","stream":"stdout","time":"2024-03-29T04:53:18.920382725Z"}
{"log":"[INFO] 2024-03-29 12:53:19 Got proxy 43.152.113.218:19778 with lifespan 5\n","stream":"stdout","time":"2024-03-29T04:53:19.008106739Z"}
{"log":"[WARNING] 2024-03-29 12:53:22 [Connection] Proxy 43.152.113.218:19778 has expired; ready to re-fetch\n","stream":"stdout","time":"2024-03-29T04:53:22.834829683Z"}

Part of the file all_images_b.list is shown below:

alpine,2016-06-03T16:38:49.371406Z,2024-03-15T23:56:47.772803Z
haproxy,2016-06-09T14:22:05.264691Z,2024-03-13T18:58:33.096869Z
ubuntu,2016-06-03T16:21:27.736179Z,2024-03-06T03:05:00.448579Z
bash,2017-04-16T17:33:11.605221Z,2024-03-06T01:03:50.606274Z
flink,2018-09-21T18:01:43.764072Z,2024-03-06T13:20:57.03277Z
websphere-liberty,2018-11-30T01:42:27.015459Z,2024-03-06T12:23:26.047566Z
open-liberty,2018-09-21T18:03:20.457951Z,2024-03-06T13:24:55.320625Z
backdrop,2016-06-01T23:28:13.005766Z,2024-03-12T14:06:04.17552Z
busybox,2016-06-01T23:29:38.220032Z,2024-03-08T00:58:48.135874Z
postfixadmin,2019-01-22T00:12:12.412345Z,2024-03-12T22:59:44.011272Z
mageia,2016-06-01T23:26:34.786205Z,2021-04-08T19:40:26.279911Z
balena/aarch64-supervisor,2018-10-26T09:04:38.28353Z,2022-04-26T22:29:18.412409Z
balena/armv7hf-supervisor,2018-10-26T09:03:59.788141Z,2022-04-26T22:23:38.718145Z
balenalib/bananapi-m1-plus-alpine-node,2018-10-24T19:15:15.531375Z,2024-01-04T20:27:09.814764Z
balenalib/bananapi-m1-plus-debian-node,2018-10-24T19:15:14.604174Z,2024-03-03T21:11:18.209157Z
balenalib/bananapi-m1-plus-node,2018-10-24T19:15:15.562171Z,2024-03-03T21:11:27.193781Z
balenalib/amd64-alpine-node,2018-10-24T18:33:12.35072Z,2024-03-01T00:50:56.662971Z
balenalib/aarch64-ubuntu-node,2018-10-24T18:33:11.283206Z,2024-01-06T07:50:43.600178Z
balenalib/armv7hf-alpine-node,2018-10-24T18:33:10.116774Z,2024-01-04T20:02:45.525333Z
balenalib/am571x-evm-alpine-node,2018-10-24T19:15:23.488688Z,2022-07-25T21:24:39.487804Z

iridium-soda commented 6 months ago

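After investigation, the frequent proxy switching turns out to be caused by the large number of requests sent to fetch tags and build history. This is expected behavior and does not need to be changed. For example: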

[INFO] 2024-03-29 19:16:27 Proxy 43.152.113.218:19118 now has 3 lifetime left.
[INFO] 2024-03-29 19:16:28 Proxy 43.152.113.218:19118 now has 2 lifetime left.
[INFO] 2024-03-29 19:16:29 [24/872454] balenalib/apalis-imx6q-ubuntu-node has 10 tags now
[INFO] 2024-03-29 19:16:29 Proxy 43.152.113.218:19118 now has 1 lifetime left.
[INFO] 2024-03-29 19:16:30 [24/872454] balenalib/apalis-imx6q-ubuntu-node has 20 tags now
[INFO] 2024-03-29 19:16:30 Proxy 43.152.113.218:19118 now has 0 lifetime left.
[WARNING] 2024-03-29 19:16:30 [Connection] Proxy 43.152.113.218:19118 has expired; ready to re-fetch
[INFO] 2024-03-29 19:16:30 Proxy 43.152.113.218:19118 is expired. Fetching a new one
[INFO] 2024-03-29 19:16:30 Got proxy 43.159.30.199:19370 with lifespan 5
[INFO] 2024-03-29 19:16:31 [24/872454] balenalib/apalis-imx6q-ubuntu-node has 30 tags now
[INFO] 2024-03-29 19:16:31 Proxy 43.159.30.199:19370 now has 4 lifetime left.
[INFO] 2024-03-29 19:16:32 [24/872454] balenalib/apalis-imx6q-ubuntu-node has 40 tags now
[INFO] 2024-03-29 19:16:32 Proxy 43.159.30.199:19370 now has 3 lifetime left.
[INFO] 2024-03-29 19:16:32 [24/872454] balenalib/apalis-imx6q-ubuntu-node has 50 tags now
[INFO] 2024-03-29 19:16:32 Proxy 43.159.30.199:19370 now has 2 lifetime left.
[INFO] 2024-03-29 19:16:33 [24/872454] balenalib/apalis-imx6q-ubuntu-node has 60 tags now
[INFO] 2024-03-29 19:16:33 Proxy 43.159.30.199:19370 now has 1 lifetime left.
[INFO] 2024-03-29 19:16:34 [24/872454] balenalib/apalis-imx6q-ubuntu-node has 70 tags now
[INFO] 2024-03-29 19:16:34 Proxy 43.159.30.199:19370 now has 0 lifetime left.
[WARNING] 2024-03-29 19:16:34 [Connection] Proxy 43.159.30.199:19370 has expired; ready to re-fetch
[INFO] 2024-03-29 19:16:34 Proxy 43.159.30.199:19370 is expired. Fetching a new one
[INFO] 2024-03-29 19:16:34 Got proxy 43.159.28.58:19384 with lifespan 5
[INFO] 2024-03-29 19:16:35 [24/872454] balenalib/apalis-imx6q-ubuntu-node has 80 tags now
[INFO] 2024-03-29 19:16:35 Proxy 43.159.28.58:19384 now has 4 lifetime left.
[INFO] 2024-03-29 19:16:36 [24/872454] balenalib/apalis-imx6q-ubuntu-node has 90 tags now
[INFO] 2024-03-29 19:16:36 Proxy 43.159.28.58:19384 now has 3 lifetime left.
[INFO] 2024-03-29 19:16:36 [24/872454] balenalib/apalis-imx6q-ubuntu-node has 100 tags now
[INFO] 2024-03-29 19:16:36 [24/872454] balenalib/apalis-imx6q-ubuntu-node has tags more than 100, turncate.
[INFO] 2024-03-29 19:16:36 [24/872454] balenalib/apalis-imx6q-ubuntu-node has 100 tags
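
The log above suggests that each proxy is handed out with a budget of 5 requests ("lifespan 5"), that every request lowers the remaining count by one, and that tags arrive 10 per request up to a cap of 100. Under those assumptions, an image with 100 or more tags needs 10 tag requests and therefore at least 2 proxies. The sketch below models that arithmetic; the class and function names are hypothetical, the page size and cap are inferred from the log rather than from the crawler's code, and the model fetches a replacement lazily on the next request, whereas the log shows the crawler re-fetching as soon as the count reaches 0.

```python
import itertools
import math


class BudgetedProxy:
    """Hypothetical model of a proxy that may serve a fixed number of requests."""

    def __init__(self, address, lifespan):
        self.address = address
        self.remaining = lifespan

    def use(self):
        """Spend one request from the budget and return what is left."""
        self.remaining -= 1
        return self.remaining

    @property
    def expired(self):
        return self.remaining <= 0


def requests_for_tags(tag_count, page_size=10, cap=100):
    """Requests needed to page through tags; page_size and cap are inferred from the log."""
    return math.ceil(min(tag_count, cap) / page_size)


def proxies_needed(total_requests, lifespan, fetch_proxy):
    """Count proxy fetches while issuing total_requests requests (lazy re-fetch)."""
    fetched = 0
    proxy = None
    for _ in range(total_requests):
        if proxy is None or proxy.expired:
            proxy = BudgetedProxy(fetch_proxy(), lifespan)
            fetched += 1
        proxy.use()
    return fetched


# Stand-in for the real proxy API: hands out placeholder addresses.
counter = itertools.count(1)
fake_fetch = lambda: f"proxy-{next(counter)}"

tag_requests = requests_for_tags(100)
print(tag_requests, proxies_needed(tag_requests, 5, fake_fetch))
```

Build-history requests come on top of the tag requests, so images with many tags will cycle through proxies every few seconds, which matches the behavior reported in this issue.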