opensanctions / crawler-planning

Task tracking for the crawlers we're working on
https://github.com/orgs/opensanctions/projects/2

Turkey Ministry Of Interior Terrorist Wanted List #108

Closed: Ketoch closed this issue 6 months ago

Ketoch commented 6 months ago

Data URL

https://en.terorarananlar.pol.tr/tarananlar

Publisher

The Ministry Of Interior

Publisher country/territory code

No response

Type of data

Crime/Wanted/Suspected (Persons suspected or convicted of crimes and listed by official law enforcement)

Coverage region

region:Global

Can you tell us more?

No response

This is a suggestion or request

pudo commented 6 months ago

I wonder if this is behind a crawl-protecting CDN; if so, we'd need to capture it manually.

Ketoch commented 6 months ago

I don't know for sure, but it looks that way.

fjuniorr commented 6 months ago

I've managed to manually get the data with:

curl 'https://www.terorarananlar.pol.tr/ISAYWebPart/TArananlar/GetTerorleArananlarList' \
  -X 'POST' \
  -H 'Content-Length: 0' \
  -H 'Content-Type: application/json'

However, when trying to replicate the call with Python requests, I'm getting an UNSAFE_LEGACY_RENEGOTIATION_DISABLED error:

2024-03-15 22:07:01 [info     ] Running dataset                [tr_wanted] data_path=datasets/tr_wanted data_time=2024-03-15T22:07:00 dataset=tr_wanted
2024-03-15 22:07:02 [error    ] HTTPSConnectionPool(host='www.terorarananlar.pol.tr', port=443): Max retries exceeded with url: /ISAYWebPart/TArananlar/GetTerorleArananlarList (Caused by SSLError(SSLError(1, '[SSL: UNSAFE_LEGACY_RENEGOTIATION_DISABLED] unsafe legacy renegotiation disabled (_ssl.c:992)'))) [tr_wanted] dataset=tr_wanted url=https://www.terorarananlar.pol.tr/ISAYWebPart/TArananlar/GetTerorleArananlarList

From https://github.com/urllib3/urllib3/issues/2653 I learned that this is apparently what happens with OpenSSL 3.0 when connecting to legacy servers that do not support secure renegotiation (RFC 5746), which OpenSSL 3.0 requires by default.

Is it advisable to save a local copy of the data in the repo, as in lt_illegal_websites, or is it better to use a workaround such as https://github.com/urllib3/urllib3/issues/2653#issuecomment-1733417634?
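
For reference, a minimal sketch of what such a workaround could look like, using only the Python standard library rather than requests. It builds an SSL context with OpenSSL's legacy server connect option enabled and replays the curl call above. The function names make_legacy_context and fetch_wanted_list are hypothetical, the response is returned as raw bytes because its format is not confirmed here, and this is not necessarily the approach taken in the linked comment.

```python
import ssl
import urllib.request

# URL taken from the curl call in this thread.
DATA_URL = "https://www.terorarananlar.pol.tr/ISAYWebPart/TArananlar/GetTerorleArananlarList"

# ssl.OP_LEGACY_SERVER_CONNECT only exists on Python 3.12+; 0x4 is the value
# of OpenSSL's SSL_OP_LEGACY_SERVER_CONNECT flag, used here as a fallback.
OP_LEGACY_SERVER_CONNECT = getattr(ssl, "OP_LEGACY_SERVER_CONNECT", 0x4)


def make_legacy_context() -> ssl.SSLContext:
    """Default context (certificate checks stay on) that also accepts
    servers without secure renegotiation support."""
    ctx = ssl.create_default_context()
    ctx.options |= OP_LEGACY_SERVER_CONNECT
    return ctx


def fetch_wanted_list(url: str = DATA_URL, timeout: float = 30.0) -> bytes:
    """POST an empty body, as in the curl call, and return the raw response."""
    request = urllib.request.Request(
        url,
        data=b"",  # empty body, so Content-Length is 0
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    context = make_legacy_context()
    with urllib.request.urlopen(request, context=context, timeout=timeout) as resp:
        return resp.read()
```

Inside a crawler that relies on requests, the same context could presumably be handed to a custom transport adapter mounted only for this host, so that other datasets keep the stricter default.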

jbothma commented 6 months ago

Yeah, I think it's fine to enable the unsafe renegotiation strategy, on the basis that we already have another sanctions list that is served over plain HTTP.

I'll notify them of the issue and ask that they look into upgrading.

Could you also add something like

from datetime import datetime

if datetime.now() > datetime(2024, 9, 15):
    context.log.warn("Check if the SSL renegotiation strategy is still needed")

in the crawl() function?