iipc / warc-specifications

Centralised repository for WARC usage specifications.
http://iipc.github.io/warc-specifications/

Gemini as a protocol for the WARC-Protocol field #85

acidus99 closed 1 year ago

acidus99 commented 1 year ago

Protocol name

Gemini

Protocol identifier

"gemini"

(By design, the Gemini spec has no version numbers)

Specification URL (optional)

https://gemini.circumlunar.space/docs/specification.gmi
https://en.wikipedia.org/wiki/Gemini_(protocol)

There are also some existing WARCs containing Gemini traffic on the Internet Archive:
https://archive.org/details/mozz-gemini-crawl-2020-1
https://archive.org/details/mozz-gemini-crawl-2020-2
https://archive.org/details/mozz-gemini-crawl-2020-3

Other

Gemini is a simple application protocol, similar to HTTP/0.9, that runs on top of TLS. There are around 3,000 servers and ~1M URLs. I run one of the search engines for it, with a crawler that generates WARC files, which are then consumed by the search index as well as by a Wayback Machine-style archive.
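To illustrate how small the protocol is, here is a rough Python sketch of a single Gemini exchange: the client sends the URL followed by CRLF over TLS, and the server replies with a `<status> <meta>` header line followed by the body. The function names (`build_request`, `parse_response_header`, `fetch`) are hypothetical and only for illustration. Certificate verification is disabled here because many Gemini servers use self-signed certificates; a real crawler would likely pin certificates on first use instead.

```python
import socket
import ssl
from urllib.parse import urlsplit

GEMINI_PORT = 1965  # default Gemini port


def build_request(url: str) -> bytes:
    """A Gemini request is just the absolute URL followed by CRLF."""
    return url.encode("utf-8") + b"\r\n"


def parse_response_header(raw: bytes):
    """Split a raw response into (status, meta, body).

    The header is a single line: two-digit status, a space, a meta string.
    """
    header, _, body = raw.partition(b"\r\n")
    status, _, meta = header.decode("utf-8").partition(" ")
    return int(status), meta, body


def fetch(url: str, timeout: float = 10.0):
    """Fetch one URL. Illustrative only: no redirects, no size limits."""
    parts = urlsplit(url)
    context = ssl.create_default_context()
    # Assumption: skip CA verification, since self-signed certs are common.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    address = (parts.hostname, parts.port or GEMINI_PORT)
    with socket.create_connection(address, timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=parts.hostname) as tls:
            tls.sendall(build_request(url))
            chunks = []
            while True:
                data = tls.recv(4096)
                if not data:  # server closes the connection when done
                    break
                chunks.append(data)
    return parse_response_header(b"".join(chunks))
```

Calling `fetch("gemini://example.org/")` (a placeholder host) would return the status code, the meta string, and the raw body bytes, which is roughly what a crawler would store as the record payload.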

If I understand the WARC-Protocol proposal correctly, because Gemini runs on top of TLS, implementers would use 2 WARC-Protocol fields like this:

WARC-Protocol: tls/1.3
WARC-Protocol: gemini
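
As a rough sketch of what that could look like when writing a record, the Python below builds a WARC header block with one `WARC-Protocol` line per protocol layer. The helper name `build_warc_header` is hypothetical, the `response` record type is my assumption, and `WARC-Protocol` itself is still only a proposal. Content-Type is left out because I am not sure which value would be appropriate for a Gemini payload.

```python
import uuid
from datetime import datetime, timezone


def build_warc_header(target_uri, protocols, content_length,
                      record_id=None, date=None):
    """Return a WARC/1.1 header block as a string.

    `protocols` is an ordered list such as ["tls/1.3", "gemini"];
    each entry becomes its own repeated WARC-Protocol field.
    """
    record_id = record_id or f"<urn:uuid:{uuid.uuid4()}>"
    date = date or datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    fields = [
        ("WARC-Type", "response"),  # assumed record type
        ("WARC-Record-ID", record_id),
        ("WARC-Date", date),
        ("WARC-Target-URI", target_uri),
    ]
    # One field per protocol layer, in the order given.
    fields += [("WARC-Protocol", name) for name in protocols]
    fields.append(("Content-Length", str(content_length)))
    lines = ["WARC/1.1"] + [f"{key}: {value}" for key, value in fields]
    # The header block ends with an empty line before the record payload.
    return "\r\n".join(lines) + "\r\n\r\n"


header = build_warc_header("gemini://example.org/", ["tls/1.3", "gemini"], 0)
```

The protocols are passed as a list so that the same helper would work for other stacks without changes.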

ato commented 1 year ago

Thanks! I've added Gemini to the definitions in the WARC-Protocol proposal (#42).