lsg551 / matricula-online-scraper

Scraper for Matricula Online
https://pypi.org/project/matricula-online-scraper/
MIT License

More intuitive CLI commands #19

Open lsg551 opened 1 month ago

lsg551 commented 1 month ago

Description

See #3 for details on what Matricula hosts, how it is organized, and the terminology used.

The following command scrapes all parishes available on Matricula, optionally narrowed by search/filter parameters:

$ matricula-online-scraper fetch location -e csv

This returns a list with more than 8000 entries. Here is the head of the output:

country  ,region             ,name          ,url
Slovenia ,Nadškofija Maribor ,001 Apače     ,https://data.matricula-online.eu/en/slovenia/maribor/apace/
Slovenia ,Nadškofija Maribor ,002 Artiče    ,https://data.matricula-online.eu/en/slovenia/maribor/artice/
Slovenia ,Nadškofija Maribor ,004 Bele Vode ,https://data.matricula-online.eu/en/slovenia/maribor/bele-vode/
Slovenia ,Nadškofija Maribor ,005 Beltinci  ,https://data.matricula-online.eu/en/slovenia/maribor/beltinci/
Slovenia ,Nadškofija Maribor ,006 Bizeljsko ,https://data.matricula-online.eu/en/slovenia/maribor/bizeljsko/
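To reuse this listing programmatically, the URL column has to be extracted. A minimal sketch, assuming the space padding shown above is part of the CSV output (stripping is harmless if it is not); `read_padded_csv` is a hypothetical helper, not part of the scraper:

```python
import csv
import io

# Sample copied from the head of the output shown above.
SAMPLE = """\
country  ,region             ,name          ,url
Slovenia ,Nadškofija Maribor ,001 Apače     ,https://data.matricula-online.eu/en/slovenia/maribor/apace/
Slovenia ,Nadškofija Maribor ,002 Artiče    ,https://data.matricula-online.eu/en/slovenia/maribor/artice/
"""


def read_padded_csv(text):
    """Parse CSV text whose cells are padded with spaces for alignment.

    Hypothetical helper: returns one dict per row, keyed by the
    (stripped) column names from the header line.
    """
    reader = csv.reader(io.StringIO(text))
    header = [cell.strip() for cell in next(reader)]
    return [
        dict(zip(header, (cell.strip() for cell in row)))
        for row in reader
        if row  # skip blank lines
    ]


parishes = read_padded_csv(SAMPLE)
parish_urls = [parish["url"] for parish in parishes]
print(parish_urls)
```

The resulting `parish_urls` list is what the second command needs as input.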

The URLs in the output of the first command can be fed into the second command, which scrapes all available sources of a parish. For 001 Apače:

$ matricula-online-scraper fetch parish -e csv --url https://data.matricula-online.eu/en/slovenia/maribor/apace/
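Chaining the two steps for several parishes could look roughly like this. The sketch only builds and prints the command lines, using the current subcommand and flags exactly as shown above; `parish_command` is a hypothetical helper:

```python
import shlex

# Two parish URLs taken from the listing above.
PARISH_URLS = [
    "https://data.matricula-online.eu/en/slovenia/maribor/apace/",
    "https://data.matricula-online.eu/en/slovenia/maribor/artice/",
]


def parish_command(url):
    """Return the argv list for the current `fetch parish` subcommand."""
    return ["matricula-online-scraper", "fetch", "parish", "-e", "csv", "--url", url]


commands = [parish_command(url) for url in PARISH_URLS]
for argv in commands:
    # Printed rather than executed; pass argv to subprocess.run(argv, check=True)
    # to actually invoke the scraper once it is installed.
    print(shlex.join(argv))
```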

This returns a list of all available digitized sources of the parish. Here is the head of the output:

name                     ,url                                                               ,accession_number ,date      ,register_type            ,date_range_start ,date_range_end
Krstna knjiga / Taufbuch ,https://data.matricula-online.eu/en/slovenia/maribor/apace/00001/ ,           00001 ,1673-1689 ,Krstna knjiga / Taufbuch ,"Jan. 1, 1673"   ,"Dec. 31, 1689"
Krstna knjiga / Taufbuch ,https://data.matricula-online.eu/en/slovenia/maribor/apace/00002/ ,           00002 ,1728-1742 ,Krstna knjiga / Taufbuch ,"Jan. 1, 1728"   ,"Dec. 31, 1742"
Krstna knjiga / Taufbuch ,https://data.matricula-online.eu/en/slovenia/maribor/apace/00003/ ,           00003 ,1742-1760 ,Krstna knjiga / Taufbuch ,"Jan. 1, 1742"   ,"Dec. 31, 1760"
Krstna knjiga / Taufbuch ,https://data.matricula-online.eu/en/slovenia/maribor/apace/00004/ ,           00004 ,1760-1804 ,Krstna knjiga / Taufbuch ,"Jan. 1, 1760"   ,"Dec. 31, 1804"
Krstna knjiga / Taufbuch ,https://data.matricula-online.eu/en/slovenia/maribor/apace/00005/ ,           00005 ,1804-1820 ,Krstna knjiga / Taufbuch ,"Jan. 1, 1804"   ,"Dec. 31, 1820"
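A consumer of this output would typically want to pick the register covering a given year. A minimal sketch, assuming the `date` column always has the form `YYYY-YYYY` as in the rows above (not verified for all parishes); `read_sources` and `covering` are hypothetical helpers:

```python
import csv
import io

# Sample copied from the head of the output shown above.
SAMPLE = '''\
name                     ,url                                                               ,accession_number ,date      ,register_type            ,date_range_start ,date_range_end
Krstna knjiga / Taufbuch ,https://data.matricula-online.eu/en/slovenia/maribor/apace/00001/ ,           00001 ,1673-1689 ,Krstna knjiga / Taufbuch ,"Jan. 1, 1673"   ,"Dec. 31, 1689"
Krstna knjiga / Taufbuch ,https://data.matricula-online.eu/en/slovenia/maribor/apace/00002/ ,           00002 ,1728-1742 ,Krstna knjiga / Taufbuch ,"Jan. 1, 1728"   ,"Dec. 31, 1742"
'''


def read_sources(text):
    """Parse the padded CSV and add integer year bounds to each record."""
    reader = csv.reader(io.StringIO(text))
    header = [cell.strip() for cell in next(reader)]
    records = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        record = dict(zip(header, (cell.strip() for cell in row)))
        # Assumption: `date` looks like "1673-1689".
        start, end = record["date"].split("-")
        record["year_start"], record["year_end"] = int(start), int(end)
        records.append(record)
    return records


def covering(sources, year):
    """Return the sources whose year range includes `year`."""
    return [s for s in sources if s["year_start"] <= year <= s["year_end"]]


sources = read_sources(SAMPLE)
for source in covering(sources, 1730):
    print(source["accession_number"], source["url"])
```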

I propose renaming the subcommands so that they match the entities of Matricula and are therefore more intuitive:

  1. fetch location becomes list parishes, used like list parishes --all or list parishes --filter-place "name"
  2. fetch parish becomes list sources, used like list sources --parish … --parish …
  3. a new command for fetching the sources of a parish (#3) will be get source, used like get source --url … --url …
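The proposed command tree can be sketched with the standard library's `argparse`. This is only an illustration of the naming scheme from the list above, not the project's actual CLI code. The repeatable `--parish` and `--url` options and the `dest` names are assumptions, and no scraping is wired in:

```python
import argparse


def build_parser():
    """Build a parser mirroring the proposed `list` / `get` subcommands."""
    parser = argparse.ArgumentParser(prog="matricula-online-scraper")
    commands = parser.add_subparsers(dest="command", required=True)

    # `list parishes` and `list sources`
    list_cmd = commands.add_parser("list")
    list_sub = list_cmd.add_subparsers(dest="entity", required=True)

    parishes = list_sub.add_parser("parishes")
    parishes.add_argument("--all", action="store_true")
    parishes.add_argument("--filter-place", metavar="NAME")

    sources = list_sub.add_parser("sources")
    # Assumption: --parish may be given several times.
    sources.add_argument("--parish", action="append", required=True)

    # `get source`
    get_cmd = commands.add_parser("get")
    get_sub = get_cmd.add_subparsers(dest="entity", required=True)

    source = get_sub.add_parser("source")
    # Assumption: --url may be given several times.
    source.add_argument("--url", action="append", required=True)

    return parser


args = build_parser().parse_args(
    ["list", "sources", "--parish", "apace", "--parish", "artice"]
)
print(args.command, args.entity, args.parish)
```

Grouping by verb (`list`, `get`) and then by entity (`parishes`, `sources`, `source`) keeps each command name aligned with a Matricula entity.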

Affected Versions

All versions, including the most recent one, v0.3.0.

This proposes a breaking change!
