salgo60 / Wikidata_riksdagen-corpus

repository for matching Wikidata with riksdagen-corpus
MIT License

NER et al #130

Open salgo60 opened 1 year ago

salgo60 commented 1 year ago

@fredrik1984 @Lottabrorsson isn't it easier for you to update Wikidata directly and add the correct date? It is not rocket science...

I think a more structured approach is to transform the book to TEI, or at least the articles, with well-defined tags for

image

Some of the issues I see with the current Wikidata PM data:


Electoral districts

Entity schema

One way to get better quality in Wikidata is to use Entity Schemas. I have started on one, see #129.

NER on SPA data to quality-check and extract parties...

WD Q1448829 has two values for the SPA property P4819

I have started to add book references ---> sj9PGLAlnmUAAAAAABfkWw is Tvåkammar-riksdagen 1867–1970

SPA API --> endpoints/portraits.php?id=sj9PGLAlnmUAAAAAABfkWw returns the text as JSON

SPA "PortraitCatalog": ["Tvåkammar-riksdagen 1867-"]

SPA started as a one-man scanning project, and some good programmers built him an application --> you have structured data, BUT it is not 100 percent reliable, e.g. "PortraitCatalog": ["Tvåkammar-riksdagen 1867-"] could mean the text is from the book, but it is not always so
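
A minimal sketch of how such a record could be screened, assuming the JSON has a top-level "PortraitCatalog" list as in the fragment quoted above. The sample string, the `BOOK_PREFIX` marker and the `cites_book` helper are hypothetical and not part of the SPA API. Because the field is not fully reliable, a True result only flags a candidate for manual checking.

```python
import json

# Hypothetical record shaped after the "PortraitCatalog" fragment in this thread.
# A real SPA response may carry other fields; only this key is taken from the thread.
SAMPLE = '{"PortraitCatalog": ["Tvåkammar-riksdagen 1867-"]}'

BOOK_PREFIX = "Tvåkammar-riksdagen"  # assumed marker for the book


def cites_book(record_json: str, prefix: str = BOOK_PREFIX) -> bool:
    """Return True if any PortraitCatalog entry starts with the book prefix."""
    record = json.loads(record_json)
    catalog = record.get("PortraitCatalog") or []
    return any(str(entry).startswith(prefix) for entry in catalog)


print(cites_book(SAMPLE))
```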

salgo60 commented 1 year ago

Twitter We did some experiments. We fed #ChatGPT with (semi-)unstructured texts and the already correct #TEI #XML.

salgo60 commented 1 year ago

OCR from SPA

Rydberg i Korsgården, Nils August (Imp). AK för Kinnefiärdings, Kinne o Kållands ds, Skar l, 1867-1869. 12.3.1824 i Lena, Älvsb, t 3.9.1885 i Skärv. Lantbrukare. Son t hovrättskommissarien P A R Ulrika Eleonora Hårdh. — G m Lovisa Carolio Sophia Otterström ( 1836 i Sunnersberg, t na 1894 i Öglunda), dtr t lantbrukaren o kronolänsPehr O o Lovisa Florén. — Barn: Carl mannen August ( 1859), Frans David Hugo ( 1861), Sten Hakon Linus ( 1 863), Selma Louise Ulrica ( 1864), Knut Paul ( *1866). — Äg av KorsgårAugust Rydberg den i Skärv. Inskr vid Uppsala univ 1845, kameralex 1848. Kronolänsman i Kållands hd 1856. — L av styr för Lidköping—Skara —Stenstorps järnvägs AB Skar I folkhögskola. L av direktionerna för o länslasarettet o Skar I brandstodsbol 1884—85. Kommunalman i Skärv. Riksdagsuppdrag: Suppl i BaU 1 868—69. L av opinionsnämnden 1867. Valman vid fullmvalen 1869. Motioner: Sex, om ändringar vad gäller giftomannarätten, bla att myndig kvinna skulle kunna vara sin egen giftoman, samt om ändringar i konkurslagen, resereglementet o stadgandena angående rättegångsfullmakt. Litt.: SP.

Corrected text

Rydberg i Korsgården, Nils August (Imp). AK för Kinnefiärdings, Kinne o Kållands ds, Skar l, 1867-1869. 12.3.1824 i Lena, Älvsb, t 3.9.1885 i Skärv. Lantbrukare. Son t hovrättskommissarien P A R Ulrika Eleonora Hårdh. — G m Lovisa Carolio Sophia Otterström ( 1836 i Sunnersberg, t na 1894 i Öglunda), dtr t lantbrukaren o kronoläns Pehr O o Lovisa Florén. — Barn: Carl mannen August ( 1859), Frans David Hugo ( 1861), Sten Hakon Linus ( 1 863), Selma Louise Ulrica ( 1864), Knut Paul ( *1866). — Äg av KorsgårAugust Rydberg den i Skärv. Inskr vid Uppsala univ 1845, kameralex 1848. Kronolänsman i Kållands hd 1856. — L av styr för Lidköping—Skara — Stenstorps järnvägs AB Skar I folkhögskola. L av direktionerna för o länslasarettet o Skar I brandstodsbol 1884—85. Kommunalman i Skärv. Riksdagsuppdrag: Suppl i BaU 1868—69. L av opinionsnämnden 1867. Valman vid fullmvalen 1869. Motioner: Sex, om ändringar vad gäller giftomannarätten, bla att myndig kvinna skulle kunna vara sin egen giftoman, samt om ändringar i konkurslagen, resereglementet o stadgandena angående rättegångsfullmakt. Litt.: SP.
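
The differences between the OCR output and the corrected text are easy to miss by eye. A rough sketch of a word-level comparison with Python's standard `difflib` follows. The two strings are short excerpts joined from the texts above, and `word_changes` is a hypothetical helper name.

```python
import difflib


def word_changes(ocr: str, corrected: str):
    """Yield (operation, ocr_words, corrected_words) for each differing span."""
    a, b = ocr.split(), corrected.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            yield op, a[i1:i2], b[j1:j2]


# Excerpts only; the full entries can be passed in the same way.
ocr = "dtr t lantbrukaren o kronolänsPehr O o Lovisa Florén. Suppl i BaU 1 868—69."
fixed = "dtr t lantbrukaren o kronoläns Pehr O o Lovisa Florén. Suppl i BaU 1868—69."
for change in word_changes(ocr, fixed):
    print(change)
```

Splitting on whitespace keeps the comparison readable, at the cost of reporting a changed space as a replaced word pair.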

salgo60 commented 1 year ago

Test with ChatGPT prompt

<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Rydberg i Korsgården</title>
      </titleStmt>
      <publicationStmt>
        <publisher>Tvåkammar-riksdagen 1867–1970</publisher>
        <idno type="wikidata">Q110346241</idno>
      </publicationStmt>
      <sourceDesc>
        <p>Svenskt porträttgalleri: XXV:2. Riksdagens andra kammare 1867-1904</p>
        <idno type="wikidata">Q21600495</idno>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <person>
        <persName>
          <forename>Nils</forename>
          <forename>August</forename>
          <surname>Rydberg</surname>
          <addName iort="Rydberg i Korsgården" wikidata="Q6085142"></addName>
        </persName>
        <occupation wikidata="Q97940862">Lantbrukare</occupation>
        <affiliation>
          <orgName>Imp</orgName>
          <idno type="wikidata">Q6085142</idno>
        </affiliation>
        <event>
          <name>andrakammarledamot</name>
          <idno type="wikidata">Q81531912</idno>
          <date when="1867">1867</date>
          <date when="1869">1869</date>
          <location>
            <placeName>Kinnefjärdings, Kinne och Kållands domsagas valkrets</placeName>
            <idno type="wikidata">Q18451306</idno>
          </location>
        </event>
        <birth>
          <date when="1824-03-12">12.3.1824</date>
          <placeName wikidata="Q10556909">Lena, Älvsb</placeName>
        </birth>
        <death>
          <date when="1885-09-03">3.9.1885</date>
          <placeName wikidata="Q10671041">Skärv</placeName>
        </death>
        <education>
          <orgName>Uppsala univ</orgName>
          <date when="1845">1845</date>
        </education>
        <relation>
          <name type="father">hovrättskommissarien P A R</name>
          <name type="mother">Ulrika Eleonora Hårdh</name>
          <name type="spouse">Lovisa Carolio Sophia Otterström</name>
          <birth when="1836">1836</birth>
          <death when="1894">1894</death>
          <location>
            <placeName wikidata="Q11036613">Sunnersberg</placeName>
            <placeName wikidata="Q10670476">Öglunda</placeName>
          </location>
        </relation>
        <relation>
          <name type="child">Carl August</name>
          <birth when="1859">1859</birth>
        </relation>
        <relation>
          <name type="child">Frans David Hugo</name>
          <birth when="1861">1861</birth>
        </relation>
        <relation>
          <name type="child">Sten Hakon Linus</name>
          <birth when="1863">1863</birth>
        </relation>
        <relation>
          <name type="child">Selma Louise Ulrica</name>
          <birth when="1864">1864</birth>
        </relation>
        <relation>
          <name type="child">Knut Paul</name>
          <birth when="1866">1866</birth>
        </relation>
      </person>
    </body>
  </text>
</TEI>
salgo60 commented 1 year ago

Test Erik Sparre prompt - very bad OCR

salgo60 commented 1 year ago

Carl Rydqvist

salgo60 commented 1 year ago

WD Q6166587

<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Rönblad i Härnösand, Georg Peter</title>
      </titleStmt>
      <publicationStmt>
        <publisher>TVÅKAMMAR-RIKSDAGEN 1867–1970</publisher>
      </publicationStmt>
      <sourceDesc>
        <bibl>
          <title>Svenskt porträttgalleri : XXV:2. Riksdagens andra kammare 1867-1904</title>
          <idno type="wikidata">Q21600495</idno>
        </bibl>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <person xml:id="Q21602753">
        <persName>
          <name type="iort">Rönblad i Härnösand</name>
          <name type="firstName">Georg</name>
          <name type="firstName">Peter</name>
        </persName>
        <birth when="1812-08-29">29.8.1812</birth>
        <death when="1873-05-28">28.5.1873</death>
        <birthPlace ref="Q2736813">Njurunda</birthPlace>
        <deathPlace ref="Q25287">Härnösand</deathPlace>
        <occupation ref="Q3323635">Rådman</occupation>
        <affiliation>
          <orgName>TVÅKAMMAR-RIKSDAGEN 1867–1970</orgName>
          <idno type="wikidata">Q110346241</idno>
        </affiliation>
        <relation>
          <name type="father">Eric Petter Rönberg</name>
        </relation>
        <relation>
          <name type="mother">Lovisa Frestadius</name>
        </relation>
        <relation>
          <name type="spouse">Lovisa Thalén</name>
          <birth when="1811">1811</birth>
          <death when="1854">1854</death>
        </relation>
        <relation>
          <name type="spouse">Maria Ulrika Hellberg</name>
          <birth when="1819">1819</birth>
          <death when="1882">1882</death>
        </relation>
      </person>
    </body>
  </text>
</TEI>
salgo60 commented 1 year ago

Can you create the TEI schema from the documents you have generated?

<schema xmlns="http://www.w3.org/2001/XMLSchema"
  xmlns:tei="http://www.tei-c.org/ns/1.0"
  targetNamespace="http://www.tei-c.org/ns/1.0"
  elementFormDefault="qualified"
  attributeFormDefault="unqualified">

  <import namespace="http://www.tei-c.org/ns/1.0" schemaLocation="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng"/>

  <element name="TEI">
    <complexType>
      <sequence>
        <element ref="tei:teiHeader"/>
        <element ref="tei:text"/>
      </sequence>
    </complexType>
  </element>

  <element name="teiHeader">
    <complexType>
      <sequence>
        <element ref="tei:fileDesc"/>
      </sequence>
    </complexType>
  </element>

  <element name="fileDesc">
    <complexType>
      <sequence>
        <element ref="tei:titleStmt"/>
        <element ref="tei:publicationStmt"/>
        <element ref="tei:sourceDesc"/>
      </sequence>
    </complexType>
  </element>

  <element name="titleStmt">
    <complexType>
      <sequence>
        <element ref="tei:title"/>
      </sequence>
    </complexType>
  </element>

  <element name="publicationStmt">
    <complexType>
      <sequence>
        <element ref="tei:publisher"/>
      </sequence>
    </complexType>
  </element>

  <element name="sourceDesc">
    <complexType>
      <sequence>
        <element ref="tei:bibl"/>
      </sequence>
    </complexType>
  </element>

  <element name="text">
    <complexType>
      <sequence>
        <element ref="tei:body"/>
      </sequence>
    </complexType>
  </element>

  <element name="body">
    <complexType>
      <sequence>
        <element ref="tei:person" maxOccurs="unbounded"/>
      </sequence>
    </complexType>
  </element>

  <element name="person">
    <complexType>
      <sequence>
        <element ref="tei:persName"/>
        <element ref="tei:birth"/>
        <element ref="tei:death"/>
        <element ref="tei:birthPlace"/>
        <element ref="tei:deathPlace"/>
        <element ref="tei:occupation"/>
        <element ref="tei:affiliation"/>
        <element ref="tei:relation" minOccurs="0" maxOccurs="unbounded"/>
      </sequence>
    </complexType>
  </element>

</schema>
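
A quick way to see whether the generated documents actually follow this schema is to compare the children of each person element with the declared sequence. The sketch below only mirrors that sequence using the standard library (which has no XSD validator); it is not a real schema validation. `person_problems` and the inline sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Child order copied from the <sequence> under "person" in the schema above.
REQUIRED = ["persName", "birth", "death", "birthPlace",
            "deathPlace", "occupation", "affiliation"]


def person_problems(person: ET.Element) -> list[str]:
    """Compare a <person>'s children with the sequence the schema declares."""
    tags = [child.tag.replace("{" + TEI_NS + "}", "") for child in person]
    problems = []
    head, tail = tags[:len(REQUIRED)], tags[len(REQUIRED):]
    if head != REQUIRED:
        problems.append(f"expected {REQUIRED}, found {head}")
    # After the required children, only <relation> may follow (0..n times).
    extra = [t for t in tail if t != "relation"]
    if extra:
        problems.append(f"unexpected trailing elements: {extra}")
    return problems


doc = f"""<TEI xmlns="{TEI_NS}"><text><body>
<person><persName>John Doe</persName><birth>1900-01-01</birth>
<occupation>Writer</occupation></person>
</body></text></TEI>"""
for person in ET.fromstring(doc).iter("{" + TEI_NS + "}person"):
    print(person_problems(person))
```

Run against the two generated entries, a check like this would accept the child order of the Rönblad document but flag the Rydberg one, which uses event and education and has no birthPlace or deathPlace.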
salgo60 commented 1 year ago

Can you document the schema in the same style as the CLARIN "Multilingual comparable corpora of parliamentary debates ParlaMint 2.1" (https://www.clarin.si/repository/xmlui/handle/11356/1432)?

1. Introduction

This schema defines the structure for encoding biographical data about persons in a TEI-compliant XML format. It's designed to be simple and focused on basic personal information, such as name, birth, death, occupation, and relationships.

2. Schema structure

The schema follows the structure of a typical TEI document, with a teiHeader containing fileDesc, titleStmt, publicationStmt, and sourceDesc elements, followed by the text element that contains a body element with multiple person elements.

3. Elements

3.1. TEI

The root element of the TEI document.

3.2. teiHeader

Represents the header of the TEI document, containing metadata about the document.

3.3. fileDesc

Describes the electronic file, including the title statement, publication statement, and source description.

3.4. titleStmt

Contains the title of the document.

3.5. publicationStmt

Contains information about the publication of the document.

3.6. sourceDesc

Describes the source of the information in the document.

3.7. text

Represents the main text of the TEI document.

3.8. body

Represents the body of the text, containing one or more person elements.

3.9. person

Represents a person with biographical information such as name, birth, death, occupation, and relationships.

3.9.1. persName

Contains the full name of the person.

3.9.2. birth

Contains the birth date of the person.

3.9.3. death

Contains the death date of the person.

3.9.4. birthPlace

Contains the birthplace of the person.

3.9.5. deathPlace

Contains the place of death of the person.

3.9.6. occupation

Contains the person's occupation.

3.9.7. affiliation

Contains the person's affiliation or the organization they are associated with.

3.9.8. relation

(Optional) Represents a relationship with another person, such as family members, spouse, etc.

4. Example

An example TEI-encoded document using this schema is provided below:

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Biographical Data</title>
      </titleStmt>
      <publicationStmt>
        <publisher>Example Publisher</publisher>
      </publicationStmt>
      <sourceDesc>
        <bibl>Example Bibliographic Source</bibl>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <person>
        <persName>John Doe</persName>
        <birth>1900-01-01</birth>
        <death>1970-12-31</death>
        <birthPlace>New York</birthPlace>
        <deathPlace>Los Angeles</deathPlace>
        <occupation>Writer</occupation>
        <affiliation>Example Organization</affiliation>
        <relation type="spouse">Jane Doe</relation>
      </person>
    </body>
  </text>
</TEI>
salgo60 commented 1 year ago

image

image

salgo60 commented 1 year ago

ChatGPT prompt with Wikidata info

ChatGPT

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Rydberg i Korsgården</title>
      </titleStmt>
      <publicationStmt>
        <p>TEI/XML version of text about Nils August Rydberg</p>
      </publicationStmt>
      <sourceDesc>
        <p>Based on information from Wikidata Q6085142</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>
        Rydberg i Korsgården, 
         <persName ref="http://www.wikidata.org/entity/Q6085142">Nils August (Imp)</persName>. 
         <title ref="http://www.wikidata.org/entity/Q81531912">AK</title> för Kinnefiärdings, Kinne o Kållands ds, 
         Skar l, 1867-1869. <date when="1824-03-12" 
                     ref="http://www.wikidata.org/entity/Q10556909">12.3.1824 i Lena, Älvsb</date>, 
         <date when="1885-09-03" ref="http://www.wikidata.org/entity/Q10671041">t 3.9.1885 i Skärv</date>. 
         * <occupation ref="http://www.wikidata.org/entity/Q97940862">Lantbrukare</occupation>. 
         Son t hovrättskommissarien P A R Ulrika Eleonora Hårdh. — 
         G m Lovisa Carolio Sophia Otterström 
         ( *1836 i Sunnersberg, t na 1894 i Öglunda), dtr t lantbrukaren o kronoläns Pehr O o Lovisa Florén. 
           — Barn: Carl mannen August ( *1859), Frans David Hugo ( *1861), Sten Hakon Linus ( *1 863), 
            Selma Louise Ulrica ( *1864), Knut Paul ( *1866). — 
             Äg av KorsgårAugust Rydberg den i Skärv. 
           <orgName ref="http://www.wikidata.org/entity/Q185246">Inskr vid Uppsala univ</orgName> 
             1845, kameralex 1848. Kronolänsman i Kållands hd 1856. — L av styr för Lidköping—Skara — 
             Stenstorps järnvägs AB Skar I folkhögskola. L av direktionerna för o länslasarettet o 
             Skar I brandstodsbol 1884—85. Kommunalman i Skärv. 
             Riksdagsuppdrag: Suppl i BaU 1868—69. L av opinionsnämnden 1867. 
             Valman vid fullmvalen 1869. Motioner: Sex, om ändringar vad gäller giftomannarätten, 
              bla att myndig kvinna skulle kunna vara sin egen giftoman, 
              samt om ändringar i konkurslagen, resereglementet o stadgandena 
              angående rättegångsfullmakt. Litt.: SP.
      </p>
    </body>
  </text>
</TEI>
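
With the Wikidata links carried in ref attributes, as in this last version, they can be pulled out again for a quality check against Wikidata. A minimal sketch using the standard library follows. `wikidata_refs` is a hypothetical helper, and the snippet is a two-element excerpt of the document above.

```python
import xml.etree.ElementTree as ET

WD_PREFIX = "http://www.wikidata.org/entity/"


def wikidata_refs(tei_xml: str):
    """Return (local tag name, QID, text) for every element whose @ref is a Wikidata URI."""
    found = []
    for elem in ET.fromstring(tei_xml).iter():
        ref = elem.get("ref", "")
        if ref.startswith(WD_PREFIX):
            tag = elem.tag.rsplit("}", 1)[-1]  # drop the "{namespace}" part
            found.append((tag, ref[len(WD_PREFIX):], (elem.text or "").strip()))
    return found


snippet = """<p xmlns="http://www.tei-c.org/ns/1.0">
<persName ref="http://www.wikidata.org/entity/Q6085142">Nils August (Imp)</persName>.
<occupation ref="http://www.wikidata.org/entity/Q97940862">Lantbrukare</occupation>.
</p>"""
for row in wikidata_refs(snippet):
    print(row)
```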