lingua-libre / RecordWizard

🌻 MediaWiki extension allowing mass recording of clean, well cut, well named pronunciation files.
https://lingualibre.org
GNU General Public License v2.0

External tool to alternative SPARQL endpoint #21

Open hugolpz opened 5 months ago

hugolpz commented 5 months ago

Qichwa services

Lingualibre JS to edit

Approach

Rework ExternalTools.prototype.WikidataQueryService into a more general function.
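
One possible shape for such a general function is a small lookup table of known SPARQL services, sketched below. This is only an illustration: `SPARQL_SERVICES` and `toSparqlJsonUrl` are hypothetical names, and the Qichwa query UI prefix (`https://qichwa.wikibase.cloud/query/#`) is an assumption inferred from the endpoint path, not something confirmed in this thread.

```javascript
// Hypothetical lookup table: query UI prefix -> SPARQL endpoint.
// The Qichwa UI prefix is an assumption, to be checked against the real service.
var SPARQL_SERVICES = [
    { ui: 'https://query.wikidata.org/#', endpoint: 'https://query.wikidata.org/sparql' },
    { ui: 'https://qichwa.wikibase.cloud/query/#', endpoint: 'https://qichwa.wikibase.cloud/query/sparql' }
];

// Turns a pasted query UI URL into a JSON endpoint URL.
// Returns null when the URL does not belong to a known service.
function toSparqlJsonUrl( url ) {
    var i, service;
    for ( i = 0; i < SPARQL_SERVICES.length; i++ ) {
        service = SPARQL_SERVICES[ i ];
        if ( url.indexOf( service.ui ) === 0 ) {
            // The part after '#' is the already URL-encoded query
            return service.endpoint + '?query=' + url.substring( service.ui.length ) + '&format=json';
        }
    }
    return null;
}
```

With a table like this, supporting another Wikibase would mean adding one entry rather than one more `else if` branch.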

Context

Lingualibre properties on Lingualibre items :

Test SPARQL

PREFIX qwb: <https://qichwa.wikibase.cloud/entity/>
PREFIX qdp: <https://qichwa.wikibase.cloud/prop/direct/>
PREFIX qp: <https://qichwa.wikibase.cloud/prop/>
PREFIX qps: <https://qichwa.wikibase.cloud/prop/statement/>
PREFIX qpq: <https://qichwa.wikibase.cloud/prop/qualifier/>
PREFIX qpr: <https://qichwa.wikibase.cloud/prop/reference/>
PREFIX qno: <https://qichwa.wikibase.cloud/prop/novalue/>
select ?entry ?id ?idLabel ?posLabel
where {
?entry a ontolex:LexicalEntry; 
       wikibase:lemma ?idLabel;
       wikibase:lexicalCategory [rdfs:label ?posLabel] . filter(lang(?posLabel)="en")
       # OPTIONAL { 
         ?entry qdp:P1 ?wikidata.
          BIND (iri(concat("http://www.wikidata.org/entity/",?wikidata)) as ?id)
       # }
}

Test url

Run the query > Link > SPARQL endpoint : right click > copy link
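
The copied link is the endpoint URL followed by the URL-encoded query. As a rough sketch of how such a test URL can be rebuilt by hand (`buildSparqlUrl` is a hypothetical helper, and the `format=json` parameter is taken from the gadget code in this thread):

```javascript
// Builds a SPARQL endpoint URL from a raw query string.
// encodeURIComponent escapes spaces, '?', ':', '{', '}' and so on.
function buildSparqlUrl( endpoint, query ) {
    return endpoint + '?query=' + encodeURIComponent( query ) + '&format=json';
}

var testUrl = buildSparqlUrl(
    'https://qichwa.wikibase.cloud/query/sparql',
    'SELECT ?entry WHERE { ?entry qdp:P1 ?wikidata. }'
);
console.log( testUrl );
```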

hugolpz commented 5 months ago

A Lingualibre "External tool" query to an external endpoint is ideal when we want to keep a joint with Wikidata items or lexemes. It makes later feedback contributions to Wikidata easier, such as reinjecting Lingualibre's audio files into the correct Wikidata item or lexeme pages.

I tested this query on your project :

PREFIX qwb: <https://qichwa.wikibase.cloud/entity/>
PREFIX qdp: <https://qichwa.wikibase.cloud/prop/direct/>
PREFIX qp: <https://qichwa.wikibase.cloud/prop/>
PREFIX qps: <https://qichwa.wikibase.cloud/prop/statement/>
PREFIX qpq: <https://qichwa.wikibase.cloud/prop/qualifier/>
PREFIX qpr: <https://qichwa.wikibase.cloud/prop/reference/>
PREFIX qno: <https://qichwa.wikibase.cloud/prop/novalue/>
select ?entry ?id ?idLabel ?posLabel
where {
?entry a ontolex:LexicalEntry; 
       wikibase:lemma ?idLabel;
       wikibase:lexicalCategory [rdfs:label ?posLabel] . filter(lang(?posLabel)="en")
       OPTIONAL { 
         ?entry qdp:P1 ?wikidata.
          BIND (iri(concat("http://www.wikidata.org/entity/",?wikidata)) as ?id)
       }
}

Your project's data very rarely has a Wikidata id (P1), so there is currently no point in using the external tool.
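
To put a number on "very rarely", the JSON response of the query above can be scanned for bindings that carry an `id`. This is a minimal sketch, assuming the standard SPARQL JSON results layout in which unbound variables are simply absent from a binding; `countLinked` is a hypothetical helper and the sample data is made up for illustration.

```javascript
// Counts how many result rows have a bound ?id (i.e. a Wikidata joint).
function countLinked( data ) {
    var bindings = data.results.bindings,
        linked = 0,
        i;
    for ( i = 0; i < bindings.length; i++ ) {
        if ( bindings[ i ].id !== undefined ) {
            linked++;
        }
    }
    return { total: bindings.length, linked: linked };
}

// Made-up sample in the SPARQL JSON results shape
var sample = { results: { bindings: [
    { entry: { value: 'https://qichwa.wikibase.cloud/entity/L2' }, idLabel: { value: 'yaku' } },
    { entry: { value: 'https://qichwa.wikibase.cloud/entity/L3' }, idLabel: { value: 'word2' },
        id: { value: 'http://www.wikidata.org/entity/L2' } }
] } };
console.log( countLinked( sample ) );
```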

Solution 1 : low strategy

You can therefore just as well create a wiki page list with no Wikibase joint (Telegram discussion > solution 2) :

Open https://lingualibre.org/wiki/List:Que/Elwin , where `Que` is your language's ISO 639-3 code.
Add your 6,000 words by hand, one word per line, such as :
# word1
# word2
# word3
Save.
Message me, I will make some edits.

Then, open Lingualibre.org recording studio.
Step2: select Quechua
Step3: select "Local list" > search : List:Que/Elwin
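
Typing 6,000 lines by hand is error-prone, so the `# word` lines could also be generated from an existing word array and pasted into the list page. A minimal sketch, where `toListWikitext` is a hypothetical helper and not part of Lingualibre:

```javascript
// Turns an array of words into the one-word-per-line list format,
// trimming whitespace and skipping empty entries.
function toListWikitext( words ) {
    return words
        .map( function ( word ) { return word.trim(); } )
        .filter( function ( word ) { return word !== ''; } )
        .map( function ( word ) { return '# ' + word; } )
        .join( '\n' );
}

console.log( toListWikitext( [ 'word1', ' word2 ', '', 'word3' ] ) );
```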

Solution 2 : medium strategy

1. Create 2 new properties on Lingualibre
   - Property `Lexicographic external base` : `qichwa.wikibase.cloud`
   - Property `Lexicographic external base ID` : `L2` (for https://qichwa.wikibase.cloud/wiki/Lexeme:L2 )
2. Finish fixing externaltool.js so that it:
   2.1 pulls from qichwa :
      - ?id = L2
      - ?idLabel = yaku
   2.2 uploads .wav file to commons
   2.3 records on lingualibre item :
      - Wikimedia Commons recording pointer url ( https://commons.wikimedia.org/wiki/File:*.wav ) See example: https://lingualibre.org/wiki/Q191178#P3
      - Lexicographic external base : qichwa.wikibase.cloud
      - Lexicographic external base ID : L2
3. A Lingualibre <=> Qichwa joint now exists : 
   3.1 On Qichwa.wikibase.cloud, create property 'recording url pointer' on the model of https://lingualibre.org/wiki/Property:P3
   3.2 Use a bot to read the Lingualibre Qichwa items and, for each item, read:
      - ?id = P? `Lexicographic external base ID` value
      - ?url = P3 `recording url pointer` value
   3.3 Use the bot to update qichwa.wikibase.cloud/wiki/Lexeme:{id}#{url}

but you have one year to do so.
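
Steps 3.2 and 3.3 can be pictured as a pure "planning" pass that pairs each external ID with its recording URL before any write happens. The sketch below is only an illustration: the record shape (`externalBase`, `externalId`, `recordingUrl`) and the function name are hypothetical, since the two new properties do not exist yet, and the actual write to the Wikibase is deliberately left out.

```javascript
// Hypothetical record shape, one per Lingualibre item:
//   { externalBase: 'qichwa.wikibase.cloud', externalId: 'L2', recordingUrl: '...' }
// Returns the list of lexeme pages to update, with the URL to attach to each.
function planQichwaUpdates( records ) {
    return records
        .filter( function ( record ) {
            return record.externalBase === 'qichwa.wikibase.cloud' &&
                Boolean( record.externalId ) &&
                Boolean( record.recordingUrl );
        } )
        .map( function ( record ) {
            return {
                lexeme: 'https://qichwa.wikibase.cloud/wiki/Lexeme:' + record.externalId,
                recordingUrl: record.recordingUrl
            };
        } );
}

// Placeholder data for illustration only
var plan = planQichwaUpdates( [
    { externalBase: 'qichwa.wikibase.cloud', externalId: 'L2', recordingUrl: 'https://commons.wikimedia.org/wiki/File:example.wav' },
    { externalBase: 'qichwa.wikibase.cloud', externalId: 'L3' }
] );
console.log( plan );
```

Keeping the plan separate from the write step makes it easy to review what the bot would change before it edits anything.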

Solution 3: high strategy

  1. On Wikidata, request the creation of a Qichwa_wikibase_identifier property. You can refer to https://wikidata.org/wiki/Wikidata:Property_proposal/Lingua_Libre_ID
  2. Mass-export the relevant Qichwa lexical data to Wikidata, jointed via Qichwa_wikibase_identifier
  3. Use the unedited ExternalTool.js to query Wikidata lexemes in Qichwa.

Sum up

👉🏼 Solution 1: hand-made Lingualibre lists with no wikibase joint. Pro: Fastest. Con: Weakest joint.
👉🏼 Solution 2: externaltool.js made compatible so that it pulls ?id and ?idLabel from qichwa to generate the list and the jointed Lingualibre items. Pro: Good joint. Con: Temporary solution, will need a bot to finish it up. Delay: 2~4 weeks to get into prod. I'm available to do so if needed.
👉🏼 Solution 3: Wikidata property creation for Qichwa_wikibase_id. Pro: Good joint. Con: Slowest.

hugolpz commented 5 months ago

This nearly solves the issue. The indexOfId switch still needs to be clarified.

'use strict';

        var PETSCAN_URL = 'petscan.wmflabs.org/',
            WDQS_URL    = 'query.wikidata.org/',
            QICHWA_URL  = 'qichwa.wikibase.cloud/query/sparql',
            rw = mw.recordWizard;

        var ExternalTools = function ( config ) {
            rw.store.generator.generic.call( this, config );
        };

        OO.inheritClass( ExternalTools, rw.store.generator.generic );

        // This line defines an internal name for the generator
        ExternalTools.static.name = 'externaltools';

        // And this one defines the name for the generator which will be displayed in the UI
        ExternalTools.static.title = 'ExternalTools';

        ExternalTools.prototype.initialize = function () {
            // The two text fields
            this.urlField = new OO.ui.TextInputWidget();
            this.limitField = new OO.ui.NumberInputWidget( { min: 1, max: 2000, value: 500, step: 10, pageStep: 100, isInteger: true } );

            // The custom layout
            this.layout = new OO.ui.Widget( {
                classes: [ 'mwe-recwiz-externaltools' ],
                content: [
                    new OO.ui.FieldLayout( this.urlField, {
                        align: 'top',
                        label: 'ExternalTools URL (PetScan, Wikidata query service):'
                    } ),
                    new OO.ui.FieldLayout(
                        this.limitField, {
                            align: 'top',
                            label: mw.message( 'mwe-recwiz-nearby-limit' ).text()
                        }
                    )
                ]
            } );

            // To be displayed, all the fields/widgets/... should be appended to "this.content.$element"
            this.content.$element.append( this.layout.$element );

            // Do not remove this line, it will initialize the popup itself
            rw.store.generator.generic.prototype.initialize.call( this );
        };

        ExternalTools.prototype.fetch = function () {
            // Get the values of our text fields
            var generator = this,
                url = this.urlField.getValue();

            this.limit = parseInt( this.limitField.getValue(), 10 );

            /*
             * TODO:
             * - list of turnkey urls
             */

            // Initialize a new promise
            this.deferred = $.Deferred();

            // Initialize our word list
            this.list = [];

            // Check if the given URL refers to an allowed external tool
            var isPetscan = url.lastIndexOf( 'http://' + PETSCAN_URL, 0 ) === 0 || url.lastIndexOf( 'https://' + PETSCAN_URL, 0 ) === 0,
                isWDQS = url.lastIndexOf( 'https://' + WDQS_URL, 0 ) === 0,
                isQICHWA = url.lastIndexOf( 'https://' + QICHWA_URL, 0 ) === 0 ;
            if ( isPetscan ) {
                // We will do an AJAX request to petscan's API
                $.get( url + '&output_compatability=quick-intersection&format=json&doit=' ).then( this.PetScan.bind( this ), function ( error ) { generator.deferred.reject( new OO.ui.Error( error ) ); } );
            }
            else if ( isWDQS ) {
                // We will do an AJAX request to Wikidata Query Service
                url = url.replace( 'https://query.wikidata.org/#', 'https://query.wikidata.org/sparql?query=' ) + '&format=json';
                $.get( url ).then( this.WikidataQueryService.bind( this ), function ( error ) { generator.deferred.reject( new OO.ui.Error( error ) ); } );
            }
            else if ( isQICHWA ) {
                // We will do an AJAX request to the provided Query Service.
                // isWDQS is already handled by the branch above, so only isQICHWA is tested here.
                url = url.replace( /(https:\/\/\w+\.\w+\.\w+)\/#/, '$1/sparql?query=' ) + '&format=json';
                $.get( url ).then( this.WikidataQueryService.bind( this ), function ( error ) { generator.deferred.reject( new OO.ui.Error( error ) ); } );
            }
            else {
                this.deferred.reject( new OO.ui.Error( 'This is not an allowed URL... It should link to PetScan or Wikidata Query.' ) );
                return this.deferred.promise();
            }

            this.lockUI();

            // At this point we're not done yet, make the dialog closing process
            // to wait the promise to be resolved or rejected
            this.deferred.then( this.unlockUI.bind( this ), this.unlockUI.bind( this ) );
            return this.deferred.promise();
        };

        ExternalTools.prototype.PetScan = function ( data ) {
            var i, page, ns, element, property,
                prefix = '',
                project = mw.util.getParamValue( 'project', data.query ),
                language = mw.util.getParamValue( 'language', data.query );

            // Check whether the response looks fine or not
            if ( data.status !== 'OK' ) {
                this.deferred.reject( new OO.ui.Error( 'Petscan outputs something weird with this URL, check it and come back afterwards.' ) );
                // Stop here: data.pages cannot be trusted after a failed request
                return;
            }

            // For projects that have a custom property, select it
            switch ( project ) {
                case 'wikipedia':
                    property = 'P19';
                    prefix = language + ':';
                    break;
                case 'wiktionary':
                    property = 'P20';
                    prefix = language + ':';
                    break;
            }

            // Parse the complete response (or at least until the limit is reached)
            for ( i = 0; i < data.pages.length && i < this.limit; i++ ) {
                page = data.pages[ i ];

                element = { text: page.page_title.replace( /_/g, ' ' ) };
                if ( property !== undefined ) {
                    ns = ( page.page_namespace !== 0 ? data.namespaces[ page.page_namespace ] : '' );
                    element[ property ] = prefix + ns + page.page_title;
                }

                this.list.push( element );
            }

            this.deferred.resolve();
        };

        ExternalTools.prototype.WikidataQueryService = function ( data ) {
            var i, item, id, label, property, element;

            // Check whether the response looks fine or not
            if ( data.results === undefined ) {
                this.deferred.reject( new OO.ui.Error( 'SPARQL Query Service outputs something weird with this URL, check it and come back afterwards.' ) );
                return;
            }
            if ( data.results.bindings.length === 0 ) {
                this.deferred.reject( new OO.ui.Error( 'No results in the request.' ) );
                return;
            }
            if ( data.results.bindings[ 0 ].id === undefined || data.results.bindings[ 0 ].label === undefined ) {
                this.deferred.reject( new OO.ui.Error( 'Result must contain both "id" and "label" field.' ) );
                // Stop here: the loop below reads item.id and item.label
                return;
            }

            for( i=0; i < data.results.bindings.length; i++ ) {
                item = data.results.bindings[ i ];

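                // Offset at which the entity ID starts in the entity IRI.
                // Declared with var: under 'use strict', assigning to an
                // undeclared variable throws a ReferenceError.
                // Wikidata: 'http://www.wikidata.org/entity/' is 31 characters long.
                // Qichwa: 'https://qichwa.wikibase.cloud/entity/' is 37 characters long
                // (first estimated at 36; still to be confirmed).
                // TODO: switch on the endpoint to set the exact position of the ID:
                //   <http://www.wikidata.org/entity/L2>
                //   <https://qichwa.wikibase.cloud/entity/L2>
                var indexOfId = 31;
                id = item.id.value.substring( indexOfId );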
                label = item.label.value;
                switch( id[ 0 ] ) {
                    case 'L':
                        property = 'P21';
                        break;
                    default:
                        property = 'P12';
                        break;
                }
                element = { "text": label };
                element[ property ] = id;

                this.list.push( element );
            }

            this.deferred.resolve();
        };

        ExternalTools.prototype.lockUI = function () {
            this.urlField.setDisabled( true );
            this.limitField.setDisabled( true );
        };

        ExternalTools.prototype.unlockUI = function () {
            this.urlField.setDisabled( false );
            this.limitField.setDisabled( false );

            this.getActions().get( { actions: 'save' } )[ 0 ].setDisabled( false );
        };

        rw.store.generator.register( 'externaltools', ExternalTools.static.title, 'll-externaltools', new ExternalTools() );
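
One way to avoid the per-endpoint `indexOfId` switch altogether would be to take whatever follows the last `/` of the entity IRI. This is a suggestion rather than the current behaviour of the gadget, and `extractEntityId` is a hypothetical helper; it assumes every endpoint returns entity IRIs ending in `/entity/<ID>`.

```javascript
// Returns the trailing ID of an entity IRI, whatever the host is.
function extractEntityId( iri ) {
    return iri.substring( iri.lastIndexOf( '/' ) + 1 );
}

console.log( extractEntityId( 'http://www.wikidata.org/entity/L2' ) );      // L2
console.log( extractEntityId( 'https://qichwa.wikibase.cloud/entity/L2' ) ); // L2
```

The existing `switch( id[ 0 ] )` on `L` would then keep working unchanged for both services.
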
ElwinHuaman commented 5 months ago

Hi @hugolpz, thanks for your support during this process, I really appreciate it!

I think you clarified all the questions about which approaches to follow (GitHub). Now, I would like to propose continuing with Solution 2 : medium strategy, which you proposed (and which I understand is temporary):

Roadmap:

  1. Create 2 new properties on Lingualibre
  2. Finish fixing externaltool.js (SPARQL Query Service), as in steps 2.1 to 2.3 of Solution 2
  3. Read the LinguaLibre Wikimedia Commons URL (P3) and update qichwa.wikibase.cloud with a similar property.

Lingualibre JS to edit: rw.generator.ExternalTools.js. "This nearly solves the issue", indexOfId: GitHub/LinguaLibre/RecordWizard#21