googlefonts / fontbakery-dashboard

A library-scale web dashboard for Font Bakery, no longer developed
Apache License 2.0

Duplicate rows / Same family not recognised as the same #73

Open davelab6 opened 6 years ago

davelab6 commented 6 years ago

I went over the current http://35.225.170.228/dashboard and found these rows in which the same family appears as 2 rows:

[Screenshots of the duplicated dashboard rows, captured 2018-06-27 23:53 through 2018-06-28 00:05]

graphicore commented 6 years ago

I'm on this right now. I'm thinking that it's probably best to use:

It seems to me using these places for the family name is the best compromise between control, usability and readability. Still, when a family name is wrong we'll get wrong rows and we'll have to fix that in the sources and in the database. But, eventually that shouldn't be so much cleanup work anymore, once everything is set up.

graphicore commented 6 years ago

OK, the CSVSpreadsheet/upstream source now creates its family_name from the family row of the spreadsheet. To bring the already created DB entries in line with what that source would create now, I ran the following query in the RethinkDB admin interface (posting it here as documentation, since this was just ad hoc).

// you run $ kubectl proxy
// then go to: http://localhost:8001/api/v1/namespaces/default/services/rethinkdb-admin/proxy/#dataexplorer
// and run:

r.db('fontbakery')
 .table('collectiontests')
 .getAll('CSVSpreadsheet/upstream',{index: 'collection_id'})
 .filter(row=>{return row('family_name').ne(row('metadata')('sourceDetails')('name'));})
 .update({family_name: r.row('metadata')('sourceDetails')('name')})

// resulted in:
{
    "deleted": 0 ,
    "errors": 0 ,
    "inserted": 0 ,
    "replaced": 105 ,
    "skipped": 0 ,
    "unchanged": 0
}
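
For readers without access to the cluster, here is a minimal in-memory sketch of what the query above does, in plain JavaScript. The field names (collection_id, family_name, metadata.sourceDetails.name) come from the query; the function name syncFamilyNames and the sample rows are made up for illustration, and nothing here talks to RethinkDB.

```javascript
// Sketch only: mimics the filter + update of the ReQL query on plain objects.
// `syncFamilyNames` is a hypothetical helper, not part of the dashboard.
function syncFamilyNames(rows, collectionId) {
  const summary = { replaced: 0, skipped: 0 };
  const updated = rows.map((row) => {
    const sourceName = row.metadata.sourceDetails.name;
    // Same condition as the ReQL filter: right collection, differing names.
    if (row.collection_id === collectionId && row.family_name !== sourceName) {
      summary.replaced += 1;
      return { ...row, family_name: sourceName };
    }
    summary.skipped += 1;
    return row;
  });
  return { rows: updated, summary };
}

// Made-up sample rows, only for demonstration.
const sampleRows = [
  { collection_id: 'CSVSpreadsheet/upstream', family_name: 'Fira Code',
    metadata: { sourceDetails: { name: 'FiraCode' } } },
  { collection_id: 'CSVSpreadsheet/upstream', family_name: 'Pangolin',
    metadata: { sourceDetails: { name: 'Pangolin' } } },
];
const result = syncFamilyNames(sampleRows, 'CSVSpreadsheet/upstream');
console.log(result.summary); // { replaced: 1, skipped: 1 }
```

Because the ReQL query filters before it updates, rows that already match never reach the update step, which is why the real result shows "unchanged": 0.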

The Git-based sources follow next.

graphicore commented 6 years ago

I'm just going through the dashboard rows to find duplicate rows from the Git-based sources, to rename them to the names that are in the METADATA.pb files. @davelab6 there are some inconsistencies (not only in Git sources, mostly in the Spreadsheet/CSV-file upstream source):

BioRhyme

The API uses BioRhyme and BioRhyme Expanded but the METADATA.pb files use Bio Rhyme and Bio Rhyme Expanded. Are these bugs in the METADATA.pb files?

Ek Mukta

The GitHub master METADATA.pb uses Ek Mukta, but the Spreadsheet/CSV uses Ek-Mukta (also Ek-Mukta Mukta Devanagari, Ek-Mukta MuktaMalar Tamil, Ek-Mukta MuktaVaani Gujarati). My guess is that this should be fixed in the CSV file by removing the hyphen from the name.

Encode Sans Semi {Condensed|Expanded}

The Spreadsheet/CSV-file should remove the hyphen from Encode Sans Semi-Condensed and Encode Sans Semi-Expanded

Fira {Code|Sans*}

The Spreadsheet/CSV-file should add spaces to: FiraCode, FiraSansCondensedHairline, FiraSansExtraCondensedHairline, FiraSansHairline, FiraSansUltra.

Pangolin

The Spreadsheet/CSV-file should (probably) rename "Pangolin Sans" into "Pangolin"

Post No Bills {Jaffna|Colombo}

The Spreadsheet/CSV-file should rename PostNoBills Colombo and PostNoBills Jaffna to include spaces.

Slabo {13px|27px}

The Spreadsheet/CSV-file defines 3 rows: Slabo 13px, Slabo 27px and Slabo.

Varela Round

The Spreadsheet/CSV-file should rename Varela Round Hebrew into Varela Round.

Jomolhari in master

Registers as alpha 3c and has neither a good file name nor a METADATA.pb. The filename is "Jomolhari-alpha3c-0605331.ttf" and the regex we usually use extracts the alpha3c part. This family is not in production.
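
To illustrate how a file name like this can yield the wrong token, here is a sketch with a hypothetical pattern. It is not the regex the dashboard uses; it only assumes that the family token is taken to be whatever sits right before the final "-<something>.ttf" part. The names FAMILY_FROM_FILENAME and familyTokenFromFilename are made up.

```javascript
// Hypothetical pattern (an assumption, not the dashboard's real regex):
// capture the hyphen-free token directly before the last "-<style>.ttf".
const FAMILY_FROM_FILENAME = /([^\/-]+)-[^-]+\.ttf$/;

function familyTokenFromFilename(filename) {
  const match = FAMILY_FROM_FILENAME.exec(filename);
  return match ? match[1] : null;
}

// Conventional "Family-Style.ttf" naming works as intended:
console.log(familyTokenFromFilename('Pangolin-Regular.ttf')); // "Pangolin"
// Two hyphens shift the capture to the middle token:
console.log(familyTokenFromFilename('Jomolhari-alpha3c-0605331.ttf')); // "alpha3c"
```

A CamelCase-to-spaces step applied afterwards would then produce a row name like "alpha 3c" instead of "Jomolhari".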

cmunbbx (Computer Modern)

There was a PR, https://github.com/google/fonts/pull/1129, that put Computer Modern into a wrong slot because of badly chosen file names (improved in commit 1d6f3520f9a256703e6bb831b1d832c0e49cdac4) and a missing METADATA.pb. The file behind this row is ofl/computermodern/cmunbbx.ttf.

While this is not really an issue, it is something that can always happen to the dashboard, because anyone can open a PR. Just mentioning it. I'm not sure whether the naming problems are resolved in the newer revision of the PR. I'm not going to change the cmunbbx row now, but eventually we'll want to get rid of it, I think.

cwTeX

There was a commit "Remove cwTeX fonts" (from master). We still have these rows:

Should I delete them?

graphicore commented 6 years ago

Here's the RethinkDB query that updated existing rows to match what the sources will do in the future, now that they use METADATA.pb when it is present. The list is a good overview of where our CamelCase to names-with-spaces rules break :-D

// you run $ kubectl proxy
// then go to: http://localhost:8001/api/v1/namespaces/default/services/rethinkdb-admin/proxy/#dataexplorer
// and run:

var rename = r.expr({
  "A Bee Zee": "ABeeZee"
, "Bench Nine": "BenchNine"
, "Dawningofa New Day": "Dawning of a New Day"
, "Frederickathe Great": "Fredericka the Great"
, "Gen Bas B": "Gentium Basic"
, "Gen Bk Bas B": "Gentium Book Basic"
, "IM Fe D Pit 28 P": "IM Fell Double Pica"
, "IM Fe D Psc 28 P": "IM Fell Double Pica SC"
, "IM Fe E Nit 28 P": "IM Fell English"
, "IM Fe E Nsc 28 P": "IM Fell English SC"
, "IM Fe F Cit 28 P": "IM Fell French Canon"
, "IM Fe F Csc 28 P": "IM Fell French Canon SC"
, "IM Fe G Pit 28 P": "IM Fell Great Primer"
, "IM Fe G Psc 28 P": "IM Fell Great Primer SC"
, "IM Fe P Iit 28 P": "IM Fell DW Pica"
, "IM Fe P Isc 28 P": "IM Fell DW Pica SC"
, "Josefin Sans Std": "Josefin Sans Std Light"
, "Lateef Reg OT": "Lateef"
, "Lovedbythe King": "Loved by the King"
, "Mc Laren": "McLaren"
, "Medieval Sharp": "MedievalSharp"
, "Mountainsof Christmas": "Mountains of Christmas"
, "OFL Goudy St MTT": "OFL Sorts Mill Goudy TT"
, "Old Standard": "Old Standard TT"
, "PTM55 FT": "PT Mono"
, "Press Start 2 P": "Press Start 2P"
, "Swankyand Moo Moo": "Swanky and Moo Moo"
, "Unifraktur Cook": "UnifrakturCook"
, "Unifraktur Maguntia": "UnifrakturMaguntia"
, "Waitingforthe Sunrise": "Waiting for the Sunrise"
, "Web": "PT Serif Caption"
, "js Math": "jsMath cmbx10"
});

r.db('fontbakery')
.table('collectiontests')
.filter(function(row) {
    return rename.keys().contains(row('family_name'));  
})
.update(function(row){
    return {"family_name": rename(row('family_name'))}
});

// resulted in:
{
    "deleted": 0 ,
    "errors": 0 ,
    "inserted": 0 ,
    "replaced": 221 ,
    "skipped": 0 ,
    "unchanged": 0
}

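
The keys of the rename map above look like the output of a CamelCase splitter. The following sketch is a reconstruction inferred only from those keys, so treat it as an assumption rather than the dashboard's actual rule set; camelToSpaced is a hypothetical name. It shows why names that are correctly written in CamelCase end up as wrong rows.

```javascript
// Reconstructed splitting rules (assumed, derived from the rename map keys):
//  1. lowercase followed by uppercase or digit  -> insert a space
//  2. digit followed by uppercase               -> insert a space
//  3. uppercase followed by a Capitalized word  -> insert a space
function camelToSpaced(name) {
  return name
    .replace(/([a-z])([A-Z0-9])/g, '$1 $2')
    .replace(/([0-9])([A-Z])/g, '$1 $2')
    .replace(/([A-Z])([A-Z][a-z])/g, '$1 $2');
}

// Correct family names from the rename map, and what the rules make of them:
console.log(camelToSpaced('ABeeZee'));        // "A Bee Zee"
console.log(camelToSpaced('McLaren'));        // "Mc Laren"
console.log(camelToSpaced('MedievalSharp'));  // "Medieval Sharp"
console.log(camelToSpaced('UnifrakturCook')); // "Unifraktur Cook"
```

Any family whose real name contains an inner capital has to come from METADATA.pb or a rename map like the one above, because the split cannot be undone from the name alone.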
graphicore commented 6 years ago

I would expect these maintenance tasks to appear more often in the future, so I'm not sure if we should close this issue or keep it open. The queries posted in here are useful to have around though.

Also, the things mentioned in https://github.com/googlefonts/fontbakery-dashboard/issues/73#issuecomment-401902735 will need more changes to already existing rows once they are resolved. Further, some "early access" fonts don't have METADATA.pb files, and I expect they are likely to change their family names, or at least how they register in the dashboard.
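
For future rounds of this cleanup, candidate duplicates like the ones listed in this thread could be found mechanically. This is a minimal sketch, assuming that two names refer to the same family when they differ only in case, spaces, hyphens or underscores; normalizeFamilyName and findLikelyDuplicates are hypothetical helpers, and the sample names are taken from the comments above.

```javascript
// Assumption: spelling variants differ only in case and separators.
function normalizeFamilyName(name) {
  return name.toLowerCase().replace(/[\s_-]+/g, '');
}

// Group names by their normalized form and keep groups with 2+ spellings.
function findLikelyDuplicates(familyNames) {
  const groups = new Map();
  for (const name of familyNames) {
    const key = normalizeFamilyName(name);
    if (!groups.has(key)) {
      groups.set(key, new Set());
    }
    groups.get(key).add(name);
  }
  return [...groups.values()]
    .filter((names) => names.size > 1)
    .map((names) => [...names]);
}

const duplicates = findLikelyDuplicates([
  'BioRhyme', 'Bio Rhyme', 'Ek-Mukta', 'Ek Mukta', 'Pangolin', 'Pangolin Sans',
]);
console.log(duplicates);
// [ [ 'BioRhyme', 'Bio Rhyme' ], [ 'Ek-Mukta', 'Ek Mukta' ] ]
```

Cases such as "Pangolin Sans" versus "Pangolin" or "Varela Round Hebrew" versus "Varela Round" are not caught by this normalization and still need a manual decision.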