pkiraly / metadata-qa-wikidata

Quality assessment for the bibliographic records of Wikidata

Validate page numbers #7

Open pkiraly opened 5 years ago

pkiraly commented 5 years ago

Categories:

extract page numbers from the transformed.json file:

grep "page(s)" transformed.json \
  | sed 's/"page(s)"/"pages"/g' \
  | jq .claims.pages \
  | grep -v "\[" \
  | grep -v "\]" \
  | sed 's/^ *//' \
  | sed 's/"//g' > pages.txt
sort pages.txt | uniq -c > uniq-pages.txt
pkiraly commented 5 years ago

Initial results:

invalid page numbers by categories:

pkiraly commented 5 years ago

There are so many errors, and it is not easy to distinguish acceptable values from wrong ones. Some of the wrong values are wrong not just in Wikidata but also in other sources, such as the DOI metadata.

https://www.wikidata.org/wiki/Q33779012 pages: e99989

The article is Open Access, available here: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0099989. From the PDF it is clear that e99989 is not a page number (the article has proper page numbers: 1-12) but an article identifier (and, as such, part of the DOI).

PubMed (https://www.ncbi.nlm.nih.gov/pubmed/?term=24945803&report=xml&format=text):

<Pagination>
  <MedlinePgn>e99989</MedlinePgn>
</Pagination>

CrossRef:

curl -i -H "Accept: application/rdf+xml" \
  http://data.crossref.org/10.1371/JOURNAL.PONE.0099989
...
    <j.2:pageStart>e99989</j.2:pageStart>
    <j.1:startingPage>e99989</j.1:startingPage>
...
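
The observation that e99989 reappears at the end of the DOI suggests a simple check. The following is only a sketch of such a heuristic, not something used in this repository: the function name `looks_like_article_id` is made up for illustration, and it only looks at values shaped like one letter followed by digits.

```shell
# Hypothetical helper (not part of this repository): succeed when a "pages"
# value of the form <letter><digits> ends the DOI, which hints at an
# article identifier rather than a real page number.
looks_like_article_id() {
  pages=$1
  doi=$2
  # Only handle values such as e99989: one letter, then digits only.
  case "$pages" in
    [A-Za-z][0-9]*) ;;
    *) return 1 ;;
  esac
  digits=${pages#?}            # drop the leading letter
  case "$digits" in
    *[!0-9]*) return 1 ;;      # ranges such as e1-e5 are not checked
  esac
  # Compare against the tail of the DOI.
  case "$doi" in
    *"$digits") return 0 ;;
    *) return 1 ;;
  esac
}

if looks_like_article_id 'e99989' '10.1371/journal.pone.0099989'; then
  echo 'e99989: probably an article identifier'
fi
```

A match is only a hint: a short value could end a DOI by coincidence, so flagged records would still need manual review.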
pkiraly commented 5 years ago

Typical patterns, i.e. those with more than 1000 variations (N denotes one or more digits; the number at the end of each line is the number of variations):

N-N; quiz N: 3405
SN-SN: 2806
CDN: 3006
HN-N: 6606
SN-N: 9687
iN-N: 1341
Ns-Ns: 1459
N, N-N: 3153
N-N; quiz N-N: 5287
N-N, v: 1213
N-N, x: 1170
N–N: 33950
N-N, N: 11916
EN-N: 12059
AN: 1967
EN: 1220
N.eN-N: 4965
N-N; discussion N-N: 20652
MN-N: 1234
aN: 1394
NA-NA: 1011
eN: 210304
fN: 1300
N-N.N: 2176
mN: 1733
oN: 3483
N-N, N-N: 8105
N-N, vii: 2361
N.eN-N.eN: 1798
N-N, N-N, N-N: 1088
N-N; author reply N-N: 2516
GN-N: 3759
RN-N: 7824
NS-NS: 6361
N-N, viii: 2052
DN-N: 2335
eN-N: 10685
eN-eN: 3969
Suppl:N-N: 1194
AN-N: 2080
LN-N: 3178
WN-N: 1969
mN-N: 1879
N-N, N, N: 1181
N-N.eN: 9980
N; author reply N-N: 2380
N-N; discussion N: 7662
N-N passim: 1264
N-N, table of contents: 1451
SN-N; discussion SN-N: 1089
FN-N: 5497
N-N; author reply N: 1955
N-N, vi: 1753
N; author reply N: 2489
N, N: 3877
CN-N: 5486
EN-EN: 3125
oN-N: 3076
N-N, ix: 1611
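
A list like the one above can be produced by replacing every run of digits with N and counting the distinct values per pattern. This is a minimal sketch under that assumption, not necessarily the command that generated the numbers; the function names `page_pattern` and `pattern_variations` and the sample values are made up for illustration.

```shell
# Hypothetical helper: collapse every run of ASCII digits into the letter N,
# so that "e99989" becomes "eN" and "123-130" becomes "N-N".
page_pattern() {
  sed 's/[0-9][0-9]*/N/g'
}

# Count the distinct page values behind each pattern.
# Input: one page value per line (e.g. the content of pages.txt).
pattern_variations() {
  # sort -u first, so that a repeated value counts as one variation
  sort -u | page_pattern | sort | uniq -c | sort -rn
}

# Made-up sample values, for illustration only
printf '%s\n' 'e99989' 'e12345' '123-130' 'S12-S15' 'e99989' \
  | pattern_variations
```

On the real data the call would be `pattern_variations < pages.txt`. Note that a hyphen and an en dash yield different patterns (N-N vs. N–N), as they do in the list above.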
nichtich commented 5 years ago

Those are good explanations for the discussion section of the paper. In what form shall we start writing the text?


-- Jakob Voß via Android

pkiraly commented 5 years ago

According to the call, the journal provides a LaTeX and a Word template (https://jdiq.acm.org/authors.cfm#subm). I vote for LaTeX. We can use either http://sharelatex.gwdg.de or http://overleaf.com for collaborative editing. If you prefer to work in Word, we should start with a Google Doc.