halfak opened this issue 9 years ago
10.1002/(SICI)1097-0142(19960401)77:7<1356::AID-CNCR20>3.0.CO;2-#
I've run into this one too!
doi_isbn_pubmed_and_arxiv.enwiki-20161201.tsv.bz2 also contains some /download suffixes, such as:
10.18352/bmgn-lchr.7417/galley/7464/download 10.4038/tapro.v5i1.5654/galley/4523/download
Or /getPDF:
10.00024/bdef:TuftsPDF/getPDF 10.00025/bdef:TuftsPDF/getPDF
And even:
10.1002/14356007.a12_495/abstract;jsessionid=EFC500556A6060AC9BEC57789816DC84.f01t01
That "/download" could be a perfectly valid part of a DOI. At this point, we'll be implementing heuristics to know when to stop processing a DOI. It seems we could have a list of known suffix patterns that we should strip -- like extensions, "/download", "/getPDF", etc. That would mean DOIs that actually had that as part of the DOI would be broken, but in the end, I expect this will be more useful.
That "/download" could be a perfectly valid part of a DOI.
Are you sure? This is not how I read https://www.doi.org/doi_handbook/2_Numbering.html#2.5 :
Handle syntax imposes two constraints on the prefix — both slash and dot are "reserved characters", with the slash separating the prefix from the suffix and the dot used to extend sub prefixes.
AFAICT, everything starting with the second / in a matched string should be dropped.
Oh good. I hadn't caught that in the spec.
Ah, I indeed misread the sentence, which is preceded by
Neither the Handle System nor DOI system policies, nor any web use currently imaginable, impose any constraints on the suffix, outside of encoding (see below).
Of course a DOI like 10./1234/abcdef or 10..1234/abcdef would be invalid. In the suffix, the dot is frequently used, but I've yet to find any slash. Sadly there's also stuff like:
10.1671/0272-4634(2002)022[0564:EAEFTC]2.0.CO;2
Which I think the current regex doesn't match.
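For reference, a permissive pattern along these lines would match the bracketed form. This is only a sketch, not the actual mwcites regex; the terminating character class is an assumption about what ends a DOI in wikitext:

```python
import re

# Hypothetical permissive DOI pattern: "10.", a registrant code of
# 4-9 digits, a slash, then any run of characters that are not
# whitespace, quotes, or angle brackets. Parentheses and square
# brackets in the suffix are allowed, so DOIs like
# 10.1671/0272-4634(2002)022[0564:EAEFTC]2.0.CO;2 match.
DOI_RE = re.compile(r'10\.\d{4,9}/[^\s"<>]+')
```

Note that a pattern this permissive over-matches in the other direction (trailing punctuation, URL query strings), which is exactly the cleanup problem discussed in the rest of this thread.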
I think I found something now:
10.1093/jac/dkh029 10.1088/0953-2048/20/8/L03 10.1093/jac/39.3.393
So these are the suffixes in the dataset:
grep -Eo "/[a-z]+$" doi.enwiki-20161201.txt | sort | uniq -c | sort -nr
6523 /abstract
1674 /full
1243 /pdf
505 /issues
416 /currentissue
216 /epdf
114 /issuetoc
90 /summary
76 /meta
32 /pdb
17 /a
9 /references
7 /suppinfo
5 /otherversions
5 /citedby
5 /b
4 /c
3 /standard
3 /j
2 /e
2 /download
2 /deaths
2 /d
2 /core
1 /wu
1 /wright
1 /towne
1 /topics
1 /sys
1 /stadaf
1 /soeknr
1 /science
1 /sce
1 /s
1 /rstl
1 /rspb
1 /rra
1 /rob
1 /ref
1 /rcm
1 /ppi
1 /polb
1 /pletnik
1 /panetti
1 /p
1 /nsm
1 /metrics
1 /masai
1 /marks
1 /lt
1 /lrshef
1 /lo
1 /komatsu
1 /kim
1 /kier
1 /journal
1 /job
1 /jid
1 /jacsm
1 /itj
1 /isom
1 /ijhit
1 /ic
1 /hrdq
1 /home
1 /gt
1 /goldbook
1 /gm
1 /g
1 /fsu
1 /fneng
1 /figures
1 /erg
1 /enu
1 /enhanced
1 /earlyview
1 /djlit
1 /dev
1 /dcsupplemental
1 /dawson
1 /cst
1 /comments
1 /cne
1 /cleaver
1 /chemse
1 /bjmcs
1 /beej
1 /bay
1 /azl
1 /articledoi
1 /armulik
1 /albers
1 /ai
1 /acref
1 /abstrac
1 /abstact
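The suffix-stripping heuristic discussed above could be sketched like this; the suffix list is illustrative, drawn from the frequency counts above, and strip_suffix is a hypothetical helper, not mwcites code:

```python
# Known URL-ish suffixes to strip from the tail of an extracted DOI.
# Illustrative subset of the counts above; a real list would be longer.
KNOWN_SUFFIXES = ("/abstract", "/full", "/pdf", "/epdf", "/issues",
                  "/currentissue", "/issuetoc", "/summary", "/meta",
                  "/download", "/getPDF")

def strip_suffix(doi):
    """Remove one known trailing suffix, accepting that a rare DOI
    which genuinely ends in such a string would be broken."""
    for suffix in KNOWN_SUFFIXES:
        if doi.endswith(suffix):
            return doi[:-len(suffix)]
    return doi
```

For example, strip_suffix('10.18352/bmgn-lchr.7417/galley/7464/download') returns '10.18352/bmgn-lchr.7417/galley/7464'.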
There's also a need to URL-decode and HTML-unescape some DOIs like
10.1002/(SICI)1096-8644(199602)99:2<345::AID-AJPA9>3.0.CO;2-X
10.1644/1545-1542(2000)081<1025:PROPGG>2.0.CO;2
10.1666/0094-8373(2000)026<0450:FPINDI>2.0.CO;2
10.1093/acref/9780199666317.001.0001/acref-9780199666317-e-4513>
10.1666/0094-8373%282004%29030%3C0203:PODITE%3E2.0.CO&rfr_id=info:sid/libx&rft.genre=article
10.1111/j.1558–5646.2007.00179.x
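The decoding step can be done with the standard library alone; normalize_doi is a hypothetical helper name:

```python
import html
from urllib.parse import unquote

def normalize_doi(raw):
    """Resolve percent-encoding (%28 -> "(", %3C -> "<") first,
    then HTML entities (&lt; -> "<")."""
    return html.unescape(unquote(raw))
```

This alone won't fix typographic damage like the en dash in 10.1111/j.1558–5646.2007.00179.x, which needs a separate character substitution.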
The only legit DOIs containing a "&" are:
10.1075/p&c.16.2.07kel
10.11588/ai.2006.1&2.11114
10.1207/s15324834basp2602&3_7
10.1207/s1532690xci1002&3_2
10.1207/s15326985ep2603&4_6
10.1207/s15327043hup1102&3_3
10.1207/s15327051hci0603&4_6
10.1207/s15327663jcp1401&2_19
10.1207/s15327698jfc0403&4_5
10.1207/S15327728JMME1602&3_4
10.1207/S15327965PLI1403&4_17
10.1207/s15327965pli1403&4_21
10.1207/S15327965PLI1403&4_9
10.1207/s15427439tc1202&3_6
10.1207/s15473341lld0103&4_2
10.2495/D&NE-V4-N2-154-169
10.2495/D&NE-V4-N2-97-104
10.2495/D&N-V2-N4-319-327
These were found with the search grep '&' doi.enwiki-20161201.txt | grep -vE '&(pgs|magic|cookie|prog|title|volume|spage|issn|date|issue|search|ct|term|representation|uid|image|ttl|rft|return|item|bypass|vmode|utm|typ|tab|hl|er|code).+', which excludes the most common "&" suffixes as listed by grep -Eo '&[a-z]+' doi.enwiki-20161201.txt | sort | uniq -c | sort -nr.
We can live with a few odd cases like:
10.1023/A:1012776919384&token2=exp=1445542903~acl=/static/pdf/3/art%253A10.1023%252FA%253A10127769</nowiki>
10.1023/A:1017572119543&token2=exp=1444092499~acl=/static/pdf/341/art%253A10.1023%252FA%253A1017
10.1666/0094-8373%282004%29030%3C0203:PODITE%3E2.0.CO&rfr_id=info:sid/libx&rft.genre=article
10.7326/0003-4819-158-3-201302050-00003&an_fo_ed
There are also various extraneous unopened ")" characters at the end of a few hundred DOIs, while extraneous brackets are only found as part of external links.
I'm running an extraction on the latest dumps, and to get a clean list of DOIs I will run the output through these sed commands, with regexes which IMHO can easily be incorporated into mwcites:
s,/(getPDF|currentissue|issue|abstract|summary|pdf|asset|full|homepage|otherversions|epdf|issuetoc|meta|pdb|references|suppinfo|citedby|standard|download|editorial-board|earlyview|aims|page|file).*$,,g
s,(/|"|;|;jsessionid=.+|"/?>.*|<|\.|&|]]|:-->|</nowiki>|>| |'*\.?\)?\[http.+|</small>)$,,g
s,&([a-z]+=.*)$,,g
s,^([^(]+)\.?\)$,\1,g
Now cut -f5,6 citations.tsv | grep ^doi | cut -f2 | sed --regexp-extended -f doiclean.sed | sort -u
(which takes less than 3 seconds) is looking remarkably clean, though there are still odd mistakes like
10.1002/dac.1162,2010
10.1007/BF00558453.pdf
10.1007/BF01414807.org
10.1007/BF01761146http://www.churchomania.com/church/551912538158670/Gestalt+Pastoral+Care
10.1007/BF<sub>00660068</sub>
10.1007/s00228-008-0554-y.pdf
10.1007/s00381-013-2168-7</small>
10.1007/s10397-007<E2><80><93>0338-x
10.1007/s10530-010-9859-8.''
10.1007/s10531-004-5020-2>
10.1007/s10686-011-9275-9.(open
10.1016/S0140-6736(17)31492-7showArticle
The cleanup reduces the latest dump extraction from 777452 to 765499 DOIs, which is a whopping 1.53745 % error correction. ;-)
Thanks for this thorough analysis. Just finished reading through it.
Not sure whether this is covered by the above or is a separate issue, but we just ran into Google Maps URLs being caught as DOIs because they look similar.
Were those regexes incorporated in the latest release https://figshare.com/articles/Wikipedia_Scholarly_Article_Citations/1299540 or should I run them myself after downloading it?
Self-answer: there are still all sorts of spurious DOIs (whole list attached). 2018-03-23_dois.txt.gz
After applying my regexes above, the list goes from 1100422 to 1067405 lines. 2018-03-23_dois_cleaned.txt.gz
We should stop processing a DOI when we see an important URL character -- e.g. "?", "&" or "#".
Still, there are some DOIs that have these characters in them, e.g.
10.1002/(SICI)1097-0142(19960401)77:7<1356::AID-CNCR20>3.0.CO;2-#
But most of the time it's because we're processing a messy URL.
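A minimal sketch of that truncation heuristic, assuming we accept breaking the rare legitimate DOIs shown above; truncate_at_url_chars is a hypothetical name:

```python
def truncate_at_url_chars(doi, chars="?&#"):
    """Cut an extracted DOI at the first URL-significant character.
    Legitimate DOIs containing '&' or '#' (see examples above) get
    truncated too -- the accepted trade-off of this heuristic."""
    cuts = [doi.index(c) for c in chars if c in doi]
    return doi[:min(cuts)] if cuts else doi
```

For example, truncate_at_url_chars('10.7326/0003-4819-158-3-201302050-00003&an_fo_ed') drops the trailing &an_fo_ed.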