mediawiki-utilities / python-mwcites

Catching URL nonsense in a DOI #7

Open halfak opened 9 years ago

halfak commented 9 years ago

We should stop processing a DOI when we see an important URL character -- e.g. "?", "&" or "#".

Still there are some DOIs that have these characters in them. e.g. 10.1002/(SICI)1097-0142(19960401)77:7<1356::AID-CNCR20>3.0.CO;2-#

But most of the time it's because we're processing a messy URL.
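
A minimal sketch of that cutoff, assuming a plain helper function (this is not the actual mwcites code), could look like this:

import re

# Hypothetical helper: stop a candidate DOI at the first URL-delimiter
# character.  The rare legitimate DOIs containing "#" or "&" would be
# truncated too.
URL_DELIMITERS = re.compile(r"[?&#]")

def truncate_at_url_delimiter(candidate):
    """Return the candidate DOI up to the first '?', '&' or '#'."""
    match = URL_DELIMITERS.search(candidate)
    return candidate[:match.start()] if match else candidate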

nemobis commented 7 years ago

10.1002/(SICI)1097-0142(19960401)77:7<1356::AID-CNCR20>3.0.CO;2-#

I've run into this one too!

doi_isbn_pubmed_and_arxiv.enwiki-20161201.tsv.bz2 also contains some /download such as:

10.18352/bmgn-lchr.7417/galley/7464/download
10.4038/tapro.v5i1.5654/galley/4523/download

Or /getPDF:

10.00024/bdef:TuftsPDF/getPDF
10.00025/bdef:TuftsPDF/getPDF

And even:

10.1002/14356007.a12_495/abstract;jsessionid=EFC500556A6060AC9BEC57789816DC84.f01t01

halfak commented 7 years ago

That "/download" could be a perfectly valid part of a DOI. At this point, we'll be implementing heuristics to know when to stop processing a DOI. It seems we could have a list of known suffix patterns that we should strip -- like extensions, "/download", "/getPDF", etc. That would mean DOIs that actually had that as part of the DOI would be broken, but in the end, I expect this will be more useful.

nemobis commented 7 years ago

That "/download" could be a perfectly valid part of a DOI.

Are you sure? This is not how I read https://www.doi.org/doi_handbook/2_Numbering.html#2.5 :

Handle syntax imposes two constraints on the prefix — both slash and dot are "reserved characters", with the slash separating the prefix from the suffix and the dot used to extend sub prefixes.

AFAICT, everything starting with the second / in a matched string should be dropped.
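
If that reading held, the truncation would be a one-line rule. A hypothetical sketch, not part of mwcites (and the counterexamples further down show that suffixes containing a slash do exist, so this is too aggressive as-is):

def drop_after_second_slash(candidate):
    """Keep only "prefix/first-suffix-segment", dropping a second "/"
    and everything after it."""
    prefix, slash, rest = candidate.partition("/")
    if not slash:
        return candidate
    return prefix + "/" + rest.split("/", 1)[0]

# drop_after_second_slash("10.18352/bmgn-lchr.7417/galley/7464/download")
#   -> "10.18352/bmgn-lchr.7417"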

halfak commented 7 years ago

Oh good. I hadn't caught that in the spec.

nemobis commented 7 years ago

Ah, I indeed misread the sentence, which is preceded by

Neither the Handle System nor DOI system policies, nor any web use currently imaginable, impose any constraints on the suffix, outside of encoding (see below).

Of course a DOI like 10./1234/abcdef or 10..1234/abcdef would be invalid. In the suffix, the dot is frequently used, but I've yet to find any slash. Sadly there's also stuff like:

10.1671/0272-4634(2002)022[0564:EAEFTC]2.0.CO;2

I think the current regex doesn't match that one.
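
For illustration, a pattern permissive enough to accept such a DOI (an assumed pattern, not the one mwcites actually uses) is also the kind of pattern that runs past URL characters, which is exactly the tension this issue is about:

import re

# Assumed, deliberately permissive DOI pattern: it accepts parentheses,
# square brackets and semicolons in the suffix, but for the same reason
# it also runs past "?", "&", "#" and other URL junk.
PERMISSIVE_DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')

# PERMISSIVE_DOI_RE.search("10.1671/0272-4634(2002)022[0564:EAEFTC]2.0.CO;2")
# matches the whole string above.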

nemobis commented 7 years ago

I think I found some now, i.e. valid DOIs with a slash in the suffix:

10.1093/jac/dkh029
10.1088/0953-2048/20/8/L03
10.1093/jac/39.3.393

nemobis commented 7 years ago

Noise identification

So these are the suffixes in the dataset:

grep -Eo "/[a-z]+$" doi.enwiki-20161201.txt | sort | uniq -c | sort -nr
   6523 /abstract
   1674 /full
   1243 /pdf
    505 /issues
    416 /currentissue
    216 /epdf
    114 /issuetoc
     90 /summary
     76 /meta
     32 /pdb
     17 /a
      9 /references
      7 /suppinfo
      5 /otherversions
      5 /citedby
      5 /b
      4 /c
      3 /standard
      3 /j
      2 /e
      2 /download
      2 /deaths
      2 /d
      2 /core
      1 /wu
      1 /wright
      1 /towne
      1 /topics
      1 /sys
      1 /stadaf
      1 /soeknr
      1 /science
      1 /sce
      1 /s
      1 /rstl
      1 /rspb
      1 /rra
      1 /rob
      1 /ref
      1 /rcm
      1 /ppi
      1 /polb
      1 /pletnik
      1 /panetti
      1 /p
      1 /nsm
      1 /metrics
      1 /masai
      1 /marks
      1 /lt
      1 /lrshef
      1 /lo
      1 /komatsu
      1 /kim
      1 /kier
      1 /journal
      1 /job
      1 /jid
      1 /jacsm
      1 /itj
      1 /isom
      1 /ijhit
      1 /ic
      1 /hrdq
      1 /home
      1 /gt
      1 /goldbook
      1 /gm
      1 /g
      1 /fsu
      1 /fneng
      1 /figures
      1 /erg
      1 /enu
      1 /enhanced
      1 /earlyview
      1 /djlit
      1 /dev
      1 /dcsupplemental
      1 /dawson
      1 /cst
      1 /comments
      1 /cne
      1 /cleaver
      1 /chemse
      1 /bjmcs
      1 /beej
      1 /bay
      1 /azl
      1 /articledoi
      1 /armulik
      1 /albers
      1 /ai
      1 /acref
      1 /abstrac
      1 /abstact
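
The same count could be produced inside Python rather than with the shell pipeline above, e.g. (doi_list is assumed to be an iterable of extracted DOI strings):

import re
from collections import Counter

suffix_re = re.compile(r"/[a-z]+$")

def count_suffixes(doi_list):
    """Count trailing /lowercase-word segments, like the grep|sort|uniq above."""
    counts = Counter()
    for doi in doi_list:
        match = suffix_re.search(doi)
        if match:
            counts[match.group()] += 1
    return counts

# count_suffixes(doi_list).most_common() reproduces the table above.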

There's also a need to URL-decode and HTML-unescape some DOIs like

10.1002/(SICI)1096-8644(199602)99:2&lt;345::AID-AJPA9&gt;3.0.CO;2-X
10.1644/1545-1542(2000)081&lt;1025:PROPGG&gt;2.0.CO;2
10.1666/0094-8373(2000)026&lt;0450:FPINDI&gt;2.0.CO;2
10.1093/acref/9780199666317.001.0001/acref-9780199666317-e-4513&gt;
10.1666/0094-8373%282004%29030%3C0203:PODITE%3E2.0.CO&rfr_id=info:sid/libx&rft.genre=article
10.1111/j.1558&ndash;5646.2007.00179.x
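
Decoding DOIs like the ones above is mostly a matter of undoing percent-encoding and HTML entities before any other cleanup; a sketch of that step (an assumed approach, not current mwcites behaviour):

import html
from urllib.parse import unquote

def decode_doi(raw):
    """Undo percent-encoding and HTML entities, so "&lt;" becomes "<"
    and "%28" becomes "(" before further cleanup."""
    return html.unescape(unquote(raw))

# decode_doi("10.1666/0094-8373%282004%29030%3C0203:PODITE%3E2.0.CO")
#   -> "10.1666/0094-8373(2004)030<0203:PODITE>2.0.CO"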

The only legitimate DOIs containing a "&" are:

10.1075/p&c.16.2.07kel
10.11588/ai.2006.1&2.11114
10.1207/s15324834basp2602&3_7
10.1207/s1532690xci1002&3_2
10.1207/s15326985ep2603&4_6
10.1207/s15327043hup1102&3_3
10.1207/s15327051hci0603&4_6
10.1207/s15327663jcp1401&2_19
10.1207/s15327698jfc0403&4_5
10.1207/S15327728JMME1602&3_4
10.1207/S15327965PLI1403&4_17
10.1207/s15327965pli1403&4_21
10.1207/S15327965PLI1403&4_9
10.1207/s15427439tc1202&3_6
10.1207/s15473341lld0103&4_2
10.2495/D&NE-V4-N2-154-169
10.2495/D&NE-V4-N2-97-104
10.2495/D&N-V2-N4-319-327

These were found with grep '&' doi.enwiki-20161201.txt | grep -vE '&(pgs|magic|cookie|prog|title|volume|spage|issn|date|issue|search|ct|term|representation|uid|image|ttl|rft|return|item|bypass|vmode|utm|typ|tab|hl|er|code).+', which excludes the most common "suffixes" as counted by grep -Eo '&[a-z]+' doi.enwiki-20161201.txt | sort | uniq -c | sort -nr.

We can live with a few odd cases like:

10.1023/A:1012776919384&token2=exp=1445542903~acl=/static/pdf/3/art%253A10.1023%252FA%253A10127769</nowiki>
10.1023/A:1017572119543&token2=exp=1444092499~acl=/static/pdf/341/art%253A10.1023%252FA%253A1017
10.1666/0094-8373%282004%29030%3C0203:PODITE%3E2.0.CO&rfr_id=info:sid/libx&rft.genre=article
10.7326/0003-4819-158-3-201302050-00003&an_fo_ed

A few hundred DOIs also end in an extraneous unmatched ")", while extraneous brackets only show up as part of external links.

All in all

I'm running an extraction on the latest dumps, and to get a clean list of DOIs I will run the output through these sed commands, whose regexes IMHO could easily be incorporated into mwcites:

s,/(getPDF|currentissue|issue|abstract|summary|pdf|asset|full|homepage|otherversions|epdf|issuetoc|meta|pdb|references|suppinfo|citedby|standard|download|editorial-board|earlyview|aims|page|file).*$,,g
s,(/|"|;|;jsessionid=.+|"/?>.*|&lt;|\.|&amp|]]|:--&gt|</nowiki>|&gt;|&nbsp;|'*\.?\)?\[http.+|</small>)$,,g
s,&([a-z]+=.*)$,,g
s,^([^(]+)\.?\)$,\1,g
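
Ported to Python, the same four rules might be folded into mwcites roughly like this (a sketch; the rule list and function name are made up for illustration):

import re

CLEANUP_RULES = [
    # 1. Known publisher-URL path segments and anything after them.
    (r"/(getPDF|currentissue|issue|abstract|summary|pdf|asset|full|homepage|"
     r"otherversions|epdf|issuetoc|meta|pdb|references|suppinfo|citedby|"
     r"standard|download|editorial-board|earlyview|aims|page|file).*$", ""),
    # 2. Trailing punctuation, wiki markup, HTML entities and session ids.
    (r"(/|\"|;|;jsessionid=.+|\"/?>.*|&lt;|\.|&amp|\]\]|:--&gt|</nowiki>|"
     r"&gt;|&nbsp;|'*\.?\)?\[http.+|</small>)$", ""),
    # 3. URL query parameters glued onto the DOI.
    (r"&([a-z]+=.*)$", ""),
    # 4. A trailing unmatched ")" (no "(" anywhere before it).
    (r"^([^(]+)\.?\)$", r"\1"),
]

def clean_doi(doi):
    for pattern, replacement in CLEANUP_RULES:
        doi = re.sub(pattern, replacement, doi)
    return doi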

Now the output of cut -f5,6 citations.tsv | grep ^doi | cut -f2 | sed --regexp-extended -f doiclean.sed | sort -u (which takes less than 3 seconds) looks remarkably clean, though there are still odd mistakes like:

10.1002/dac.1162,2010
10.1007/BF00558453.pdf
10.1007/BF01414807.org
10.1007/BF01761146http://www.churchomania.com/church/551912538158670/Gestalt+Pastoral+Care
10.1007/BF<sub>00660068</sub>
10.1007/s00228-008-0554-y.pdf
10.1007/s00381-013-2168-7</small>
10.1007/s10397-007<E2><80><93>0338-x
10.1007/s10530-010-9859-8.''
10.1007/s10531-004-5020-2>
10.1007/s10686-011-9275-9.(open
10.1016/S0140-6736(17)31492-7showArticle

nemobis commented 7 years ago

The cleanup reduces the latest dump extraction from 777452 to 765499 DOIs, which is a whopping 1.53745 % error correction. ;-)

halfak commented 7 years ago

Thanks for this thorough analysis. Just finished reading through it.

Samwalton9 commented 6 years ago

Not sure if this is covered by the discussion above or is a separate thing, but we just ran into a problem with Google Maps URLs being caught as DOIs, because they look similar.

nemobis commented 6 years ago

Were those regexes incorporated in the latest release https://figshare.com/articles/Wikipedia_Scholarly_Article_Citations/1299540 or should I run them myself after downloading it?

nemobis commented 6 years ago

Self-answer: there are still all sorts of spurious DOIs (the whole list is attached). 2018-03-23_dois.txt.gz

After applying my regexes above, the list goes from 1100422 to 1067405 lines. 2018-03-23_dois_cleaned.txt.gz