lvergergsk / BibGallery-FrontEnd

Parse DBLP data #2

Open kenkangxgwe opened 6 years ago

kenkangxgwe commented 6 years ago

Possible Issues

  1. The key of a record is not unique. (screenshot attached)
  2. The personal homepage is not parsed, so we cannot access the personal information. (screenshot attached)
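
One way to check the first issue is to count how often each `key` attribute occurs across records. The following is only a minimal sketch, not the code of either parser referenced in this thread: the sample XML and its keys are made up, and `duplicate_keys` is a hypothetical helper. The real dblp.xml also relies on character entities declared in its DTD, which this sketch does not handle.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny made-up sample standing in for dblp.xml (assumption: keys are invented).
SAMPLE = b"""<dblp>
<article key="journals/x/A1"><title>First</title></article>
<www key="homepages/a/Alice"><author>Alice</author><title>Home Page</title></www>
<article key="journals/x/A1"><title>Second</title></article>
</dblp>"""

# The eight record types listed in the analyzer output in this thread.
RECORD_TAGS = {"article", "inproceedings", "proceedings", "book",
               "incollection", "phdthesis", "mastersthesis", "www"}

def duplicate_keys(source):
    """Return {key: count} for every record key that occurs more than once."""
    counts = Counter()
    # iterparse streams the file, so the whole document is never held in memory.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag in RECORD_TAGS:
            counts[elem.get("key")] += 1
            elem.clear()  # free the children of the finished record
    return {key: n for key, n in counts.items() if n > 1}

print(duplicate_keys(io.BytesIO(SAMPLE)))
```

If the returned dictionary is empty for the full file, the keys are unique at the record level and the duplication comes from somewhere else, for example from how rows are generated per author.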

Reference

  1. https://github.com/kite1988/dblp-parser
  2. http://blog.csdn.net/kite1988/article/details/5186628

kenkangxgwe commented 6 years ago

Yet another parser, which I think is better: https://github.com/ScaDS/dblp-parser

lvergergsk commented 6 years ago

The structure is like this:

Ground tags: incollection www mastersthesis book proceedings phdthesis inproceedings article 
6173776 tags in total.
46750 incollection tags in total.
2064044 www tags in total.
10 mastersthesis tags in total.
15023 book tags in total.
36599 proceedings tags in total.
64924 phdthesis tags in total.
2151343 inproceedings tags in total.
1795083 article tags in total.
incollection tag ranges: (ee,0,1) (note,0,1) (chapter,0,1) (year,1,1) (author,0,50) (title,1,1) (cdrom,0,1) (url,1,1) (number,0,1) (pages,0,1) (publisher,0,1) (cite,0,104) (booktitle,1,1) (crossref,0,1) 
www tag ranges: (ee,0,1) (note,0,8) (editor,0,6) (year,0,1) (author,0,10) (cite,0,30) (title,0,1) (crossref,0,1) (booktitle,0,1) (url,0,17) 
mastersthesis tag ranges: (ee,0,1) (year,1,1) (school,1,1) (author,1,1) (title,1,1) (url,0,1) 
book tag ranges: (ee,0,7) (editor,0,13) (note,0,1) (year,1,1) (author,0,18) (isbn,0,4) (title,1,1) (cdrom,0,1) (url,0,2) (volume,0,1) (pages,0,2) (month,0,1) (school,0,2) (series,0,2) (publisher,0,2) (cite,0,741) (booktitle,0,1) 
proceedings tag ranges: (ee,0,5) (editor,0,31) (note,0,2) (address,0,1) (year,1,1) (author,0,1) (isbn,0,3) (title,1,1) (url,0,2) (volume,0,2) (number,0,1) (journal,0,1) (pages,0,1) (series,0,2) (publisher,0,2) (cite,0,212) (booktitle,0,1) (crossref,0,1) 
phdThesis tag ranges: (ee,0,7) (note,0,2) (year,1,1) (author,1,3) (isbn,0,3) (title,1,1) (url,0,1) (volume,0,1) (number,0,1) (pages,0,2) (month,0,1) (school,0,3) (series,0,1) (publisher,0,1) 
inproceedings tag ranges: (ee,0,7) (note,0,1) (editor,0,3) (year,1,1) (author,0,139) (title,1,1) (cdrom,0,2) (url,0,3) (number,0,1) (pages,0,1) (month,0,1) (cite,0,137) (booktitle,1,1) (crossref,0,2) 
article tag ranges: (ee,0,2) (note,0,2) (editor,0,5) (year,1,1) (author,0,287) (title,1,1) (cdrom,0,1) (url,0,1) (volume,0,1) (number,0,1) (journal,0,1) (pages,0,1) (month,0,1) (publisher,0,1) (cite,0,348) (crossref,0,1) (booktitle,0,1) 
Used: 32 seconds

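A brief explanation of the output: each entry has the form (tagname, min, max), giving the minimum and maximum number of times the tag occurs within a single record of that type. min=0 means the tag may be absent from some records.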
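
To make the (tagname, min, max) format concrete, here is a minimal sketch of how such ranges could be computed in one streaming pass. It is not the analyzer linked in this thread: the sample records are invented, and `tag_ranges` is a hypothetical helper that only looks at direct children of each record.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Invented two-record sample (assumption: not taken from dblp.xml).
SAMPLE = b"""<dblp>
<article key="a1"><author>A</author><author>B</author><title>T1</title><year>2001</year></article>
<article key="a2"><title>T2</title><year>2002</year><ee>http://example.org/t2</ee></article>
</dblp>"""

def tag_ranges(source, record_tags):
    """Return {record type: {child tag: (min, max)}} of per-record occurrence counts."""
    totals = Counter()              # record type -> number of records seen
    present = defaultdict(Counter)  # record type -> tag -> records containing the tag
    lo = defaultdict(dict)          # smallest non-zero count seen per tag
    hi = defaultdict(dict)          # largest count seen per tag
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag not in record_tags:
            continue
        rtype = elem.tag
        totals[rtype] += 1
        for tag, n in Counter(child.tag for child in elem).items():
            present[rtype][tag] += 1
            lo[rtype][tag] = min(n, lo[rtype].get(tag, n))
            hi[rtype][tag] = max(n, hi[rtype].get(tag, n))
        elem.clear()
    # A tag missing from at least one record of its type has min = 0.
    return {
        rtype: {
            tag: (lo[rtype][tag] if present[rtype][tag] == totals[rtype] else 0,
                  hi[rtype][tag])
            for tag in sorted(hi[rtype])
        }
        for rtype in totals
    }

for rtype, ranges in tag_ranges(io.BytesIO(SAMPLE), {"article"}).items():
    line = " ".join(f"({tag},{a},{b})" for tag, (a, b) in ranges.items())
    print(f"{rtype} tag ranges: {line}")
```

In the sample, `author` appears twice in one record and not at all in the other, so it is reported as (author,0,2), while `title` and `year` appear exactly once in both and come out as (title,1,1) and (year,1,1).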

This is the analyzer I used: https://github.com/lvergergsk/dblpParser. I'm writing the parser now; feel free to discuss or contribute.

kenkangxgwe commented 6 years ago

Article Histogram: {ee={1=1667488, 2=68275}, note={1=839, 2=61}, editor={1=3, 2=5, 3=2, 4=2, 5=2}, year={1=1761497}, author={1=372504, 2=515270, 3=411305, 4=236108, 5=111568, 6=50516, 7=22804, 263=1, 8=11949, 9=6534, 10=4092, 11=2351, 12=1581, 13=1058, 14=787, 15=555, 16=381, 17=317, 18=234, 19=180, 20=176, 21=135, 22=112, 23=88, 24=81, 25=61, 26=48, 27=58, 28=57, 29=47, 30=47, 31=17, 287=2, 32=30, 33=27, 34=19, 35=15, 36=11, 37=13, 38=11, 39=6, 40=8, 41=4, 42=10, 43=7, 44=6, 45=6, 46=3, 47=8, 48=6, 49=2, 50=6, 51=2, 52=5, 53=1, 54=2, 55=5, 56=3, 57=5, 58=2, 59=2, 60=1, 61=2, 64=2, 65=1, 67=2, 68=1, 69=2, 71=1, 74=1, 75=2, 78=1, 79=1, 86=1, 92=1, 95=1, 96=1, 99=1, 101=1, 105=1, 112=1, 119=1}, title={1=1731386}, cdrom={1=4001}, article={1=1761503}, url={1=1760957}, volume={1=1760791}, number={1=1411077}, pages={1=1553175}, journal={1=1761273}, month={1=10600}, cite={1=155, 2=32, 3=25, 4=20, 5=41, 6=25, 7=23, 8=44, 9=39, 10=40, 11=38, 12=47, 13=43, 14=30, 15=48, 16=47, 17=43, 18=54, 19=39, 20=35, 21=44, 22=43, 23=35, 24=55, 25=39, 26=40, 27=37, 28=50, 29=36, 30=40, 31=39, 32=27, 33=27, 34=33, 35=32, 36=13, 37=24, 38=26, 39=11, 40=25, 41=19, 42=16, 43=12, 44=9, 45=16, 46=14, 47=10, 48=16, 49=9, 50=8, 51=9, 52=7, 53=12, 54=10, 55=8, 56=6, 57=3, 58=8, 59=1, 60=8, 61=3, 62=8, 63=6, 64=3, 65=5, 66=2, 67=2, 68=4, 69=2, 70=3, 71=4, 73=2, 74=1, 76=2, 78=1, 79=1, 81=5, 83=1, 84=1, 86=1, 87=1, 89=1, 90=1, 91=1, 92=1, 348=1, 94=1, 99=1, 100=1, 101=1, 105=1, 106=1, 107=1, 109=1, 114=1, 116=2, 117=1, 120=1, 123=1, 126=1, 136=1, 137=1, 140=1, 158=1, 159=1, 163=1, 165=2, 171=1, 172=1, 174=1, 194=1, 198=1, 205=1, 232=1, 249=1, 252=1}, publisher={1=228}, crossref={1=1886}, booktitle={1=223}}
Article Total: 1761503
Inproceedings Histogram: {ee={1=1600943, 2=359613, 3=14558, 5=2, 7=2}, note={1=228}, editor={1=3, 2=2, 3=3}, year={1=2114232}, author={1=275526, 2=593472, 3=570397, 4=355581, 5=170927, 6=77167, 7=32072, 8=15654, 9=7875, 10=4435, 11=2533, 139=1, 12=1570, 13=1065, 14=708, 15=526, 16=370, 17=265, 18=210, 19=146, 20=131, 21=103, 22=77, 23=53, 24=52, 25=51, 26=28, 27=32, 28=34, 29=20, 30=20, 31=15, 32=10, 33=9, 34=13, 35=6, 36=11, 37=11, 38=4, 39=7, 40=4, 41=3, 42=5, 43=1, 44=5, 45=4, 46=2, 47=2, 48=1, 49=1, 55=1, 56=1, 57=3, 60=1, 61=2, 62=1, 65=1, 70=1, 76=1, 77=3, 94=1, 102=1, 114=1}, title={1=2105658}, inproceedings={1=2114232}, cdrom={1=8052, 2=430}, url={1=2114231}, number={1=379}, pages={1=2004821}, month={1=1}, cite={1=41, 2=66, 3=77, 4=92, 5=91, 6=119, 7=125, 8=165, 9=175, 137=1, 10=234, 11=256, 12=282, 13=278, 14=313, 15=321, 16=291, 17=309, 18=265, 19=260, 20=267, 21=211, 22=229, 23=212, 24=176, 25=173, 26=169, 27=137, 28=125, 29=126, 30=99, 31=82, 32=71, 33=57, 34=61, 35=42, 36=46, 37=36, 38=27, 39=27, 40=36, 41=9, 42=24, 43=15, 44=20, 45=16, 46=10, 47=4, 48=4, 49=5, 50=7, 51=6, 52=3, 53=2, 54=7, 55=10, 56=3, 57=3, 58=2, 59=2, 60=1, 61=6, 62=4, 63=2, 64=3, 65=1, 66=1, 67=1, 68=1, 70=2, 71=2, 72=3, 73=1, 75=1, 76=1, 78=1, 79=1, 81=2, 85=1, 87=3, 88=1, 89=2, 95=1, 100=1, 101=1, 102=1, 122=1, 124=2}, crossref={1=2095909}, booktitle={1=2114232}}
Inproceedings Total: 2114232
Incollection Histogram: {ee={1=43852}, note={1=40939, 2=4810, 3=1168, 4=334, 5=96, 6=32, 7=7, 8=1}, chapter={1=2}, editor={1=1, 2=1, 4=2, 5=1, 6=1}, year={1=46342}, author={1=2006175, 2=44681, 3=8513, 4=3256, 5=1355, 6=659, 7=304, 8=188, 9=89, 10=57, 11=37, 12=22, 13=15, 14=11, 15=11, 16=14, 17=2, 18=2, 19=7, 21=1, 22=2, 25=1, 28=1, 29=2, 32=1, 50=1}, title={1=2074702}, cdrom={1=53}, url={1=77381, 2=7554, 3=3209, 4=1414, 5=812, 6=487, 7=407, 8=305, 9=233, 10=153, 11=79, 12=47, 13=22, 14=16, 15=4, 16=3, 17=1}, number={1=40}, pages={1=42604}, incollection={1=46325}, www={1=2028874}, cite={1=100, 2=6, 3=2, 4=1, 6=1, 7=1, 8=1, 9=2, 10=1, 11=1, 16=1, 17=2, 19=1, 20=1, 23=1, 87=1, 30=2, 31=1, 104=1, 40=1, 43=1, 44=2, 45=1, 49=1, 59=1, 60=1}, publisher={1=91}, crossref={1=42869}, booktitle={1=46326}}
Incollection Total: 46325
Proceedings Histogram: {ee={1=22618, 2=7008, 3=339, 4=2, 5=1}, editor={1=3755, 2=8824, 3=6866, 4=4425, 5=1944, 6=846, 7=399, 8=223, 9=111, 10=77, 11=45, 12=58, 13=21, 14=12, 15=11, 16=11, 17=6, 18=4, 19=4, 20=5, 21=1, 22=2, 23=1, 26=1, 27=1}, note={1=241, 2=6}, address={1=3}, year={1=35916}, author={1=2}, isbn={1=28729, 2=1471, 3=11}, title={1=35858}, url={1=35859, 2=2}, volume={1=16324, 2=1}, number={1=17}, pages={1=8}, journal={1=4}, series={1=16643, 2=2}, publisher={1=34699, 2=5}, cite={212=1}, proceedings={1=35916}, booktitle={1=35438}, crossref={1=10}}
Proceedings Total: 35916
Book Histogram: {ee={1=8778, 2=258, 3=31, 4=26, 5=10, 6=2, 7=1}, editor={1=209, 2=422, 3=382, 4=191, 5=57, 6=22, 7=5, 8=5, 10=1, 13=1}, note={1=5}, year={1=14923}, author={1=8229, 2=3409, 3=1414, 4=347, 5=124, 6=28, 7=29, 8=11, 9=6, 10=2, 12=5, 13=1, 15=1, 17=1, 18=1}, book={1=14923}, isbn={1=10805, 2=2077, 3=78, 4=9}, title={1=14910}, cdrom={1=1}, url={1=1567, 2=2}, volume={1=2813}, pages={1=10858, 2=5}, month={1=1}, school={1=1510, 2=66}, series={1=6281, 2=3}, publisher={1=13919, 2=1}, cite={643=1, 115=1, 741=1, 421=1, 342=1, 284=1, 156=1, 189=1, 365=1, 63=1}, booktitle={1=1269}}
Book Total: 14923
Website Histogram: {ee={1=1}, editor={1=1, 2=1, 4=2, 5=1, 6=1}, note={1=40917, 2=4810, 3=1168, 4=334, 5=96, 6=32, 7=7, 8=1}, year={1=17}, www={1=2028874}, author={1=1991382, 2=34707, 3=2243, 4=183, 5=19, 6=4, 10=1}, cite={1=85, 2=5, 4=1, 6=1, 30=1}, title={1=2028570}, booktitle={1=1}, crossref={1=304}, url={1=31056, 2=7554, 3=3209, 4=1414, 5=812, 6=487, 7=407, 8=305, 9=233, 10=153, 11=79, 12=47, 13=22, 14=16, 15=4, 16=3, 17=1}}
Website Total: 2028874

FYI, this is the data I parsed; it gives more detail on the distribution of counts for each attribute. What confuses me is that the totals do not match the result in https://github.com/lvergergsk/BibGallery-FrontEnd/issues/2#issuecomment-375998147
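
To compare the two outputs mechanically, the histogram text can be reduced to the same (min, max) form. This is a rough sketch under one assumption: that each inner entry `k=v` means "v records contain the tag exactly k times". `parse_histogram` and `ranges_from_histogram` are hypothetical helpers, and the sample is an excerpt of the Proceedings histogram above.

```python
import re

# Excerpt of the Proceedings histogram posted above (total: 35916).
SAMPLE = "{note={1=241, 2=6}, address={1=3}, year={1=35916}, title={1=35858}}"
TOTAL = 35916

def parse_histogram(text):
    """Parse '{tag={k=v, ...}, ...}' into {tag: {k: v}} with integer keys and values."""
    hist = {}
    for tag, body in re.findall(r"(\w+)=\{([^}]*)\}", text):
        pairs = (pair.split("=") for pair in body.split(", ") if pair)
        hist[tag] = {int(k): int(v) for k, v in pairs}
    return hist

def ranges_from_histogram(hist, total):
    """Return {tag: (min, max, records_with_tag)} under the assumption stated above."""
    out = {}
    for tag, dist in hist.items():
        records_with_tag = sum(dist.values())
        # If fewer records carry the tag than exist in total, some records lack it.
        low = min(dist) if records_with_tag >= total else 0
        out[tag] = (low, max(dist), records_with_tag)
    return out

for tag, row in ranges_from_histogram(parse_histogram(SAMPLE), TOTAL).items():
    print(tag, row)
```

Under that assumption, the excerpt gives `year` as (1, 1) but `title` as (0, 1), because only 35858 of the 35916 proceedings records are counted with a title. Rows like this are where the two results can be lined up and checked against each other.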

kenkangxgwe commented 6 years ago

Example CSV file for table ARTICLE: (screenshot attached)