iuni-cadre / Collaborative-projects

For non-fellow collaborative projects on CADRE

AI-literature paper extraction (Xuli Tang) #1

Open XiaoranYan opened 5 years ago

XiaoranYan commented 5 years ago

From: Yan, Xiaoran
Sent: Sunday, February 17, 2019 9:53 PM
To: Tang, Xuli <xulitang@iu.edu>
Cc: Ma, He <mahe@iu.edu>; Hutchinson, Matthew Alexander <maahutch@iu.edu>; Pentchev, Valentin <vpentche@iu.edu>; Patricia L Mabry (pmabry@iu.edu) <pmabry@iu.edu>; Ding, Ying <dingying@indiana.edu>
Subject: RE: Could you please help me with AI data sets?

Hi Xuli,

Here is my first take on your requested dataset. The data is from MAG and consists of three TSV files. You can download them using the following links.

AIpapersAll.tsv (0.8GB): contains all the papers in your lists of journals and conferences

https://iunimag.blob.core.windows.net/mag-2019-01-25/AIpapersAll.tsv?st=2019-02-18T02%3A27%3A56Z&se=2019-02-26T02%3A27%3A00Z&sp=rl&sv=2017-07-29&sr=b&sig=0wSNGWAM9MVG7zWUtSm1QpPMv4N%2BuCePkuAqZNU4NB0%3D

AIpapersOthers.tsv (7GB): contains all citing and cited documents (including all papers and patents in MAG) that are external to the listed journals and conferences

https://iunimag.blob.core.windows.net/mag-2019-01-25/AIpapersOthers.tsv?st=2019-02-18T02%3A30%3A38Z&se=2019-02-26T02%3A30%3A00Z&sp=rl&sv=2017-07-29&sr=b&sig=espQuP%2BSdRQvhv9CzKPfoMSKJHbvY%2FrbuhOobpFEPA0%3D

AIpapersCitationsAll.tsv (7GB): contains all citations with available citing context

https://iunimag.blob.core.windows.net/mag-2019-01-25/AIpapersCitationsAll.tsv?st=2019-02-18T02%3A32%3A30Z&se=2019-02-19T02%3A32%3A30Z&sp=rl&sv=2017-07-29&sr=b&sig=Sf9EoRTjbv2vDfWKu1sahHVkE27W%2BZM4%2FArMK1v%2B9Zk%3D

The links will be valid for a week. Please download the files and get back to me if you find any problems. In my experience, it takes a few updates to finalize a dataset as your research progresses. I will be producing a WoS dataset for the journals later this week.
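Since the links expire, it can help to check a link's expiry before starting a multi-gigabyte download. The following is a minimal sketch, assuming the links are Azure shared access signature (SAS) URLs whose `se` query parameter holds the expiry time in UTC. The URL in the example is a made-up placeholder with the same shape as the links above, and the function names are illustrative only.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit


def sas_expiry(url):
    """Return the expiry time stored in the `se` query parameter of a SAS URL."""
    query = parse_qs(urlsplit(url).query)
    # parse_qs percent-decodes values, so "02%3A27%3A00Z" becomes "02:27:00Z".
    expiry_text = query["se"][0]
    parsed = datetime.strptime(expiry_text, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def is_expired(url, now=None):
    """Return True if the link's expiry time has already passed."""
    now = now or datetime.now(timezone.utc)
    return now >= sas_expiry(url)


# Placeholder URL (hypothetical account, container, and signature).
example_url = (
    "https://example.blob.core.windows.net/container/file.tsv"
    "?st=2019-02-18T02%3A27%3A56Z&se=2019-02-26T02%3A27%3A00Z&sp=rl&sig=PLACEHOLDER"
)
print(sas_expiry(example_url).isoformat())
```

Running the same check on each link before downloading shows whether any of them expires earlier than expected.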

Thank you!

Xiaoran

From: Tang, Xuli
Sent: Wednesday, February 6, 2019 1:06 PM
To: Yan, Xiaoran <yan30@iu.edu>
Subject: Could you please help me with AI data sets?

Hi Xiaoran Yan,

Could you please help us select data from your database?

We want the following data:

Paper Set {All papers from the list of journals, all papers from the list of conferences}

Patent Set {All patents that have cited or referenced papers in the Paper Set}

Extend Set {All papers that have cited or referenced papers in the Paper Set}

We want all fields related to papers or patents.
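The three sets requested above can be sketched as set operations over a citation edge list. This is only an illustration under stated assumptions: the input structures (`edges`, `venue_of`, `doc_type`) and the function name are hypothetical, and "cited or referenced" is read as covering both directions, matching the description of AIpapersOthers.tsv as containing all citing and cited documents.

```python
def build_sets(edges, venue_of, doc_type, target_venues):
    """Split documents into Paper Set, Patent Set, and Extend Set.

    edges: iterable of (citing_id, cited_id) pairs (hypothetical layout)
    venue_of: dict mapping document id -> venue name
    doc_type: dict mapping document id -> "Paper" or "Patent"
    target_venues: set of journal and conference names of interest
    """
    # Paper Set: every document published in one of the listed venues.
    paper_set = {doc for doc, venue in venue_of.items() if venue in target_venues}

    # Documents outside the Paper Set that cite it or are cited by it.
    neighbours = set()
    for citing, cited in edges:
        if cited in paper_set:
            neighbours.add(citing)
        if citing in paper_set:
            neighbours.add(cited)
    neighbours -= paper_set

    # Patents go to the Patent Set, everything else to the Extend Set.
    patent_set = {doc for doc in neighbours if doc_type.get(doc) == "Patent"}
    extend_set = neighbours - patent_set
    return paper_set, patent_set, extend_set


# Toy data: documents 1 and 2 are in listed venues, 4 is a patent.
toy_venues = {1: "NIPS", 2: "ACL", 3: "Other"}
toy_types = {1: "Paper", 2: "Paper", 3: "Paper", 4: "Patent", 5: "Paper"}
toy_edges = [(4, 1), (3, 2), (1, 5), (5, 3)]
papers, patents, extended = build_sets(toy_edges, toy_venues, toy_types, {"NIPS", "ACL"})
print(papers, patents, extended)
```

If only documents that cite the Paper Set are wanted, the `if citing in paper_set` branch can be dropped.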

The journal list and conference list are attached.

Thank you!

Xuli Tang

XiaoranYan commented 5 years ago

From: Yan, Xiaoran
Sent: February 18, 2019 14:07
To: Tang, Xuli
Subject: Re: RE: Could you please help me with AI data sets?

I can certainly run it with the new list. However, can you first check the format of the results and confirm this is what you want?

Thank you! Xiaoran

On 2/18/19 2:03 PM, Tang, Xuli wrote:

Thank you very much!

I am sorry, we have some supplementary journals. Would you please help us add them?

补充期刊.docx contains the supplementary journals.

"attached from xuli updated All.docx" contains the complete lists.

Paper Set {All papers from the list of journals, all papers from the list of conferences}

Patent Set {All patents that have cited or referenced papers in the Paper Set}

Extend Set {All papers that have cited or referenced papers in the Paper Set}

We want all fields related to papers or patents.

I am sorry for the trouble.

Xuli Tang

XiaoranYan commented 5 years ago

Hi Prof. Xiaoran,

After full consideration, I think we can find the journals, conferences, and patents all in the MAG dataset, thanks to its good structure and rich links. Could you please help us get the datasets from MAG?

Xuli

From: Yan, Xiaoran
Sent: February 22, 2019 21:22:54
To: Tang, Xuli
Subject: Re: RE: RE: RE: Could you please help me with AI data sets?

The WoS file is in CSV format, not TSV. Please adjust your delimiter to "," and see if it works.
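To illustrate the delimiter issue, here is a minimal sketch of peeking at the first rows of a gzip-compressed CSV file with Python's standard library. The helper name `peek_rows`, the UTF-8 encoding, and the tiny sample file are assumptions for illustration; the real file's encoding and columns may differ.

```python
import csv
import gzip
import os
import tempfile
from itertools import islice


def peek_rows(path, n=100, delimiter=","):
    """Return the first n parsed rows of a gzip-compressed delimited text file."""
    # The encoding is an assumption; adjust it if the file is not UTF-8.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        return list(islice(reader, n))


# Demo on a tiny throwaway file, not on the real dataset.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.csv.gz")
with gzip.open(sample_path, mode="wt", encoding="utf-8", newline="") as handle:
    csv.writer(handle).writerows([["id", "title"], ["1", "Learning, fast and slow"]])

comma_rows = peek_rows(sample_path)                # fields are split on commas
tab_rows = peek_rows(sample_path, delimiter="\t")  # each line stays one unsplit field
print(comma_rows[1])
print(tab_rows[1])
```

Reading a CSV file with a tab delimiter leaves every line as a single field, which is one way a file can appear to have "no format at all".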

Xiaoran

On 2/22/19 8:27 PM, Tang, Xuli wrote:

Hi Xiaoran,

Thanks for the WoS dataset. I downloaded the WoS data and used a Python script to fetch 100 rows to check the format. However, it does not look right: it seems to have no format at all. I have attached the output.

Xuli

From: Yan, Xiaoran
Sent: February 22, 2019 17:13
To: Tang, Xuli
Subject: RE: RE: RE: Could you please help me with AI data sets?

Hi Xuli,

This is very helpful. It seems the journals are indeed missing, but conferences like “SIGKDD” are just called “KDD” in MAG. I will rerun the queries and try to match as many of these conference names as possible.

For comparison, I have also produced a similar dataset from WoS. It only has journal papers and is compressed in gz format due to Postgres’s poor text encoding. Our WoS Postgres server is much less efficient than Azure, and it will take much longer to produce the Extend Set. You can download the WoS papers here:

https://iunimag.blob.core.windows.net/mag-2019-01-25/AIpapersWoS.csv.gz?st=2019-02-22T21%3A50%3A26Z&se=2019-03-02T21%3A50%3A00Z&sp=rl&sv=2017-07-29&sr=b&sig=gUwI9DlgRSIYWLMvncoXAaP5bGVaaB6qKIbs4wJ5nmk%3D

Please check if WoS data quality is better for journals. If MAG data is good enough, I think it is a much better dataset as it includes citations between journals, conferences and patents.

Best,

Xiaoran

From: Tang, Xuli
Sent: Friday, February 22, 2019 1:44 PM
To: Yan, Xiaoran <yan30@iu.edu>
Subject: RE: RE: RE: Could you please help me with AI data sets?

Hi Prof. Xiaoran,

These three journals do not exist in the MAG dataset:

COMPUTER SPEECH AND LANGUAGE

COMPUTING AND INFORMATICS

MACHINE VISION AND APPLICATIONS

The conferences listed below are also missing:

PVLDB

SIGKDD

JMLR

FUZZ-IEEE

KI

DLog

Please help us check and figure out whether this is caused by misspellings.

Thank you!

Have a nice day!

Sorry for the poor formatting of my last email!

Xuli

XiaoranYan commented 5 years ago

All missing journals are now matched, with the following spelling corrections:

COMPUTER SPEECH AND LANGUAGE -> Computer Speech & Language

MACHINE VISION AND APPLICATIONS -> Journal of Machine Vision and Applications

COMPUTING AND INFORMATICS -> Computing and Informatics Computers and Artificial Intelligence (merged with the following newly added journal)

COMPUTERS AND ARTIFICIAL INTELLIGENCE -> see above

IEEE TRANSACTIONS ON SYSTEMS MAN AND CYBERNETICS PART B-CYBERNETICS and IEEE TRANSACTIONS ON SYSTEMS MAN AND CYBERNETICS PART C-APPLICATIONS AND RE -> IEEE Transactions on Systems, Man, and Cybernetics

Almost all missing conferences are also matched, with the following corrections:

PVLDB -> VLDB

SIGKDD -> KDD

JMLR -> Journal of Machine Learning Research, which is already covered in the journal list

FUZZ-IEEE -> FUZZIEEE

KI -> kunstliche intelligenz, recognized as a journal in MAG and added to the journal list

DLog -> I cannot find any information related to "DLog". Can you provide a link to this conference?
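Corrections like these can be kept in a small alias table so that later reruns apply them automatically. The following is a minimal sketch, not the matching procedure actually used here: the alias entries are taken from the corrections above, while the normalization rules and function names are assumptions.

```python
import re

# Requested name (upper case) -> name used in MAG, from the corrections above.
VENUE_ALIASES = {
    "PVLDB": "VLDB",
    "SIGKDD": "KDD",
    "FUZZ-IEEE": "FUZZIEEE",
    "COMPUTER SPEECH AND LANGUAGE": "Computer Speech & Language",
    "MACHINE VISION AND APPLICATIONS": "Journal of Machine Vision and Applications",
}


def normalize(name):
    """Lower-case, treat '&' as 'and', drop punctuation, collapse whitespace."""
    name = name.casefold().replace("&", " and ")
    # Anything other than ASCII letters, digits, and spaces becomes a space.
    name = re.sub(r"[^a-z0-9 ]+", " ", name)
    return " ".join(name.split())


def match_venue(requested, mag_names):
    """Return the MAG venue name matching a requested name, or None."""
    candidate = VENUE_ALIASES.get(requested.upper(), requested)
    index = {normalize(name): name for name in mag_names}
    return index.get(normalize(candidate))


mag_names = ["VLDB", "KDD", "FUZZIEEE", "Computer Speech & Language"]
for requested in ["SIGKDD", "FUZZ-IEEE", "computer speech and language", "DLog"]:
    print(requested, "->", match_venue(requested, mag_names))
```

Names that return `None`, such as "DLog" in this toy list, still need manual follow-up.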

XiaoranYan commented 5 years ago

Hi Prof Xiaoran,

I have received your data. You did a great job, and we appreciate all your help.

Thank you very much!

Have a wonderful night!

Xuli

2/27/2019

XiaoranYan commented 5 years ago

Hi Prof. Xiaoran Yan,

Happy to know you will join us! I think we will start soon, and I will let you know when we get started.

The GitHub repo data is a very good resource. Is it possible to fuse/link it to our AI papers?

Have a wonderful night!

Xuli Tang

2019/4/3

From: Yan, Xiaoran
Sent: April 3, 2019 0:25
To: Tang, Xuli
Subject: Re: May I invite you to join our research?

Certainly. Let me know when you are ready to discuss your plan. I can join your group meeting or have an individual discussion if you prefer.

By the way, in terms of topic diversity, MAG has a field of study tag inferred from the full text for each paper. They also recently added a paper resources table, which contains the GitHub repos associated with each paper. I am not sure how useful they will be, but they are worth considering.

Thanks!

Xiaoran

XiaoranYan commented 5 years ago

Hi Xuli,

I did a quick search and found that only a small proportion of the core Paper Set has code (3889/477604) or data (842/477604) listed. Another 2555 papers have their project web page listed.
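As a quick check on how small these proportions are, the counts above can be converted to percentages. The snippet below is just that arithmetic; the variable names are illustrative.

```python
# Counts reported above for the core Paper Set.
PAPER_SET_SIZE = 477_604
listed = {"code": 3_889, "data": 842, "project page": 2_555}

# Share of the core Paper Set with each kind of resource, in percent.
shares = {kind: count / PAPER_SET_SIZE * 100 for kind, count in listed.items()}

for kind, percent in shares.items():
    print(f"{kind}: {percent:.2f}% of the core Paper Set")
```

Each of the three shares comes out below one percent.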

Not sure if the data is good enough, but you can download it from https://iunimag.blob.core.windows.net/mag-2019-01-25/AIpapersCodeData.tsv?st=2019-04-15T19%3A08%3A57Z&se=2019-04-23T19%3A08%3A00Z&sp=rl&sv=2017-07-29&sr=b&sig=krZNbsYbJeeAzvRzoVCItLMGBsBduQTEhXuWIX%2B%2Fp2o%3D

Let me know if you still want such data for the Extend Set and the Patent Set.

Xiaoran

XiaoranYan commented 5 years ago

Hi Prof. Xiaoran,

Long time no see. I hope everything is going well for you.

I have several questions about the patents in the MAG dataset. Please help me.

(1) What is the source of the patents? Where were they crawled from?

(2) I found that the patents you sent me earlier are papers, which confused me. Why don't they have patent numbers?

131199769; 2520973981; 2572574735; 2776697666; 2201421368 Branislav Kveton; Long Tran Thanh; Hung Bui; Jaya Kawale; Sanjay Chawla 3; 4; 2; 1; 5 Adobe Research, San Jose, CA#TAB#; University of Southampton, Southampton, UK#TAB#; Adobe Research, San Jose, CA#TAB#; Adobe Research, San Jose, CA#TAB#; Qatar Computing Research Institute, Qatar, University of Sydney, Australia#TAB# 1306409833; 43439940; 1306409833; 1306409833; 28200790 NIPS 2015 478 13559 2182342230 19703   Patent 639028944   Efficient Thompson sampling for online matrix-factorization recommendation 2015 2015-12-07T00:00:00.0000000 MIT Press
1347616180 Eugene Tuv 1 Intel Corporation (Santa Clara, CA, US) 1343180700 Applied Artificial Intelligence 1200 19203 2043922977 21227 10.1080/713827172 Patent 125501549   Processing of high-dimensional categorical predictors in classification settings 2003 2003-05-01T00:00:00.0000000 Taylor & Francis Group
162341580 Brendt Wohlberg 1 Theor. Div., Los Alamos Nat. Lab., Los Alamos, NM, USA 1343871089 ICASSP 2014 1698 11766 2029507661 19557 10.1109/ICASSP.2014.6854992 Patent 183914221   Efficient convolutional sparse coding 2014 2014-05-01T00:00:00.0000000 IEEE
1618661958; 2565030873; 380014765; 2341538708 Silvio Savarese; JunYoung Gwak; Manmohan Chandraker; Christopher Bongsoo Choy 3; 2; 4; 1 NEC Laboratories America, Inc. (Princeton, NJ, US); NEC Laboratories America, Inc. (Princeton, NJ, US); NEC Laboratories America, Inc. (Princeton, NJ, US); NEC Laboratories America, Inc. (Princeton, NJ, US) 20089843; 20089843; 20089843; 20089843 NIPS 2016 653 11351 2435623039 19046   Patent 2314196013   UNIVERSAL CORRESPONDENCE NETWORK 2016 2016-01-01T00:00:00.0000000
1853234685; 2106104795 Daniel Marcu; Alexander M. Fraser 2; 1 University of Southern California, Marina del Rey, CA#TAB#; University of Southern California, Marina del Rey, CA#TAB# 1174212; 1174212 ACL 2006 406 15778 2140702357 19206 10.3115/1220175.1220272 Patent 2787860662   Semi-Supervised Training for Statistical Word Alignment 2006 2006-07-01T00:00:00.0000000 Association for Computational Linguistics
1883541715; 2805617326; 2148168557; 1830956630 Nirmala Ramanujam; Rebecca Richards-Kortum; Joydeep Ghosh; Kagan Tumer 2; 3; 4; 1 Biomedical Engineering Program, The University of Texas at Austin#TAB#; Biomedical Engineering Program, The University of Texas at Austin#TAB#; Dept. of Electrical and Computer Engr., The University of Texas at Austin#TAB#; Dept. of Electrical and Computer Engr., The University of Texas at Austin#TAB# 86519309; 86519309; 86519309; 86519309 NIPS 1996 163 9275 2118353561 20154   Patent 2784906694   Spectroscopic Detection of Cervical Pre-Cancer through Radial Basis Function Networks 1996 1996-01-01T00:00:00.0000000 MIT Press
1853234685; 2688706311 Daniel Marcu; Daniel Wong 1; 2 University of Southern California, Marina del Rey, CA#TAB#; Language Weaver Inc., Santa Monica, CA#TAB# 1174212; EMNLP 2002 44 9104 2161792612 17497 10.3115/1118693.1118711 Patent 2789444254   A Phrase-Based,Joint Probability Model for Statistical Machine Translation 2002 2002-07-01T00:00:00.0000000 Association for Computational Linguistics
187637129; 2157233933 Markus Svensén; Christopher M. Bishop 1; 2 Microsoft Research, 7 J J Thomson Avenue, Cambridge CB3 0FB, UK#TAB#; Microsoft Research, 7 J J Thomson Avenue, Cambridge CB3 0FB, UK#TAB# 1290206253; 1290206253 Neurocomputing 13868 179768 2167823677 18665 10.1016/j.neucom.2004.11.018 Patent 45693802   Robust Bayesian mixture modelling 2005 2005-03-01T00:00:00.0000000 Elsevier Science Publishers B. V.

XiaoranYan commented 5 years ago

Hi Xuli Tang,

The patent data is from MAG, which in turn comes from Lens.org https://www.lens.org/

Upon further inspection on lens.org, it seems the listed documents do have valid patents associated with them. For example: https://www.lens.org/lens/patent/124-189-868-017-420

The previous MAG data did not contain patent numbers. The recent update on 07/30/2019 included this new information. Let me know if you are interested in an updated dataset with the new information.

You should have received an official response for your CADRE fellow application. Although your proposal was not selected, you are still eligible to receive continued technical/data support from our team. Please use this Github channel for follow-up communications.

Thanks!

Xiaoran