Open XiaoranYan opened 5 years ago
From: Yan, Xiaoran Sent: February 18, 2019 14:07 To: Tang, Xuli Subject: Re: RE: Could you please help me with AI data sets?
I can certainly run it with the new list. However, can you first check the format of the results and confirm this is what you want?
Thank you! Xiaoran
On 2/18/19 2:03 PM, Tang, Xuli wrote:
Thank you very much!
I am sorry, we have some supplementary journals. Would you please help us add them?
补充期刊.docx contains the supplementary journals.
The attached updated All.docx (from Xuli) contains the complete lists.
Paper Set {All papers from the list of journals, all papers from the list of conferences}
Patent Set {All patents that have cited or referenced papers in the Paper Set}
Extend Set {All papers that have cited or referenced papers in the Paper Set}
We want all fields related to papers or patents.
I am sorry to trouble you with this.
Xuli Tang
Hi Prof. Xiaoran,
After full consideration, I think we can get the journals, conferences, and patents all from the MAG dataset, given its good structure and rich links. Could you please help us get the datasets from MAG?
Xuli
From: Yan, Xiaoran Sent: February 22, 2019 21:22:54 To: Tang, Xuli Subject: Re: RE: RE: RE: Could you please help me with AI data sets?
The WoS file is in CSV format, not TSV. Please adjust your delimiter to "," and see if it works.
Xiaoran
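A minimal sketch of the delimiter fix described above, assuming the WoS download is a gzip-compressed, comma-delimited text file. The helper name `preview_rows` and the tiny sample file are hypothetical; the real WoS column layout is not shown in this thread.

```python
import csv
import gzip
import os
import tempfile


def preview_rows(path, n=100, delimiter=","):
    """Return the first n parsed rows of a gzip-compressed delimited file."""
    rows = []
    # newline="" lets the csv module handle quoted fields containing line breaks.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        for i, row in enumerate(reader):
            if i >= n:
                break
            rows.append(row)
    return rows


# Build a tiny stand-in file (hypothetical content, not real WoS data).
sample_path = os.path.join(tempfile.mkdtemp(), "sample.csv.gz")
with gzip.open(sample_path, mode="wt", encoding="utf-8", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["id", "title"])
    writer.writerow(["1", "A title, with a comma"])

as_csv = preview_rows(sample_path, delimiter=",")   # fields split correctly
as_tsv = preview_rows(sample_path, delimiter="\t")  # each line is one big field
```

Parsing a CSV file with a tab delimiter leaves every line as a single field, which matches the "no format at all" symptom reported above.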
On 2/22/19 8:27 PM, Tang, Xuli wrote:
Hi Xiaoran,
Thanks for your WoS dataset. I downloaded the WoS data and used a Python script to fetch 100 rows to check the format. But it does not look right: it seems to have no structure at all. I have attached the sample.
Xuli
From: Yan, Xiaoran Sent: February 22, 2019 17:13 To: Tang, Xuli Subject: RE: RE: RE: Could you please help me with AI data sets?
Hi Xuli,
This is very helpful. It seems the journals are indeed missing, but conferences like “SIGKDD” are just called “KDD” in MAG. I will rerun the queries and try to match as many of these conference names as possible.
For comparison, I have also produced a similar dataset from WoS. It only has journal papers and is compressed in gz format due to postgres’s bad text encoding. Our WoS postgres server is much less efficient than Azure, and it will take much longer to get the Extend Set. You can download the WoS papers here:
Please check if WoS data quality is better for journals. If MAG data is good enough, I think it is a much better dataset as it includes citations between journals, conferences and patents.
Best,
Xiaoran
From: Tang, Xuli Sent: Friday, February 22, 2019 1:44 PM To: Yan, Xiaoran yan30@iu.edu Subject: RE: RE: RE: Could you please help me with AI data sets?
Hi Prof. Xiaoran,
These three journals do not exist in the MAG dataset:
COMPUTER SPEECH AND LANGUAGE
COMPUTING AND INFORMATICS
MACHINE VISION AND APPLICATIONS
The conferences listed below are also missing:
PVLDB
SIGKDD
JMLR
FUZZ-IEEE
KI
DLog
Please help us check them and figure out whether this is caused by misspellings.
Thank you!
Have a nice day!
Sorry for the poor formatting of my last email!
Xuli
All missing journals are now matched, with the following spelling correction:
COMPUTER SPEECH AND LANGUAGE -> Computer Speech & Language
MACHINE VISION AND APPLICATIONS -> Journal of Machine Vision and Applications
COMPUTING AND INFORMATICS -> Computing and Informatics Computers and Artificial Intelligence (merged with following new added journals)
COMPUTERS AND ARTIFICIAL INTELLIGENCE -> see above
IEEE TRANSACTIONS ON SYSTEMS MAN AND CYBERNETICS PART B-CYBERNETICS
IEEE TRANSACTIONS ON SYSTEMS MAN AND CYBERNETICS PART C-APPLICATIONS AND RE --> IEEE Transactions on Systems, Man, and Cybernetics
Almost all missing conferences are also matched, with the following corrections:
PVLDB -> VLDB
SIGKDD -> KDD
JMLR -> Journal of Machine Learning Research is already covered in the journal list
FUZZ-IEEE -> FUZZIEEE
KI -> kunstliche intelligenz, recognized as a journal in MAG and added to the journal list
DLog -> cannot find any information related to "DLog", can you provide a link to this conference?
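The spelling corrections above amount to an alias table applied before matching against MAG venue names. A rough sketch follows, assuming case-insensitive exact matching after aliasing. The helpers `normalize`, `to_mag_name`, and `find_unmatched` are hypothetical names, and only the unambiguous mappings reported in this thread are included (JMLR and KI were handled through the journal list instead).

```python
# Corrections reported in this thread: requested name -> name used in MAG.
VENUE_ALIASES = {
    "COMPUTER SPEECH AND LANGUAGE": "Computer Speech & Language",
    "MACHINE VISION AND APPLICATIONS": "Journal of Machine Vision and Applications",
    "PVLDB": "VLDB",
    "SIGKDD": "KDD",
    "FUZZ-IEEE": "FUZZIEEE",
}


def normalize(name):
    """Collapse whitespace and upper-case a venue name (hypothetical helper)."""
    return " ".join(name.split()).upper()


def to_mag_name(requested):
    """Return the MAG spelling for a requested venue, or the input if unmapped."""
    return VENUE_ALIASES.get(normalize(requested), requested)


def find_unmatched(requested_names, mag_names):
    """List requested venues with no MAG counterpart even after aliasing."""
    known = {normalize(n) for n in mag_names}
    return [r for r in requested_names if normalize(to_mag_name(r)) not in known]


# Illustrative venue names only, not a real MAG export.
mag_venues = ["KDD", "VLDB", "Computer Speech & Language"]
missing = find_unmatched(["SIGKDD", "PVLDB", "DLog"], mag_venues)
print(missing)  # ['DLog']
```

Anything left in `missing` after aliasing, such as "DLog" here, still needs a manual lookup.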
Hi Prof Xiaoran,
I have received your data, you did a great job, we appreciate all your help.
Thank you very much
Have a wonderful night!
Xuli
2/27/2019
Hi Prof. Xiaoran Yan,
Happy to know you will join us! I think we will start soon; I will inform you when we get started.
The GitHub repo data is a very good resource. Is it possible to fuse/link it to our AI papers?
Have a wonderful night! Xuli Tang
2019/4/3
From: Yan, Xiaoran Sent: April 3, 2019 0:25 To: Tang, Xuli Subject: Re: May i invite you to join our research?
Certainly. Let me know when you are ready to discuss your plan. I can join your group meeting or have an individual discussion if you prefer.
By the way, in terms of topic diversity, MAG has a field of study tag inferred from the full text for each paper. They also recently added a paper resources table that contains the GitHub repos associated with each paper. Not sure how useful they will be, but they are worth considering.
Thanks!
Xiaoran
Hi Xuli,
I did a quick search and found that only a small proportion of the core Paper Set has code (3889/477604) or data (842/477604) listed. Another 2555 papers have their project web page listed.
Not sure if the data is good enough, but you can download it from https://iunimag.blob.core.windows.net/mag-2019-01-25/AIpapersCodeData.tsv?st=2019-04-15T19%3A08%3A57Z&se=2019-04-23T19%3A08%3A00Z&sp=rl&sv=2017-07-29&sr=b&sig=krZNbsYbJeeAzvRzoVCItLMGBsBduQTEhXuWIX%2B%2Fp2o%3D
Let me know if you still want such data for the Extend Set and the Patent Set.
Xiaoran
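To put the counts quoted above in perspective, the short calculation below converts them to percentages of the core Paper Set. It uses only the numbers from the message; the variable and function names are illustrative.

```python
CORE_PAPERS = 477604  # size of the core Paper Set quoted above

counts = {
    "code": 3889,
    "data": 842,
    "project page": 2555,
}


def share_percent(count, total):
    """Return count as a percentage of total."""
    return 100.0 * count / total


shares = {label: share_percent(n, CORE_PAPERS) for label, n in counts.items()}
for label, pct in shares.items():
    print(f"{label}: {pct:.2f}%")
```

Each category covers less than 1% of the core Paper Set.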
Hi Prof. Xiaoran,
Long time no see. I hope everything is going well for you.
I have several questions about patents in the MAG dataset. Please help me.
(1) What is the source of the patents? Where did you crawl them from?
(2) I found that the patents you sent me earlier are papers, which confused me. Why don't they have patent numbers?
131199769; 2520973981; 2572574735; 2776697666; 2201421368 | Branislav Kveton; Long Tran Thanh; Hung Bui; Jaya Kawale; Sanjay Chawla | 3; 4; 2; 1; 5 | Adobe Research, San Jose, CA#TAB#; University of Southampton, Southampton, UK#TAB#; Adobe Research, San Jose, CA#TAB#; Adobe Research, San Jose, CA#TAB#; Qatar Computing Research Institute, Qatar, University of Sydney, Australia#TAB# | 1306409833; 43439940; 1306409833; 1306409833; 28200790 | NIPS 2015 | 478 | 13559 | 2182342230 | 19703 | Patent | 639028944 | Efficient Thompson sampling for online matrix-factorization recommendation | 2015 | 2015-12-07T00:00:00.0000000 | MIT Press | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1347616180 | Eugene Tuv | 1 | Intel Corporation (Santa Clara, CA, US) | 1343180700 | Applied Artificial Intelligence | 1200 | 19203 | 2043922977 | 21227 | 10.1080/713827172 | Patent | 125501549 | Processing of high-dimensional categorical predictors in classification settings | 2003 | 2003-05-01T00:00:00.0000000 | Taylor & Francis Group | |
162341580 | Brendt Wohlberg | 1 | Theor. Div., Los Alamos Nat. Lab., Los Alamos, NM, USA | 1343871089 | ICASSP 2014 | 1698 | 11766 | 2029507661 | 19557 | 10.1109/ICASSP.2014.6854992 | Patent | 183914221 | Efficient convolutional sparse coding | 2014 | 2014-05-01T00:00:00.0000000 | IEEE | |
1618661958; 2565030873; 380014765; 2341538708 | Silvio Savarese; JunYoung Gwak; Manmohan Chandraker; Christopher Bongsoo Choy | 3; 2; 4; 1 | NEC Laboratories America, Inc. (Princeton, NJ, US); NEC Laboratories America, Inc. (Princeton, NJ, US); NEC Laboratories America, Inc. (Princeton, NJ, US); NEC Laboratories America, Inc. (Princeton, NJ, US) | 20089843; 20089843; 20089843; 20089843 | NIPS 2016 | 653 | 11351 | 2435623039 | 19046 | Patent | 2314196013 | UNIVERSAL CORRESPONDENCE NETWORK | 2016 | 2016-01-01T00:00:00.0000000 | |||
1853234685; 2106104795 | Daniel Marcu; Alexander M. Fraser | 2; 1 | University of Southern California, Marina del Rey, CA#TAB#; University of Southern California, Marina del Rey, CA#TAB# | 1174212; 1174212 | ACL 2006 | 406 | 15778 | 2140702357 | 19206 | 10.3115/1220175.1220272 | Patent | 2787860662 | Semi-Supervised Training for Statistical Word Alignment | 2006 | 2006-07-01T00:00:00.0000000 | Association for Computational Linguistics | |
1883541715; 2805617326; 2148168557; 1830956630 | Nirmala Ramanujam; Rebecca Richards-Kortum; Joydeep Ghosh; Kagan Tumer | 2; 3; 4; 1 | Biomedical Engineering Program, The University of Texas at Austin#TAB#; Biomedical Engineering Program, The University of Texas at Austin#TAB#; Dept. of Electrical and Computer Engr., The University of Texas at Austin#TAB#; Dept. of Electrical and Computer Engr., The University of Texas at Austin#TAB# | 86519309; 86519309; 86519309; 86519309 | NIPS 1996 | 163 | 9275 | 2118353561 | 20154 | Patent | 2784906694 | Spectroscopic Detection of Cervical Pre-Cancer through Radial Basis Function Networks | 1996 | 1996-01-01T00:00:00.0000000 | MIT Press | ||
1853234685; 2688706311 | Daniel Marcu; Daniel Wong | 1; 2 | University of Southern California, Marina del Rey, CA#TAB#; Language Weaver Inc., Santa Monica, CA#TAB# | 1174212; | EMNLP 2002 | 44 | 9104 | 2161792612 | 17497 | 10.3115/1118693.1118711 | Patent | 2789444254 | A Phrase-Based,Joint Probability Model for Statistical Machine Translation | 2002 | 2002-07-01T00:00:00.0000000 | Association for Computational Linguistics | |
187637129; 2157233933 | Markus Svensén; Christopher M. Bishop | 1; 2 | Microsoft Research, 7 J J Thomson Avenue, Cambridge CB3 0FB, UK#TAB#; Microsoft Research, 7 J J Thomson Avenue, Cambridge CB3 0FB, UK#TAB# | 1290206253; 1290206253 | Neurocomputing | 13868 | 179768 | 2167823677 | 18665 | 10.1016/j.neucom.2004.11.018 | Patent | 45693802 | Robust Bayesian mixture modelling | 2005 | 2005-03-01T00:00:00.0000000 | Elsevier Science Publishers B. V. |
Hi Xuli Tang,
The patent data is from MAG, which in turn comes from Lens.org https://www.lens.org/
Upon further inspection on lens.org, it seems the listed documents do have valid patents associated with them. For example: https://www.lens.org/lens/patent/124-189-868-017-420
The previous MAG data did not contain patent numbers. The recent update on 07/30/2019 included this new information. Let me know if you are interested in an updated dataset with the new information.
You should have received an official response to your CADRE fellow application. Although your proposal was not selected, you are still eligible to receive continued technical/data support from our team. Please use this GitHub channel for follow-up communications.
Thanks!
Xiaoran
From: Yan, Xiaoran Sent: Sunday, February 17, 2019 9:53 PM To: Tang, Xuli xulitang@iu.edu Cc: Ma, He mahe@iu.edu; Hutchinson, Matthew Alexander maahutch@iu.edu; Pentchev, Valentin vpentche@iu.edu; Patricia L Mabry (pmabry@iu.edu) pmabry@iu.edu; Ding, Ying dingying@indiana.edu Subject: RE: Could you please help me with AI data sets?
Hi Xuli,
Here is my first take on your requested dataset. The data is from MAG and consists of three TSV files. You can download them with the following links.
AIpapersAll.tsv (0.8GB): contains all the papers in your lists of journals and conferences
https://iunimag.blob.core.windows.net/mag-2019-01-25/AIpapersAll.tsv?st=2019-02-18T02%3A27%3A56Z&se=2019-02-26T02%3A27%3A00Z&sp=rl&sv=2017-07-29&sr=b&sig=0wSNGWAM9MVG7zWUtSm1QpPMv4N%2BuCePkuAqZNU4NB0%3D
AIpapersOthers.tsv (7GB): contains all citing and cited documents (including all papers and patents in MAG) that are external to the listed journals and conferences
https://iunimag.blob.core.windows.net/mag-2019-01-25/AIpapersOthers.tsv?st=2019-02-18T02%3A30%3A38Z&se=2019-02-26T02%3A30%3A00Z&sp=rl&sv=2017-07-29&sr=b&sig=espQuP%2BSdRQvhv9CzKPfoMSKJHbvY%2FrbuhOobpFEPA0%3D
AIpapersCitationsAll.tsv (7GB): contains all citations with available citing context
https://iunimag.blob.core.windows.net/mag-2019-01-25/AIpapersCitationsAll.tsv?st=2019-02-18T02%3A32%3A30Z&se=2019-02-19T02%3A32%3A30Z&sp=rl&sv=2017-07-29&sr=b&sig=Sf9EoRTjbv2vDfWKu1sahHVkE27W%2BZM4%2FArMK1v%2B9Zk%3D
The links will be valid for a week. Please download the files and get back to me if you find any problems. In my experience, it takes a few updates to finalize the dataset as your research progresses. I will be producing a WoS dataset for the journals later this week.
Thank you!
Xiaoran
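Since two of the TSV files above are about 7GB each, a first format check is easier done row by row than by loading a whole file. A rough sketch follows, assuming plain tab-separated text without quoted fields (not confirmed in this thread). The helper names are hypothetical and the in-memory sample stands in for the real file.

```python
import csv
import io


def iter_tsv_rows(handle):
    """Yield rows from a tab-separated stream one at a time."""
    # QUOTE_NONE treats quote characters as ordinary text (an assumption
    # about the export; adjust if the files turn out to quote fields).
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row


def column_count_histogram(handle):
    """Count how many rows have each number of columns."""
    histogram = {}
    for row in iter_tsv_rows(handle):
        histogram[len(row)] = histogram.get(len(row), 0) + 1
    return histogram


# Stand-in for open("AIpapersAll.tsv", encoding="utf-8", newline="").
sample = io.StringIO('1\tPaper A\t2015\n2\tPaper "B"\t2016\n3\tbroken row\n')
histogram = column_count_histogram(sample)
print(histogram)  # {3: 2, 2: 1}
```

A histogram with more than one key suggests rows with missing or extra fields that are worth inspecting before further processing.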
From: Tang, Xuli Sent: Wednesday, February 6, 2019 1:06 PM To: Yan, Xiaoran yan30@iu.edu Subject: Could you please help me with AI data sets?
Hi Xiaoran Yan,
Could you please help us select data from your database?
We want the data:
Paper Set {All papers from the list of journals, all papers from the list of conferences}
Patent Set {All patents that have cited or referenced papers in the Paper Set}
Extend Set {All papers that have cited or referenced papers in the Paper Set}
We want all fields related to papers or patents.
The journal list and conference list are attached.
Thank you!
Xuli Tang
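The three sets requested above can be sketched as simple set operations over citation pairs. This is one possible reading of the request, not the query actually run against MAG: it assumes "cited or referenced" covers both citation directions, and the document type labels "Paper" and "Patent", the function name `build_sets`, and the toy records are all hypothetical.

```python
def build_sets(doc_venue, doc_type, citations, target_venues):
    """Derive the three requested sets from metadata and citation pairs.

    doc_venue: dict of document id -> venue name
    doc_type: dict of document id -> "Paper" or "Patent" (hypothetical labels)
    citations: iterable of (citing_id, cited_id) pairs
    target_venues: set of venue names from the journal and conference lists
    """
    paper_set = {
        doc for doc, venue in doc_venue.items()
        if venue in target_venues and doc_type.get(doc) == "Paper"
    }
    patent_set, extend_set = set(), set()
    for citing, cited in citations:
        # Look at each citation from both ends: a neighbour qualifies if it
        # cites, or is cited by, a document in the Paper Set.
        for inside, outside in ((citing, cited), (cited, citing)):
            if inside not in paper_set or outside in paper_set:
                continue
            if doc_type.get(outside) == "Patent":
                patent_set.add(outside)
            else:
                # Documents with an unknown type also land here.
                extend_set.add(outside)
    return paper_set, patent_set, extend_set


# Toy records for illustration only.
doc_venue = {"p1": "KDD", "p2": "Other Venue", "p3": "Other Venue", "t1": None}
doc_type = {"p1": "Paper", "p2": "Paper", "p3": "Paper", "t1": "Patent"}
citations = [("p2", "p1"), ("p1", "p3"), ("t1", "p1"), ("p2", "p3")]

paper_set, patent_set, extend_set = build_sets(
    doc_venue, doc_type, citations, {"KDD"}
)
```

In this toy example the Paper Set is `{"p1"}`, the Patent Set is `{"t1"}`, and the Extend Set is `{"p2", "p3"}`; the citation between `p2` and `p3` is ignored because neither end is in the Paper Set.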