iuni-cadre / Collaborative-projects

For non-fellow collaborative projects on CADRE

WoS citations for pubmed papers (Xin Li) #4

Open lucian-whu opened 5 years ago

lucian-whu commented 5 years ago

Dear Xiaoran,

Could you write me a file that contains the citation relationships of papers in the whole Web of Science? If possible, each line can be organized as "citing paper (WOS No.|pmid|doi|published year) \t cited paper(WOS No.|pmid|doi|published year)".
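A minimal sketch of how one line in the proposed layout could be parsed, assuming the four fields always appear in the order given, that an empty field means the value is missing, and that no DOI contains a `|` character. The `Paper` type and both function names are hypothetical, introduced only for illustration.

```python
from typing import NamedTuple, Optional, Tuple


class Paper(NamedTuple):
    wos_id: str
    pmid: Optional[str]
    doi: Optional[str]
    year: Optional[int]


def parse_paper(field: str) -> Paper:
    """Parse one 'WOS No.|pmid|doi|published year' field."""
    wos_id, pmid, doi, year = field.split("|")
    return Paper(
        wos_id=wos_id,
        pmid=pmid or None,  # assumption: empty string means "missing"
        doi=doi or None,
        year=int(year) if year else None,
    )


def parse_line(line: str) -> Tuple[Paper, Paper]:
    """Split a line into (citing, cited) papers on the tab separator."""
    citing, cited = line.rstrip("\n").split("\t")
    return parse_paper(citing), parse_paper(cited)
```

Keeping the identifiers as strings avoids losing leading zeros or prefixes; only the year is converted to an integer.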

By the way, thank you for the AWS server! It is very helpful!

Yours sincerely, Xin Li 2019-03-08

lucian-whu commented 5 years ago

Sure, I will do that next week. I assume you are using PubMed data as well. We already have a copy on Azure if you do not want to download it again. By the way, please switch to GitHub issues if you have any further requests for your project :P https://github.com/iuni-cadre/Collaborative-projects/issues

Xiaoran

XiaoranYan commented 5 years ago

From: Yan, Xiaoran Sent: December 19, 2018 12:50 To: Li, Xin Subject: Re: WOS and Pubmed

I can certainly help with that once I get back early next year. However, it might be worthwhile to discuss this in more detail before we proceed. From what I have seen, the citation data in PubMed is missing about 40% compared to WoS and MAG combined. We are planning to do a data integration by merging WoS, MAG and PubMed. And what is your plan for distinguishing clinical papers from non-clinical papers? Do you need the MeSH tags of each paper?

Xiaoran

From: Li, Xin Sent: Tuesday, December 18, 2018 12:03 PM To: Yan, Xiaoran Subject: Re: WOS and Pubmed

Thank you very much, Dear Xiaoran.

I need the citation counts of all articles in PubMed. Could you write me a CSV file that includes each paper's PMID, title and its corresponding citation count in WOS?

Have a good day!

Xin Li

From: Yan, Xiaoran Sent: December 19, 2018 0:05 To: Ding, Ying Cc: Li, Xin; Mabry, Patricia L Subject: Re: WOS and Pubmed

Sure. Although it is not officially part of CADRE yet, I did build a Spark database of PubMed for my own research. Let us discuss how I can help with Xin's project.

Xiaoran

On Dec 18, 2018 3:31 AM, "Ding, Ying" dingying@indiana.edu wrote:

Dear Xiaoran,

Xin is working on Pubmed and WoS with the goal of comparing clinical
and non-clinical papers. He needs the citation counts of WoS
articles. I asked him to talk to you. Please try to help him if possible.

thanks and have a good holiday!

Please also include him for our coming follow up meeting.

best
ying

-- 
Ying Ding
Professor of Informatics
Associate Director of Data Science Online Program
School of Informatics and Computing
Indiana University
http://info.slis.indiana.edu/~dingying/

XiaoranYan commented 5 years ago

I can certainly help with that once I get back early next year. However, it might be worthwhile to discuss this in more detail before we proceed. From what I have seen, the citation data in PubMed is missing about 40% compared to WoS and MAG combined. We are planning to do a data integration by merging WoS, MAG and PubMed. And what is your plan for distinguishing clinical papers from non-clinical papers? Do you need the MeSH tags of each paper?

If you want to use our PubMed data, please specify the list of PubMed columns you want (authors, titles, abstracts, etc.). If I remember correctly, you already have a list of PubMed IDs of clinical papers. Do you still need MeSH tags?

If instead you already have a curated PubMed dataset that you want to connect with WoS citations, please let us know. We can upload your data into our cloud for easier communication and access from the notebook environment.

everyxs commented 5 years ago

Hi Xin,

The requested citation table is now available inside your notebook environment. You can find it under

/AzureDownload/PMwosCItations.cxv.gz

You can re-download it by running the AzureBlobTest notebook.
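A minimal sketch for streaming a gzipped citation table without decompressing it to disk first. The thread does not state the delivered file's column layout, so this assumes one tab-separated citing/cited pair per line, as in the original request; the function name is hypothetical.

```python
import gzip
from typing import Iterator, Tuple


def iter_citation_pairs(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (citing, cited) fields from a gzipped, tab-separated file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            # Assumption: exactly one tab separates the two fields.
            citing, cited = line.split("\t", 1)
            yield citing, cited
```

It could be pointed at the path given above, for example `iter_citation_pairs("/AzureDownload/PMwosCItations.cxv.gz")`. Because it is a generator, only one line is held in memory at a time.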

Xiaoran

lucian-whu commented 5 years ago

Dear Xiaoran,

Thank you very much! I have downloaded it and will look into it! Have a good night!

Xin Li

lucian-whu commented 5 years ago

Dear Xiaoran,

I have several questions about the citation file you have written:

(1) Does each paper in the file have a WOS number?

(2) Does each paper in the file have a DOI?

(3) Are there papers in the file that have no PMID?

(4) Are there papers in the file that have no publication year?

Because the file is a big one, I thought it would be better to ask you for this information first. Otherwise I will work it out by myself. Have a good day! Thank you very much!

Yours sincerely, Xin Li 2019-03-26

XiaoranYan commented 5 years ago

(1) Does each paper in the file have WOS number? Yes. I have only extracted WoS papers in our 2017 database that have a unique WoS ID.

(2) Does each paper in the file have DOI number? Yes. The only reliable way we can cross-reference between WoS and PubMed is through the DOI. To match records without DOIs, we would have to design a principled set of matching rules and would need more computational power. We can discuss this if you want more coverage.

(3) are there papers that have no PMID in the file? No. Roughly 20M papers in WoS have a DOI, but I have only extracted the about 11.9M that have DOI matches in the PubMed data (2018).

(4) are there papers that have no publication year in the file? I have not checked this. The pubyear data all comes from WoS; you can check it against the PubMed data for missing or inconsistent values.
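The DOI-based cross-reference described in (2) and (3) can be sketched as a small in-memory join. This is an illustration under stated assumptions, not the job that produced the file: the record shapes `(wos_id, doi)` and `(pmid, doi)`, the prefix list, and both function names are hypothetical. DOIs are compared case-insensitively, which is how DOIs are defined to match.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

# Common ways a DOI is written with a resolver or scheme prefix.
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(doi: Optional[str]) -> Optional[str]:
    """Lowercase a DOI and strip a known prefix; None if missing or empty."""
    if not doi:
        return None
    doi = doi.strip().lower()
    for prefix in _PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi or None


def match_by_doi(
    wos: Iterable[Tuple[str, Optional[str]]],
    pubmed: Iterable[Tuple[str, Optional[str]]],
) -> Iterator[Tuple[str, str]]:
    """Inner-join (wos_id, doi) with (pmid, doi) records on normalized DOI."""
    doi_to_pmid: Dict[str, str] = {}
    for pmid, doi in pubmed:
        key = normalize_doi(doi)
        if key is not None:
            doi_to_pmid.setdefault(key, pmid)  # keep the first PMID per DOI
    for wos_id, doi in wos:
        key = normalize_doi(doi)
        if key in doi_to_pmid:  # records without a DOI never match
            yield wos_id, doi_to_pmid[key]
```

Records without a DOI fall out of the join entirely, which is why matching them would need separate rules (for example on title, year and authors).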

lucian-whu commented 5 years ago

Hi! Dear Xiaoran,

I have checked the citation data. The total number of citing-cited pairs is 11,894,932, but strangely there are only 7,607,845 citing papers in the dataset. This number is much smaller than I expected. So I checked whether every citing or cited paper has a PMID, and the answer is yes.
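A check of this kind can be sketched as a single pass over the pairs, counting rows and distinct identifiers. It assumes each pair is given as two identifier strings and that the distinct IDs fit in memory; the function name is hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def summarize_pairs(pairs: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count citation pairs and the distinct citing and cited identifiers."""
    n_pairs = 0
    citing_ids: Set[str] = set()
    cited_ids: Set[str] = set()
    for citing, cited in pairs:
        n_pairs += 1
        citing_ids.add(citing)
        cited_ids.add(cited)
    return {
        "pairs": n_pairs,
        "citing": len(citing_ids),
        "cited": len(cited_ids),
    }
```

A paper that cites several others appears once in `citing` but contributes one row per reference to `pairs`, so the two numbers are expected to differ.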

I guess we have limited the citations to papers indexed in MEDLINE, which is why the result is very close to the data extracted from PubMed (about 5,700,000 citing papers).

So could you kindly write me a new file that contains the citation pairs of the whole WOS? What I actually want to use is the global citation information in the WOS dataset. A PubMed paper could cite a paper that is not indexed in PubMed (has no PMID), and vice versa. We should include all the papers in the WOS whether or not they have a PMID (DOI). If a paper has no PMID, we can just mark it 'null' or something else.
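A minimal sketch of writing one record with a placeholder for missing identifiers, assuming the pipe-and-tab layout from the original request and the literal string `null` as the marker. The function names are hypothetical.

```python
from typing import Optional

NULL = "null"  # placeholder for a missing value, as suggested above


def format_paper(
    wos_id: str,
    pmid: Optional[str] = None,
    doi: Optional[str] = None,
    year: Optional[int] = None,
) -> str:
    """Render 'WOS No.|pmid|doi|published year', using NULL for gaps."""
    fields = (wos_id, pmid, doi, year)
    return "|".join(NULL if value is None else str(value) for value in fields)


def format_pair(citing: str, cited: str) -> str:
    """Join two rendered papers with a tab, one citation pair per line."""
    return f"{citing}\t{cited}"
```

An explicit marker keeps every line at the same field count, so a reader can split on `|` without special-casing papers that lack a PMID or DOI.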

Thanks, Xin Li 2019/04/18

XiaoranYan commented 5 years ago

So could you kindly write me a new file that contains the citation pairs of the whole WOS?

Sure, but this would be huge. Do you only want the citations originating from PubMed-matched papers (as citing papers)?

We should include all the papers in the WOS whether or not they have a PMID (DOI). If a paper has no PMID, we can just mark it 'null' or something else.

This does not make sense at all. If you are only comparing clinical vs non-clinical papers in PubMed, all other WoS records that do not cite and are not cited by the matched records should not matter, unless you plan to do multi-step citation analysis.

In general, data at this scale is very tricky to deal with, even with our resources. We recommend moving your code to the data, which means downloading might no longer be an efficient approach. Please attend our event next week and we can discuss possibilities moving forward. http://iuni.iu.edu/news/event/39

lucian-whu commented 5 years ago

Sure, but this would be huge. Do you only want the citations originating from PubMed-matched papers (as citing papers)?

No, I also need the information about papers that are not indexed in PubMed.

This does not make sense at all. If you are only comparing clinical vs non-clinical papers in PubMed, all other WoS records that do not cite and are not cited by the matched records should not matter, unless you plan to do multi-step citation analysis.

I am sorry I didn't express my goal clearly. Comparing clinical vs non-clinical papers in PubMed was a very early idea, and I found that someone had already done it. Now, what I want to do with the citation dataset is something like this paper: https://www.nature.com/articles/s41586-019-0941-9?mc_cid=ece727ac75&mc_eid=%5BUNIQID%5D, using the citation data to design indicators for predicting the success of a drug. It is also an early idea, but I believe it is promising.

In general, data at this scale is very tricky to deal with, even with our resources. We recommend moving your code to the data, which means downloading might no longer be an efficient approach.

Yes, of course. It will be about 200 GB by my estimate. I had planned to use something like Lucene or Elasticsearch: index first, then search. It would take only about 300 GB of disk space to store and index it. However, moving the code to the data is also a good choice, I believe.

Thank you so much for the kind reply. I will definitely attend your great events if possible.