Open XiaoranYan opened 5 years ago
Hi Yi,
Does it cover all citing sentences in all citing relationships in all disciplines? Yet, the 03-22 version PaperCitingContext.txt only has ~42 GB. Could you please advise?
No, only 208,839,812 of the 1,399,752,645 (roughly 1/7) have citing context as of 01/25/2019.
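As a quick sanity check on the "roughly 1/7" figure, the ratio of the two counts quoted above can be computed directly. This is plain arithmetic on the numbers in this thread; the variable names are only illustrative.

```python
# Counts quoted above (as of 01/25/2019); variable names are illustrative.
with_context = 208_839_812
total = 1_399_752_645

coverage = with_context / total      # fraction that has citing context
one_in = total / with_context        # "1 in N" form of the same ratio

print(f"coverage = {coverage:.4f}")
print(f"about 1 in {one_in:.1f}")
```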
By the way, I have opened an issue on our CADRE GitHub repo at
https://github.com/iuni-cadre/Collaborative-projects/issues/7
Please redirect all follow-up conversations to our GitHub repo and only use email if privacy is a concern. Please also invite me as a collaborator to your own GitHub repo if there is one for this particular project. You can also check out other people's issues and my responses to their data requests. If you are not familiar with the data, I am always ready to help.
Thanks!
Xiaoran
Thanks, Xiaoran! For these 1/7 covered publications, are they randomly selected from all publications? Or is there any bias in their disciplines, in their published year, and in quality (e.g., number of citations)?
Good question. How about I extract a list of the papers covered by citation context for you, so that you can investigate the sample bias? Once we have a good idea of the bias, we can design a sub-sampling scheme to balance it out.
Do you want a subset of the papers with citation context (computer science, for example) or all of them? What specific features do you want for measuring sampling bias?
Thanks, Xiaoran! On sample bias, I would like to know details about (1) discipline (e.g., whether papers with citation context information are all from CS); (2) journal (e.g., whether papers with citation context information are all from Elsevier); (3) citation count (e.g., whether papers with citation context information are all highly cited); (4) the number of citing sentences (e.g., whether papers WITHOUT citation context information contain a very limited number of citing sentences); and (5) published year (e.g., whether papers with citation context information are published more recently).
Xiaoran, may I kindly ask you to sample a certain number of publications from the 01/25 MAG citingContext table to see whether these biases exist and, if so, to what extent?
Let me know if you have any questions!
I did some quick Spark analysis of the papers with citation context. You can check out the results here.
https://iuni-cadre.github.io/Collaborative-projects/MAGcitingContext/MAGcitingContextPySpark.html
Here is a brief summary:
CS is the discipline with the most citation context, but other fields like biology and medicine have comparable numbers of papers.
IEEE is the top publisher, with Elsevier and Springer in 2nd and 3rd place.
The average citation count of the papers with citation context is 25.5, which is higher than the MAG average of 8.3.
Not sure what you mean by citing sentences. We do not have any data on the content of the papers. The citing sentences are only provided if a paper has citation context. From the data set, it seems that if a paper has citing context, all of its verified citations have context.
More recent years do have higher numbers, but the curve is pretty smooth and it could be attributed to overall growth in academic productivity.
Of course, more careful analysis is required if you really want this for your research. Please let me know if there is a specific subset of the papers you would like to extract, or you can learn to use PySpark to do more analysis on the full data set, which has about 11 million records and totals about 25 GB.
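One way to make the bias check more systematic is to compare, for each category of a feature (field, publisher, publication year), its share among papers with citation context against its share among all papers. The sketch below is plain Python on made-up toy records, not MAG data; the function name `representation_ratio` and the field labels are hypothetical. On the full data set the same counts would more likely come from grouped counts in PySpark.

```python
from collections import Counter

def representation_ratio(all_labels, covered_labels):
    """Return {label: share among covered papers / share among all papers}.

    A ratio above 1 means the label is over-represented among covered papers.
    """
    if not all_labels or not covered_labels:
        raise ValueError("both label lists must be non-empty")
    all_counts = Counter(all_labels)
    covered_counts = Counter(covered_labels)
    n_all = len(all_labels)
    n_cov = len(covered_labels)
    ratios = {}
    for label, count in all_counts.items():
        share_all = count / n_all
        share_cov = covered_counts.get(label, 0) / n_cov
        ratios[label] = share_cov / share_all
    return ratios

# Toy example only: hypothetical labels, not real MAG counts.
all_papers = ["cs"] * 2 + ["biology"] * 4 + ["medicine"] * 4
covered = ["cs"] * 2 + ["biology"] * 1 + ["medicine"] * 1

ratios = representation_ratio(all_papers, covered)
for field, ratio in sorted(ratios.items()):
    print(field, round(ratio, 3))
```

A ratio far from 1 for some category would suggest weighting or sub-sampling on that feature before drawing conclusions from the covered papers.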
Thanks, Xiaoran! Besides these, I hope to know more details on topic journals occurring in this .txt file. However, to do this, we need to join two tables: one is journal.txt in the "mag" folder, and the other is the "table" shown in the second cell of your notebook. Could you please advise how to do this? Thanks!
You can read both tables as DataFrames and then join them, as follows:
df = spark.read.format("parquet").load("/mnt/test/paperContext/*.parquet")
df2 = spark.read.format("csv").option("delimiter", "\t").option("header", "false").load("/mnt/test/mag/journal.txt")
df3 = df.join(df2, (df.XXX == df2.XXX), "inner")
Thanks, Xiaoran!
I tried the code below but got an error on the second line because of a wrong path. I cannot figure out the specific path of journal.txt in my interface. Where can I find the /mnt folder in Azure? And how can I see the .parquet files?
df = spark.read.format("parquet").load("/mnt/test/paperContext/*.parquet")
df2 = spark.read.format("csv").option("delimiter", "\t").option("header", "true").load("/mnt/test/mag/journal.txt")
df3 = df.join(df2, (df['JournalId'] == df2[0]), "inner")
Try
display(dbutils.fs.ls("/mnt/test/mag"))
df = spark.read.format("parquet").load("/mnt/test/paperContext/*.parquet")
df2 = spark.read.format("csv").option("delimiter", "\t").option("header", "false").load("/mnt/test/mag/Journals.txt")
df3 = df.join(df2, (df['JournalId'] == df2[0]), "inner")
df3.show()
Besides using dbutils.fs.ls, you can also view the folder structure using the iunimag Data Lake Analytics web interface (the white lightning icon on your Azure dashboard).
Just go to Data explorer -> iunimag (the green folder above catalog) -> mag-2019-01-25
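Compared with the earlier attempt, the corrected snippet above changes the file name to `Journals.txt` and sets `header` to `"false"`. If the file has no header row, which is what that setting assumes, reading it with a header would consume the first journal as column names and silently drop it from the join. The sketch below illustrates that effect and the inner join on the first column in plain Python with the standard `csv` module, not Spark; the journal rows and IDs are made up, and `read_rows` and `inner_join` are hypothetical helpers.

```python
import csv
import io

# Made-up, header-less, tab-delimited rows standing in for a journals file.
JOURNALS_TSV = "101\tJournal A\n102\tJournal B\n"

def read_rows(text, header):
    """Parse tab-delimited text; if header is True, the first row is consumed as column names."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    return rows[1:] if header else rows

def inner_join(papers, journals):
    """Join (paper_id, journal_id) pairs to journal rows on the first journal column."""
    by_id = {row[0]: row for row in journals}
    return [(pid, jid, by_id[jid][1]) for pid, jid in papers if jid in by_id]

# Hypothetical paper rows: (paper id, journal id). "999" has no matching journal.
papers = [("p1", "101"), ("p2", "102"), ("p3", "999")]

joined_ok = inner_join(papers, read_rows(JOURNALS_TSV, header=False))
joined_bad = inner_join(papers, read_rows(JOURNALS_TSV, header=True))

print(joined_ok)   # both real journals are matched
print(joined_bad)  # the first journal was consumed as a header and is missing
```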
Thanks for your quick response!
In the table "PaperCitationContexts.txt", there are 3 columns, probably referring to paper id, citing sentence id, and citing sentence. Yet, is there any id for the cited references in a citing sentence? For instance, for the first sentence, there are two references contained in the sentence...
No. The citing sentences are not indexed. For complete schemas of MAG's tables, please refer to https://docs.microsoft.com/en-us/academic-services/graph/reference-data-schema
Hi Xiaoran,
I know that the MAG Azure version has citing-sentence information. Does it cover all citing sentences in all citing relationships in all disciplines? Yet, the 03-22 version PaperCitingContext.txt only has ~42 GB. Could you please advise?
Thanks,
Yi