bluekura commented 5 years ago

Memo

Let's keep a running list of papers that we might cite later or that could help our research.
I think it would be good to split the issues by month.

bluekura commented 5 years ago

2018/11/6

Unsupervised Identification of Study Descriptors in Toxicology Research: An Experimental Study

Authors: Drahomira Herrmannova, Steven R. Young, Robert M. Patton, Christopher G. Stahl, Nicole C. Kleinstreuer, Mary S. Wolfe
Abstract: Identifying and extracting data elements such as study descriptors in publication full texts is a critical yet manual and labor-intensive step required in a number of tasks. In this paper we address the question of identifying data elements in an unsupervised manner. Specifically, provided a set of criteria describing specific study parameters, such as species, route of administration, and dosing regimen, we develop an unsupervised approach to identify text segments (sentences) relevant to the criteria. A binary classifier trained to identify publications that met the criteria performs better when trained on the candidate sentences than when trained on sentences randomly picked from the text, supporting the intuition that our method is able to accurately identify study descriptors.
Link: https://arxiv.org/abs/1811.01183

bluekura commented 5 years ago

2018/12/10

Feature Analysis for Assessing the Quality of Wikipedia Articles through Supervised Classification

Authors: Elias Bassani, Marco Viviani
Abstract: Nowadays, thanks to Web 2.0 technologies, people have the possibility to generate and spread contents on different social media in a very easy way. In this context, the evaluation of the quality of the information that is available online is becoming more and more a crucial issue. In fact, a constant flow of contents is generated every day by often unknown sources, which are not certified by traditional authoritative entities. This requires the development of appropriate methodologies that can evaluate in a systematic way these contents, based on 'objective' aspects connected with them. This would help individuals, who nowadays tend to increasingly form their opinions based on what they read online and on social media, to come into contact with information that is actually useful and verified. Wikipedia is nowadays one of the biggest online resources on which users rely as a source of information. The amount of collaboratively generated content that is sent to the online encyclopedia every day can lead to the possible creation of low-quality articles (and, consequently, misinformation) if not properly monitored and revised. For this reason, in this paper, the problem of automatically assessing the quality of Wikipedia articles is considered. In particular, the focus is on the analysis of hand-crafted features that can be employed by supervised machine learning techniques to perform the classification of Wikipedia articles on qualitative bases. With respect to prior literature, a wider set of characteristics connected to Wikipedia articles are taken into account and illustrated in detail. Evaluations are performed by considering a labeled dataset provided in a prior work, and different supervised machine learning algorithms, which produced encouraging results with respect to the considered features.
Link: https://arxiv.org/abs/1812.02655

wsjung77 commented 5 years ago

https://www.nature.com/articles/s41562-018-0488-z?fbclid=IwAR0CLt1KBanRwEQNLWNEjz2j2WvCiVXbbSDzkxLhWyBQojTWnTsYC18Cre4

Early onset of structural inequality in the formation of collaborative knowledge in all Wikimedia projects

bluekura commented 5 years ago

2018/12/22

Intermediacy of publications

Authors: Lovro Šubelj, Ludo Waltman, Vincent Traag, Nees Jan van Eck (for reference, these are big names in scientometrics)
Abstract: Citation networks of scientific publications offer fundamental insights into the structure and development of scientific knowledge. We propose a new measure, called intermediacy, for tracing the historical development of scientific knowledge. Given two publications, an older and a more recent one, intermediacy identifies publications that seem to play a major role in the historical development from the older to the more recent publication. The identified publications are important in connecting the older and the more recent publication in the citation network. After providing a formal definition of intermediacy, we study its mathematical properties. We then present two empirical case studies, one tracing historical developments at the interface between the community detection and the scientometric literature and one examining the development of the literature on peer review. We show both mathematically and empirically how intermediacy differs from main path analysis, which is the most popular approach for tracing historical developments in citation networks. Main path analysis tends to favor longer paths over shorter ones, whereas intermediacy has the opposite tendency. Compared to main path analysis, we conclude that intermediacy offers a more principled approach for tracing the historical development of scientific knowledge.
Link: https://arxiv.org/abs/1812.08259

@balla2081 When we extracted the backbone, we worked on measuring the importance of links. This paper is a method for picking out the "important nodes" between A and B. Please read it closely.
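To make the "important nodes between A and B" idea concrete, here is a rough Monte Carlo sketch. It assumes intermediacy is the probability that a node lies on a path from the source to the target when each link is kept independently with probability `p`; that definition is not spelled out in this thread, so please check it against the paper. The function names, the edge orientation, and the toy graph are all made up for illustration.

```python
import random
from collections import defaultdict


def reachable(adj, start):
    """Return the set of nodes reachable from `start` by following `adj`."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def estimate_intermediacy(edges, source, target, p=0.5, n_samples=2000, seed=0):
    """Estimate how often each node lies on an active source -> target path.

    Assumptions (not taken from the thread): the graph is a DAG, every edge is
    kept independently with probability p, and a node counts as "on a path" if
    it is reachable from `source` and can reach `target` in the sampled graph.
    """
    rng = random.Random(seed)
    nodes = {u for edge in edges for u in edge}
    counts = dict.fromkeys(nodes, 0)
    for _ in range(n_samples):
        fwd, bwd = defaultdict(list), defaultdict(list)
        for u, v in edges:
            if rng.random() < p:  # this edge is active in the current sample
                fwd[u].append(v)
                bwd[v].append(u)
        from_source = reachable(fwd, source)
        to_target = reachable(bwd, target)
        for node in from_source & to_target:
            counts[node] += 1
    return {node: counts[node] / n_samples for node in nodes}


# Toy graph: "a" sits between s and t, "b" is a dead end.
edges = [("s", "a"), ("a", "t"), ("s", "b")]
scores = estimate_intermediacy(edges, "s", "t", p=1.0, n_samples=10)
print(scores["a"], scores["b"])  # 1.0 0.0, since with p=1 every edge is always active
```

With `p` below 1, longer paths are penalized because every extra link has to survive the sampling, which matches the abstract's remark that intermediacy tends to favor shorter paths.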

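@balla2081 @wsjung77 The reason for tracking papers in an issue is, first of all, to make it easy to check new papers as they are posted. Once the issue's period is over, I'd like us to close it and move its contents into the wiki. After that, how about opening a new issue for next January?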

bluekura commented 5 years ago

https://www.sciencedirect.com/science/article/pii/S1751157718303298

bluekura commented 5 years ago

https://arxiv.org/abs/1901.07999

Wikipedia Cultural Diversity Dataset: A Complete Cartography for 300 Language Editions

In this paper we present the Wikipedia Cultural Diversity dataset. For each existing Wikipedia language edition, the dataset contains a classification of the articles that represent its associated cultural context, i.e. all concepts and entities related to the language and to the territories where it is spoken. We describe the methodology we employed to classify articles, and the rich set of features that we defined to feed the classifier, and that are released as part of the dataset. We present several purposes for which we envision the use of this dataset, including detecting, measuring and countering content gaps in the Wikipedia project, and encouraging cross-cultural research in the field of digital humanities.

@balla2081 please check the paper asap.

jisungyoon commented 5 years ago

https://arxiv.org/abs/1901.07999

I think we could use this later when interpreting our results. It could also serve as a reference for the method used to match languages to geolocation, or for the distance between two Wikipedia language editions.

bluekura commented 5 years ago

https://royalsocietypublishing.org/doi/10.1098/rsos.171217 @balla2081 Come to think of it, have you read this paper?

jisungyoon commented 5 years ago

https://royalsocietypublishing.org/doi/10.1098/rsos.171217 @balla2081 Come to think of it, have you read this paper?

Not yet. I just skimmed it, and I'll read it properly tomorrow.

bluekura commented 5 years ago

https://arxiv.org/abs/1902.04298 Please check this paper as well :>

jisungyoon commented 5 years ago

https://arxiv.org/abs/1902.04298 Please check this paper as well :>

Will do :)

jisungyoon commented 5 years ago

https://royalsocietypublishing.org/doi/10.1098/rsos.171217 @balla2081 Come to think of it, have you read this paper?

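Having read the paper, I understand it to be about the growth of Wikipedia. The growth patterns they identified turned out to cluster, and they seem to relate this to the spreading of information. They sample articles common to 26 Wikipedia editions, cluster the editions using each article's DOB (date of birth), and claim that the results are similar. What the paper describes is less a set of clusters than a tiered, nested structure arising from information gaps between the Wikipedia editions.

The weakness of this method is that when many Wikipedia editions (> 26) are analyzed, the number of common articles shrinks, which limits the analysis. I also read it as a story about the transmission of knowledge rather than about the structure of knowledge. It looks like a good paper to cite as prior work, and it should be of some help later when we analyze our results.

The methodology is largely similar. They define a similarity and then apply clustering and MDS; there is no network analysis.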
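The pipeline summarized above (define a similarity between editions from article creation dates, then cluster) might look roughly like the sketch below. The thread does not say which similarity the paper uses, so Spearman rank correlation is only a stand-in here; the `creation_year` numbers are invented, and MDS is left out to keep the example free of dependencies.

```python
from itertools import combinations


def ranks(values):
    """Rank values from 0 to n-1 (ties broken by position; fine for a sketch)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def average_distance(a, b, dist):
    """Average pairwise distance between two clusters."""
    return sum(dist[frozenset((i, j))] for i in a for j in b) / (len(a) * len(b))


def average_linkage(names, dist, n_clusters):
    """Merge the two closest clusters until only n_clusters remain."""
    clusters = [[name] for name in names]
    while len(clusters) > n_clusters:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_distance(pair[0], pair[1], dist))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return clusters


# Hypothetical creation years of five shared articles in four editions.
creation_year = {
    "en": [2001, 2002, 2003, 2004, 2005],
    "de": [2002, 2003, 2004, 2005, 2006],
    "ko": [2008, 2006, 2007, 2005, 2004],
    "ja": [2009, 2007, 2008, 2006, 2005],
}
names = sorted(creation_year)
dist = {
    frozenset((a, b)): 1 - spearman(creation_year[a], creation_year[b])
    for a, b in combinations(names, 2)
}
clusters = average_linkage(names, dist, n_clusters=2)
print(clusters)
```

The limitation mentioned above shows up directly in this setup: every edition needs a creation date for every sampled article, so adding editions shrinks the list of usable articles.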

jisungyoon commented 5 years ago

https://arxiv.org/abs/1902.04298 Please check this paper as well :> Will do :)

This paper describes a way to extract past snapshot networks from the wikidump based on links, so I think it differs from our research. It seems to be simply a paper about data processing.
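For intuition only, rebuilding a past snapshot from link history could be sketched as below. The `(timestamp, source, target, action)` event format is hypothetical and is not the paper's actual data format or procedure.

```python
def snapshot_at(events, when):
    """Return the set of (source, target) links present at time `when`.

    `events` is an iterable of (timestamp, source, target, action) tuples with
    ISO-formatted timestamps and action "add" or "remove" (an assumed format).
    """
    links = set()
    for timestamp, source, target, action in sorted(events):
        if timestamp > when:
            break  # events are sorted, so everything after this is in the future
        if action == "add":
            links.add((source, target))
        elif action == "remove":
            links.discard((source, target))
    return links


events = [
    ("2010-01-01", "A", "B", "add"),
    ("2012-06-01", "A", "C", "add"),
    ("2014-03-01", "A", "B", "remove"),
]
print(snapshot_at(events, "2013-01-01"))
print(snapshot_at(events, "2015-01-01"))
```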

bluekura commented 5 years ago

Analysis of the Wikipedia Network of Mathematicians

https://arxiv.org/abs/1902.07622

Bingsheng Chen, Zhengyu Lin, Tim S. Evans

We look at the network of mathematicians defined by the hyperlinks between their biographies on Wikipedia. We show how to extract this information using three snapshots of the Wikipedia data, taken in 2013, 2017 and 2018. We illustrate how such Wikipedia data can be used by performing a centrality analysis. These measures show that Hilbert and Newton are the most important mathematicians. We use our example to illustrate the strengths and weakness of centrality measures and to show how to provide estimates of the robustness of centrality measurements. In part, we do this by comparison to results from two other sources: an earlier study of biographies on the MacTutor website and a small informal survey of the opinion of mathematics and physics students at Imperial College London.

bluekura commented 5 years ago

https://arxiv.org/abs/1902.11105

@balla2081

jisungyoon commented 5 years ago

https://arxiv.org/abs/1902.11105

@balla2081

I'll read it closely and consider whether to use it. I haven't read it in detail yet, but the complexity looks very high. Thank you.

wsjung77 commented 5 years ago

What about a "Literature review" page in the wiki?

We have some papers to consider, so categorizing them in the wiki would be helpful.

jisungyoon commented 5 years ago

What about a "Literature review" page in the wiki?

We have some papers to consider, so categorizing them in the wiki would be helpful.

Will do. I'll add them to the wiki as soon as they are organized.

jisungyoon commented 5 years ago

https://arxiv.org/pdf/1706.06136.pdf This is a method recommended by Prof. yy. I'll read it and check whether we can use it.

bluekura commented 5 years ago

https://epjdatascience.springeropen.com/articles/10.1140/epjds/s13688-016-0070-8

jisungyoon commented 5 years ago

https://epjdatascience.springeropen.com/articles/10.1140/epjds/s13688-016-0070-8

First, the paper describes its data as co-occurrence data, but in the end it looks like an analysis of language links. The conclusion seems to be that language links are not random and show certain patterns; the question for us is how to remove this effect.

bluekura commented 5 years ago

https://arxiv.org/abs/1903.08597

Please read it at your leisure...

jisungyoon commented 5 years ago

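Saerens,Achbany,Fouss et al_Randomized Shortest-Path Problems_2009.pdf I'm told this is a paper written by Dr. Ilkka, who came to APCTP yesterday to give a talk. I haven't read it closely yet, but Inho recommended it and I thought it was worth considering, so I've uploaded it. I'll read it, summarize it, and share.

It seems to contain a lot of physics, though :( so it may take me a while to read.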

jisungyoon commented 5 years ago

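Saerens,Achbany,Fouss et al_Randomized Shortest-Path Problems_2009.pdf I'm told this is a paper suggested by Dr. Ilkka, who came to APCTP yesterday to give a talk. I haven't read it closely yet, but Inho recommended it and I thought it was worth considering, so I've uploaded it. I'll read it, summarize it, and share.

I heard that this method is currently used in ecology.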

jisungyoon commented 5 years ago

https://arxiv.org/abs/1903.08597

Please read it at your leisure...

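Oh, the dataset they work with is almost the same as ours. The difference is that theirs also includes hyperlinks, redirect relations, and view-count data, haha. It isn't closely related to our current research, but it looks like a good database. (It might be worth thinking about whether there is something we could do with it.)

Among the papers it cites is https://arxiv.org/pdf/1901.09688.pdf

I find this one appealing. It removes anomalous patterns when visualizing each network, so the visualization comes out cleaner. (Visualization isn't its original purpose, but it seems to make structural information such as modularity more visible.) If we want to include nice-looking network figures (in the appendix), it could be worth considering.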

wsjung77 commented 5 years ago

https://arxiv.org/abs/1903.08597 Please read it at your leisure...

At this point, this is competition... We'll have to push harder. Always assume that several competitors are working on any given paper at the same time.

bluekura commented 5 years ago

On the positive side, it means someone else is also interested in what we are doing, and if people are curious about it, there will be people who are glad when we solve the problem :)

bluekura commented 5 years ago

https://www.nature.com/articles/palcomms201541

Please check. The first author does look familiar.

jisungyoon commented 5 years ago

https://www.nature.com/articles/palcomms201541

Please check. The first author does look familiar.

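Checked. They solved it in an interesting way, and I think our structure will probably end up most similar to this paper.

At the end, they ran a regression analysis on z-scores. Since there is prior work that does it this way, I think it would be fine for us to do the same.

Drawing the network against a null model also seems like a good approach. I'll think about a null model.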
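As a starting point for the null-model discussion, here is a minimal sketch of a z-score for the overlap between two sets (say, the topics covered by two language editions) against a random null. The null used here, which redraws both sets uniformly at random while keeping their sizes, is my own assumption and not the null model of the paper above; `overlap_z_score` and the toy sets are hypothetical.

```python
import random
from statistics import mean, pstdev


def overlap_z_score(set_a, set_b, universe, n_null=1000, seed=0):
    """z-score of |A & B| against a size-preserving random null.

    Each null sample redraws both sets uniformly at random from `universe`,
    keeping only their sizes fixed (an assumed, deliberately simple null).
    """
    rng = random.Random(seed)
    observed = len(set_a & set_b)
    pool = sorted(universe)
    null = []
    for _ in range(n_null):
        rand_a = set(rng.sample(pool, len(set_a)))
        rand_b = set(rng.sample(pool, len(set_b)))
        null.append(len(rand_a & rand_b))
    mu, sigma = mean(null), pstdev(null)
    # A zero spread means the null is degenerate, so no z-score is defined.
    return (observed - mu) / sigma if sigma > 0 else float("nan")


universe = range(100)
z_overlapping = overlap_z_score(set(range(0, 20)), set(range(10, 30)), universe)
z_disjoint = overlap_z_score(set(range(0, 20)), set(range(50, 70)), universe)
print(round(z_overlapping, 2), round(z_disjoint, 2))
```

Edges could then be drawn only where the z-score exceeds some threshold, and the z-scores themselves could serve as the variable in a regression, as in the paper.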

jisungyoon commented 5 years ago

https://www.annualreviews.org/doi/pdf/10.1146/annurev.an.15.100186.001115

This is an anthropology paper on language socialization. In short, it is anthropological research arguing that language influences knowledge.

jisungyoon commented 5 years ago

https://www.tandfonline.com/doi/pdf/10.1080/00437956.1980.11435693 language and knowledge

bluekura commented 4 years ago

@jisungyoon

http://maoz.ucdavis.edu/

Could you look into this person and their research?

Networks of Nations: The Evolution, Structure, and Impact of International Networks, 1816-2001. New York: Cambridge University Press, 2010.

They are reportedly known for this book, and

http://maoz.ucdavis.edu/world-language-dataset.html

seeing that they also have data like this, it looks relevant to us somehow. ^^;

jisungyoon commented 4 years ago

Sure, I'll take a look.

jisungyoon / Structure-of-Science

Papers to read (2018/11-2018/12) #1
