howisonlab / softcite-dataset

A gold-standard dataset of software mentions in research publications.

coded_no_mentions not in XML? #660

Closed jameshowison closed 4 years ago

jameshowison commented 4 years ago

We planned to include the identifiers for articles our annotators coded but did not find any software mentions in. They are in the CSV dataset, but they don't seem to be in the XML file?

@kermitt2 any idea? @caifand

I hoped Binder would work to show you, but I'm having trouble with it at this URL:

https://mybinder.org/v2/gh/howisonlab/softcite-dataset/master?filepath=code%2FparseTEI.Rmd?urlpath=rstudio

library(xml2)
library(magrittr)  # provides the %>% pipe used below

# Count the articles in the TEI corpus by collecting every fileDesc xml:id.
read_xml("https://raw.githubusercontent.com/ourresearch/software-mentions/master/resources/dataset/software/corpus/all_clean_post_processed.tei.xml") %>% 
  xml_find_all(ns = c("tei" = "http://www.tei-c.org/ns/1.0"),
               xpath = "//tei:fileDesc/@xml:id") %>% 
  xml_text() %>% 
  length()

It's just an XPath query, but it only finds 1247 articles. We have over 5000 articles in the csv_dataset files.

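library(readr)  # read_csv()
library(dplyr)  # filter(), select(), distinct(), group_by(), tally(), %>%

articles <- read_csv("~/softcite-dataset/data/csv_dataset/softcite_articles.csv")

# One row per article (coder dropped), counted by set and by whether any mention was found.
articles %>% 
  filter(article_set != "training_article") %>%
  select(-coder) %>% 
  distinct() %>% 
  group_by(article_set, no_selections_found) %>% 
  tally()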

gives:

article_set     no_selections_found     n
bio_article     FALSE                   1
econ_article    FALSE                 225
econ_article    TRUE                 2233
pmc_article     FALSE                1360
pmc_article     TRUE                 1197
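
As a quick check on the "over 5000" figure, the tally above can be summed directly. This is only a sketch that re-types the five counts from the table by hand; the `tally_tbl` name is made up for illustration.

```r
# Counts copied by hand from the tally shown above.
tally_tbl <- data.frame(
  article_set = c("bio_article", "econ_article", "econ_article",
                  "pmc_article", "pmc_article"),
  no_selections_found = c(FALSE, FALSE, TRUE, FALSE, TRUE),
  n = c(1, 225, 2233, 1360, 1197)
)

# Total number of distinct articles that were coded.
total_articles <- sum(tally_tbl$n)

# Split the total by whether the annotators found any mention.
by_flag <- tapply(tally_tbl$n, tally_tbl$no_selections_found, sum)

print(total_articles)
print(by_flag)
```

Under these numbers the articles flagged with no selections found are well over half of the set, which is why their absence from the XML is so visible in the count.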

So I think the XML file should include a fileDesc for each of the thousands of articles that were read but in which no mentions were found.
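
One way to see exactly which articles are missing is to compare the identifiers on both sides. The sketch below uses a tiny made-up TEI corpus and a made-up article table instead of the real files; the `article` column name and the `a1` to `a4` identifiers are assumptions for illustration, not the dataset's actual schema.

```r
library(xml2)

# Tiny made-up TEI corpus: two documents, each identified by fileDesc/@xml:id.
tei <- read_xml('<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <TEI><teiHeader><fileDesc xml:id="a1"/></teiHeader></TEI>
  <TEI><teiHeader><fileDesc xml:id="a2"/></teiHeader></TEI>
</teiCorpus>')

# Same XPath as in the query above, applied to the toy corpus.
xml_ids <- xml_text(xml_find_all(
  tei, "//tei:fileDesc/@xml:id",
  ns = c(tei = "http://www.tei-c.org/ns/1.0")
))

# Hypothetical stand-in for the CSV article table; `article` is an assumed column name.
csv_articles <- data.frame(
  article = c("a1", "a2", "a3", "a4"),
  no_selections_found = c(FALSE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Articles that were coded (present in the CSV) but have no fileDesc in the XML.
missing_from_xml <- setdiff(csv_articles$article, xml_ids)
print(missing_from_xml)
```

With the real files, the same `setdiff()` would need whichever CSV column actually holds the identifier used in `xml:id`.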

kermitt2 commented 4 years ago

I've reviewed the final XML file and you are right: it only contains a few documents without any annotations. These documents are the ones I checked manually, and in total we have 1247 manually checked articles.

I didn't look at or validate the other documents without any annotations, so they are not in the compiled XML. Sorry for my wrong indication; this is a bit old now and I had forgotten that aspect (though it does involve a lot of articles!).

jameshowison commented 4 years ago

I wonder if those that are left in the XML are those that the curation process moved from having annotations to not having any?


kermitt2 commented 4 years ago

yes exactly !

jameshowison commented 4 years ago

Ah, ok. So, two options.

  1. Add all articles from the csv dataset into the XML.
  2. Move all articles now without mentions into a second XML file.

Thoughts?
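
For option 2, the split itself could be expressed as a pair of XPath queries. This is a rough sketch on a made-up two-document corpus; it assumes mentions are marked with `rs` elements inside each `TEI` entry, which should be checked against the real file, and `ids_of` is a helper invented here.

```r
library(xml2)

ns <- c(tei = "http://www.tei-c.org/ns/1.0")

# Made-up corpus: a1 has one annotated mention, a2 has none.
corpus <- read_xml('<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <TEI><teiHeader><fileDesc xml:id="a1"/></teiHeader>
    <text><body><p>We used <rs type="software">SPSS</rs>.</p></body></text></TEI>
  <TEI><teiHeader><fileDesc xml:id="a2"/></teiHeader>
    <text><body><p>No software here.</p></body></text></TEI>
</teiCorpus>')

# Partition the TEI entries by whether they contain any rs annotation (assumed markup).
with_mentions    <- xml_find_all(corpus, "//tei:TEI[.//tei:rs]", ns = ns)
without_mentions <- xml_find_all(corpus, "//tei:TEI[not(.//tei:rs)]", ns = ns)

# Hypothetical helper: collect the fileDesc identifiers of a set of TEI entries.
ids_of <- function(nodes) {
  xml_text(xml_find_all(nodes, ".//tei:fileDesc/@xml:id", ns = ns))
}

print(ids_of(with_mentions))
print(ids_of(without_mentions))
```

The second node set is what would go into a separate file under option 2; under option 1 it would instead be extended with entries for the articles that are only in the CSV.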


kermitt2 commented 4 years ago

I guess about 1:

  • con -> the articles not in the revised compiled XML have not been double-checked like those present
  • pro -> users of the XML file would normally focus on the text available in it, in particular as training data, so the presence or absence of references to documents without any mentions would have no particular impact, and it does not seem useful to provide multiple cross-agreement and review on the missing articles

So on my side, I would not see any particular issue with having everything in one single XML file.

jameshowison commented 4 years ago

Ah, yes, I see that in the first con. That makes me lean towards the option of a second XML file.

Has all the text been checked in the no-mention articles in the current XML file, or just the paragraphs that once had a mention but now do not, plus any strings that match any other mention in the set?

jameshowison commented 4 years ago

Resolved via phone, documented in JASIST paper.