Closed by jameshowison 4 years ago
I've reviewed the final XML file and you are right: it only contains a few documents without any annotations. These documents are the ones I checked manually, and we have 1247 manually checked articles in total.
I didn't look at or validate the other documents without any annotations, so they are not in the compiled XML. Sorry for my wrong indication; it's a bit old now and I had forgotten that aspect (though it does involve a lot of articles!).
I wonder if those that are left in the XML are those that the curation process moved from having annotations to not having any?
Yes, exactly!
Ah, ok. So, two options.
Thoughts?
I guess, about option 1:
- con: the articles not in the revised compiled XML have not been double-checked like those present
- pro: users of the XML file would normally focus on the text available in the XML file, in particular as training data, so whether or not documents without any mentions are referenced would have no particular impact, and it does not seem useful to provide multiple cross-agreement and review for the missing articles
So on my side, I see no particular issue with having everything in one single XML file.
Ah, yes, I see that in the first con. That makes me lean towards the option of a second XML file.
Has all the text been checked in the no-mention articles in the current XML file, or just the paragraphs that once had a mention but no longer do, plus any strings that match any other mention in the set?
Resolved via phone, documented in JASIST paper.
We planned to include the identifiers for articles our annotators coded but did not find any software mentions in. They are in the CSV dataset, but they don't seem to be in the XML file?
@kermit2 any idea? @caifand
I hoped Binder would work to show you, but I'm having trouble with it at this URL:
https://mybinder.org/v2/gh/howisonlab/softcite-dataset/master?filepath=code%2FparseTEI.Rmd?urlpath=rstudio
It's just xpath but it's showing 1247 articles. We have over 5000 articles in the csv_dataset files.
gives:
So I think the XML file should have a fileDesc for each of the thousands of articles that were read but in which no mentions were found.
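For anyone wanting to reproduce the comparison without Binder, here is a minimal Python sketch of the same idea: count the `fileDesc` elements in the TEI corpus and list the article identifiers that appear in the CSV but not in the XML. It is not the code from `parseTEI.Rmd`. The `xml:id` attribute on `fileDesc`, the CSV column name `article`, and the inline sample data are all assumptions for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def article_ids_in_tei(xml_text):
    """Collect identifiers from every fileDesc element (assumes an xml:id attribute)."""
    root = ET.fromstring(xml_text)
    return {
        fd.get(XML_ID)
        for fd in root.iter(f"{{{TEI_NS}}}fileDesc")
        if fd.get(XML_ID)
    }


def article_ids_in_csv(csv_text, column="article"):
    """Collect identifiers from one CSV column (the column name is hypothetical)."""
    return {row[column] for row in csv.DictReader(io.StringIO(csv_text))}


# Tiny made-up samples standing in for the real corpus and CSV files.
SAMPLE_TEI = """<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <TEI><teiHeader><fileDesc xml:id="doc1"/></teiHeader><text/></TEI>
  <TEI><teiHeader><fileDesc xml:id="doc2"/></teiHeader><text/></TEI>
</teiCorpus>"""

SAMPLE_CSV = "article\ndoc1\ndoc2\ndoc3\n"

in_xml = article_ids_in_tei(SAMPLE_TEI)
in_csv = article_ids_in_csv(SAMPLE_CSV)

# Articles that were coded (present in the CSV) but have no fileDesc in the XML.
missing = sorted(in_csv - in_xml)
print(len(in_xml), len(in_csv), missing)
```

With the real files, the two sample strings would be replaced by the contents of the compiled XML and the relevant `csv_dataset` file, and the column name adjusted to match.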