Open jameshowison opened 4 years ago
Sorry clearly an error! Probably something I didn't clean correctly.
The attribute @id should have been deleted here. I don't see another occurrence of an @id attribute, but if there are more, they should be removed too.
Ok, best to just do this manually in the xml file? (which we've added to the softcite-dataset repo, btw).
On Fri, Apr 10, 2020 at 2:08 PM Patrice Lopez notifications@github.com wrote:
Sorry clearly an error! Probably something I didn't clean correctly. The attribute @id should have been deleted here. I don't see another occurrence of an @id attribute, but if there are more, they should be removed too.
If the source file changes I can link it back to here.
I've updated in the original repo https://github.com/ourresearch/software-mentions/commit/eb683b63df924b8da66c036a62d511688f621f0c
Spotted some rs tags without the attribute publisher but with creator. Another edit is needed :) @kermitt2
id,publisher,creator
a2008-39-NAT_BIOTECHNOL-software-9,NA,Molecular Devices
a2008-39-NAT_BIOTECHNOL-software-5,NA,Chang Bioscience
a2008-39-NAT_BIOTECHNOL-software-0,NA,National Institutes of Health
a2008-39-NAT_BIOTECHNOL-software-7,NA,Chang Bioscience
a2008-39-NAT_BIOTECHNOL-software-8,NA,Chang Bioscience
a2007-48-UNDERSEA_HYPERBAR_M-software-0,NA,Microsoft Corp
Sorry, there is still another one: at line 13505:
<rs resp="#curator" type="software" xml:id="PMC3481138-software-1">Pictar</rs> (<rs corresp="#PMC3481138-software-0" resp="#curator" type="url">pictar.mdc-berlin.de</rs>))
Here the corresp attribute of the url tag does not match the xml:id of the linked software tag.
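A check for this kind of mismatch could be sketched as follows. This is only a minimal illustration using Python's standard library, not the pipeline's own validation; the function names are hypothetical, and the sample is the fragment quoted above wrapped in a `<p>` element so that it parses on its own.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def dangling_corresp(root):
    """Return corresp values on <rs> elements that match no software xml:id."""
    rs_elements = [el for el in root.iter() if local_name(el.tag) == "rs"]
    software_ids = {
        el.get(XML_ID)
        for el in rs_elements
        if el.get("type") == "software" and el.get(XML_ID)
    }
    return [
        el.get("corresp")
        for el in rs_elements
        if el.get("corresp") and el.get("corresp").lstrip("#") not in software_ids
    ]


# The fragment reported at line 13505, wrapped in <p> for parsing.
sample = ET.fromstring(
    '<p><rs resp="#curator" type="software" xml:id="PMC3481138-software-1">Pictar</rs> '
    '(<rs corresp="#PMC3481138-software-0" resp="#curator" type="url">'
    "pictar.mdc-berlin.de</rs>)</p>"
)
print(dangling_corresp(sample))
```

On the full corpus one would presumably pass `ET.parse(path).getroot()` instead of the sample, and might restrict the lookup to the enclosing article rather than the whole file.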
Those are a bit trivial, thanks for fixing them! @kermitt2
Good catch @caifand ! This is corrected with https://github.com/ourresearch/software-mentions/commit/ebf57663a794b1f77a1904f49d0768aa4b6f65f6
Some mismatched fields in rs tags:
xml:id,software,version,publisher,url,description
PMC4475901-software-2,MASCOT,NA,Matrix Science Ltd,NA,publisher rs tag is associated with wrong xml:id
PMC4938145-software-0,Nanoscope,6.13R1,Digital Instruments,NA,publisher rs tag is associated with wrong xml:id
PMC3123123-software-0,R,2.12.0,NA,NA,version rs tag is associated with wrong xml:id
PMC4589266-software-2,TBSS,1.2,http://fsl.fmrib.ox.ac.uk/ fsl/fslwiki/TBSS/UserGuide,url rs tag type="version"
PMC5018359-software-1,Stata,12,StataCorp,NA,repeated xml:id in the same para
@kermitt2 Thanks!
Thank you, good catch !
I have updated the reviewed TEI export -> https://github.com/ourresearch/software-mentions/commit/bdbb49ffb4951f0c6a5447293196b2217ee094ef
Sorry for the slow correction, there has been a lot of long internet downtime over the last few days.
@kermitt2 Thanks! These are indeed small pieces of things. I am happy to fix those if it won't break the file.
Sure, I've invited you to the ourresearch/software-mentions repo so that you can make corrections directly in the file, thank you!
Hi @kermitt2, I found that in all_clean_post_processed_with_no_mention.tei.xml, the TEI element marking up article titles has two types: </titleStatement> & </titleStmt>. It seems to me that all the articles with mentions have </titleStatement> while all articles with zero mentions have </titleStmt>.
In all_clean_post_processed.tei.xml, they are all </titleStatement>, except in the top block that holds the metadata for the whole corpus (i.e., there is just one </titleStmt>, in the very first </fileDesc> block).
@kermitt2 Should we seek to uniformize all of them? If it is just a legacy thing and does not involve changes in the pipeline, I can help with the manual edits.
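The observation above can be reproduced with a small count of the two element names. This is a minimal sketch with Python's standard library; the function name and the tiny sample document are hypothetical, and element names are compared without their namespace in case the corpus declares one.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_title_elements(root):
    """Count <titleStmt> and <titleStatement> elements, ignoring namespaces."""
    counts = Counter()
    for el in root.iter():
        name = el.tag.rsplit("}", 1)[-1]
        if name in ("titleStmt", "titleStatement"):
            counts[name] += 1
    return counts


# Hypothetical miniature corpus: one header of each kind.
sample = ET.fromstring(
    "<teiCorpus>"
    "<teiHeader><fileDesc><titleStmt/></fileDesc></teiHeader>"
    "<TEI><teiHeader><fileDesc><titleStatement/></fileDesc></teiHeader></TEI>"
    "</teiCorpus>"
)
print(count_title_elements(sample))
```

A non-zero count for `titleStatement` on the real file would indicate headers that still need to be renamed.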
Thank you @caifand, good catch !
It should be <titleStmt> everywhere indeed. I corrected it at some point in the program that generates the XML, but it was still incorrect in the file I edited for the manual corrections, so it was still propagating to the final version.
This is fixed with commit https://github.com/ourresearch/software-mentions/commit/da7f761b3b71d27fbfb58ecdacea2d26a5d5b855
I've found annotated mentions from early-stage training articles in the csv files. Though training articles are marked in softcite_articles.csv, some are still missing from softcite_articles.csv because their RDF annotations use schemas that differ from our established coding scheme. Their detailed annotations nevertheless went into the other csv files. I suspect those different schemas were used in early training, when our coding scheme hadn't stabilized yet? @jameshowison Just checking with you to see if I am correct.
I've removed all the annotations from training articles from all the csv files now. So, @kermitt2 , sorry that I have to ask for another corpus run :)
(I did this check since I've found there are still some training article annotations in TEI file)
I've regenerated the tei corpus with the updated csv: https://github.com/ourresearch/software-mentions/blob/master/resources/dataset/software/corpus/softcite_corpus.tei.xml
These articles have disappeared:
Hi @kermitt2
I've found in the regenerated tei that some xml:id values of <rs type="software"> are not unique identifiers. For example, there are 28 software mentions sharing the same xml:id (see this file). Although some of these software mentions have the same name, I checked that they occur at different locations within paragraphs.
I have a list of <rs type="software"> identifiers that I found are not unique. Could you help check it and see if we can correct them? Thanks :)
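A list like this can be produced by counting how often each xml:id occurs on software mentions. The following is a minimal sketch with Python's standard library; the function name and the sample identifiers are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def duplicate_software_ids(root):
    """Map each xml:id used by more than one <rs type="software"> to its count."""
    ids = [
        el.get(XML_ID)
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "rs"
        and el.get("type") == "software"
        and el.get(XML_ID)
    ]
    return {value: n for value, n in Counter(ids).items() if n > 1}


# Hypothetical paragraph in which the numbering was not incremented.
sample = ET.fromstring(
    "<p>"
    '<rs type="software" xml:id="X-software-0">A</rs>'
    '<rs type="software" xml:id="X-software-0">B</rs>'
    '<rs type="software" xml:id="X-software-1">C</rs>'
    "</p>"
)
print(duplicate_software_ids(sample))
```

Note that a non-validating parser such as this one does not reject repeated xml:id values by itself, which is why an explicit count is needed.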
Thank you @caifand !
I fixed this bug (commit aa802c48130ba9fb46561e58d87d2c66eb7846f8) - the numbering was not incremented :/ I also fixed an issue with the xml:id values built from DOIs: they were not valid XML NCNames.
So the new final https://github.com/ourresearch/software-mentions/blob/master/resources/dataset/software/corpus/softcite_corpus.tei.xml should now be well-formed XML.
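For readers unfamiliar with the NCName constraint: an xml:id must start with a letter or underscore and cannot contain characters such as `%`. The sketch below tests identifiers against a deliberately simplified, ASCII-only approximation of the NCName rule (the real production also admits many non-ASCII characters); the function name is hypothetical and the identifiers are ones quoted in this thread.

```python
import re

# Simplified, ASCII-only approximation of the XML NCName production:
# a letter or underscore, followed by letters, digits, '.', '_' or '-'.
NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def is_ascii_ncname(value):
    """Return True if value matches the simplified NCName pattern."""
    return NCNAME_ASCII.fullmatch(value) is not None


for candidate in (
    "10.1257%2Fjep.27.1.3-software-0",   # starts with a digit, contains '%'
    "_10.1257_2Fjep.27.1.3-software-0",  # underscore prefix, no '%'
    "PMC2881384-software-29",
):
    print(candidate, is_ascii_ncname(candidate))
```

This only illustrates why DOI-based identifiers needed rewriting; it says nothing about how the generating program performs that rewriting.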
@kermitt2 That's great! I see that there are some changes in the xml:id, esp. for <rs> in econ articles. Looks good! But it seems these xml:id values in <rs type="software"> have not been fully aligned with the corresp in the software attribute tags, esp. in econ articles and <ab> elements.
For example, in the econ article <fileDesc xml:id="_10.1257_2Fjep.27.1.3">, we have <rs type="software" xml:id="_10.1257_2Fjep.27.1.3-software-0"> but <rs corresp="#10.1257%2Fjep.27.1.3-software-0">.
In PMC article <fileDesc xml:id="PMC2881384"> we have <rs id="PMC2881384-software-29" type="software"> in <ab> but correspondingly <rs corresp="#software-29" resp="#annotator2" type="version">.
It seems to me that there are multiple such cases, but they do not affect all mentions. Let me know if providing more information will be helpful since we have so many layers in the corpus now :)
Ah yes, I forgot to modify the corresp in the same way! I'll fix it today.
Well done @caifand for finding these errors (and sorry for missing them on my side!), I fixed the wrong @corresp for the two cases you reported, commit ecd484bbeb013be795a10f4eb38fd3c5c9d06807.
Hi @kermitt2
A few days ago I made some direct edits to the corpus. Those edits fixed the following errors: (1) a corresp attribute in <rs type="software"> tags (in curated tags); (2) an xml:id in <rs type="software"> that does not correspond to the corresp in the other associated <rs> tags (around the same software mention); (3) an unexpected # preceding the xml:id in <rs type="software"> tags; (4) some <rs> xml:id values in <p> repeated in <ab>; (5) in one case, an unexpected id occurring together with xml:id in <rs resp="#curator" type="software" xml:id="dd622d1559-software-simple-0">. We have manually fixed this last one several times.
I am not sure if any of those should involve a pipeline-level change. If you think it's concerning I can chat in more detail :)
Besides, some corresp values in <rs type="publisher"> and <rs type="URL"> are missing a dash - between the hash and the string literal software. I attached two files below. Could you help check those?
url-corresp-missing-dash-2020-07-11.csv
publisher-corresp-missing-dash-2020-07-11.csv
Thanks!
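A pattern search could flag such values automatically. The sketch below is only an illustration: it assumes the faulty values look like an identifier running straight into the word `software` (for example `#abc123software-0`), which is an assumption about their shape and is not confirmed by the thread; the function name and sample values are made up.

```python
import re

# Assumption: a missing dash shows up as a letter or digit directly
# followed by "software-" inside the corresp value.
MISSING_DASH = re.compile(r"[A-Za-z0-9]software-")


def missing_dash(corresp_values):
    """Return the corresp values that appear to lack the dash before 'software'."""
    return [value for value in corresp_values if MISSING_DASH.search(value)]


# Hypothetical values: only the first one lacks the dash.
examples = ["#abc123software-0", "#abc123-software-0", "#PMC0000000-software-1"]
print(missing_dash(examples))
```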
Hi @caifand !
I've corrected the missing dash for the cases in the two reported files (some publisher and url identifiers) - indeed, I forgot the dash when writing the identifier in one place.
The corpus file has been regenerated using the corrected serialization. As far as I can tell, all your corrections have been injected back and are in the final assembled TEI XML file, including the cases you described. Maybe you could double check softcite_corpus.tei.xml quickly just to be sure.
We're parsing the tei xml file and ran into a minor issue. @kermit2 any thoughts?
The rs tag on line 9804 seems to have both an id and an xml:id while the others don't?
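To see whether other tags have the same problem, one could scan for rs elements that carry both attributes. This is a minimal sketch with Python's standard library; the function name and sample fragment are hypothetical, and ElementTree does not report line numbers, so matches are identified by their attributes instead.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def rs_with_both_ids(root):
    """Return (id, xml:id) pairs for <rs> elements that carry both attributes."""
    return [
        (el.get("id"), el.get(XML_ID))
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "rs"
        and el.get("id") is not None
        and el.get(XML_ID) is not None
    ]


# Hypothetical fragment: only the first mention has both attributes.
sample = ET.fromstring(
    "<p>"
    '<rs id="a-software-0" xml:id="a-software-0" type="software">X</rs>'
    '<rs xml:id="a-software-1" type="software">Y</rs>'
    "</p>"
)
print(rs_with_both_ids(sample))
```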