howisonlab / softcite-dataset

A gold-standard dataset of software mentions in research publications.

exported tei xml dataset questions #657

Open jameshowison opened 4 years ago

jameshowison commented 4 years ago

We're parsing the TEI XML file and ran into a minor issue. @kermitt2 any thoughts?

The rs tag on line 9804 seems to have both an id and an xml:id while the others don't?

<rs id="10.1257%2Faer.20150592-software-0" resp="#curator" type="software" xml:id="10.1257%2Faer.20150592-software-simple-0">Google Scholar</rs>

kermitt2 commented 4 years ago

Sorry clearly an error! Probably something I didn't clean correctly. The attribute @id should have been deleted here. I don't see another occurrence of an @id attribute, but if there are more, they should be removed too.
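
A minimal sketch of how the remaining occurrences could be located, assuming the export parses with Python's standard `xml.etree.ElementTree`. The helper name `rs_with_stray_id` and the inline sample are illustrative only and are not part of the softcite pipeline:

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def local_name(tag):
    """Strip any '{namespace}' prefix so the check works with or without the TEI namespace."""
    return tag.rsplit("}", 1)[-1]


def rs_with_stray_id(root):
    """Return the xml:id of every <rs> element that also carries a plain `id` attribute."""
    return [
        el.get(XML_ID)
        for el in root.iter()
        if local_name(el.tag) == "rs" and "id" in el.attrib
    ]


# Sample built from the two <rs> elements quoted in this thread.
sample = """<p>
<rs id="10.1257%2Faer.20150592-software-0" resp="#curator" type="software" xml:id="10.1257%2Faer.20150592-software-simple-0">Google Scholar</rs>
<rs resp="#curator" type="software" xml:id="PMC3481138-software-1">Pictar</rs>
</p>"""

stray = rs_with_stray_id(ET.fromstring(sample))
print(stray)
```

On the real file, `ET.parse(path).getroot()` would replace `ET.fromstring(sample)`.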

jameshowison commented 4 years ago

Ok, best to just do this manually in the xml file? (which we've added to the softcite-dataset repo, btw).

On Fri, Apr 10, 2020 at 2:08 PM Patrice Lopez notifications@github.com wrote:

Sorry clearly an error! Probably something I didn't clean correctly. The attribute @id should have been deleted here. I don't see another occurrence of an @id attribute, but if there are more, they should be removed too.

caifand commented 4 years ago

If the source file changes I can link it back to here.

kermitt2 commented 4 years ago

I've updated in the original repo https://github.com/ourresearch/software-mentions/commit/eb683b63df924b8da66c036a62d511688f621f0c

caifand commented 4 years ago

Spotted some rs tags that have a creator attribute but no publisher attribute. Another edit is needed :) @kermitt2

id,publisher,creator
a2008-39-NAT_BIOTECHNOL-software-9,NA,Molecular Devices
a2008-39-NAT_BIOTECHNOL-software-5,NA,Chang Bioscience
a2008-39-NAT_BIOTECHNOL-software-0,NA,National Institutes of Health
a2008-39-NAT_BIOTECHNOL-software-7,NA,Chang Bioscience
a2008-39-NAT_BIOTECHNOL-software-8,NA,Chang Bioscience
a2007-48-UNDERSEA_HYPERBAR_M-software-0,NA,Microsoft Corp

caifand commented 4 years ago

Sorry, there is still another one: at line 13505:

<rs resp="#curator" type="software" xml:id="PMC3481138-software-1">Pictar</rs> (<rs corresp="#PMC3481138-software-0" resp="#curator" type="url">pictar.mdc-berlin.de</rs>))

Here the corresp attribute of the url tag does not match the xml:id of the linked software tag.

Those are a bit trivial, thanks for fixing them! @kermitt2
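
A check for this kind of mismatch could look roughly like the sketch below. It assumes every corresp value is a single `#`-prefixed pointer and compares it against all xml:id values in the parsed tree; on the full corpus the comparison would probably need to be scoped per article. The function name `dangling_corresp` is hypothetical:

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def dangling_corresp(root):
    """Return every corresp value that does not point at an xml:id present in the tree."""
    known_ids = {el.get(XML_ID) for el in root.iter() if el.get(XML_ID) is not None}
    dangling = []
    for el in root.iter():
        corresp = el.get("corresp")
        if corresp is None:
            continue
        # Drop the leading '#' of the pointer before comparing with xml:id values.
        target = corresp[1:] if corresp.startswith("#") else corresp
        if target not in known_ids:
            dangling.append(corresp)
    return dangling


# The mismatching pair reported at line 13505, wrapped in a <p> for parsing.
sample = (
    '<p><rs resp="#curator" type="software" xml:id="PMC3481138-software-1">Pictar</rs> '
    '(<rs corresp="#PMC3481138-software-0" resp="#curator" type="url">pictar.mdc-berlin.de</rs>)</p>'
)

problems = dangling_corresp(ET.fromstring(sample))
print(problems)
```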

kermitt2 commented 4 years ago

Good catch @caifand ! This is corrected with https://github.com/ourresearch/software-mentions/commit/ebf57663a794b1f77a1904f49d0768aa4b6f65f6

caifand commented 4 years ago

Some mismatched fields in rs tags:

xml:id,software,version,publisher,url,description
PMC4475901-software-2,MASCOT,NA,Matrix Science Ltd,NA,publisher rs tag is associated with wrong xml:id
PMC4938145-software-0,Nanoscope,6.13R1,Digital Instruments,NA,publisher rs tag is associated with wrong xml:id
PMC3123123-software-0,R,2.12.0,NA,NA,version rs tag is associated with wrong xml:id
PMC4589266-software-2,TBSS,1.2,http://fsl.fmrib.ox.ac.uk/ fsl/fslwiki/TBSS/UserGuide,url rs tag type="version"
PMC5018359-software-1,Stata,12,StataCorp,NA,repeated xml:id in the same para

@kermitt2 Thanks!

kermitt2 commented 4 years ago

Thank you, good catch !

I have updated the reviewed TEI export -> https://github.com/ourresearch/software-mentions/commit/bdbb49ffb4951f0c6a5447293196b2217ee094ef

Sorry for the slow correction, there has been a lot of long internet downtime over the last few days.

caifand commented 4 years ago

@kermitt2 Thanks! These are indeed small things. I am happy to fix them myself if it won't break the file.

kermitt2 commented 4 years ago

Sure, I've invited you to the ourresearch/software-mentions repo so that you can make corrections directly in the file, thank you!

caifand commented 4 years ago

Hi @kermitt2 , I found that in all_clean_post_processed_with_no_mention.tei.xml, the TEI element marking up article titles has two types: </titleStatement> & </titleStmt>. It seems to me that all the articles with mentions have </titleStatement> while all articles with zero mentions have </titleStmt>.

In all_clean_post_processed.tei.xml, every article uses </titleStatement>; the only </titleStmt> is in the very first </fileDesc> block, which holds the metadata for the whole corpus.

@kermitt2 Should we make them all uniform? If it is just a legacy thing and does not involve changes in the pipeline, I can help with the manual edits.

kermitt2 commented 4 years ago

Thank you @caifand, good catch! It should indeed be <titleStmt> everywhere. I corrected it at some point in the program that generates the XML, but it was still incorrect in the file I edited for the manual corrections, so it kept propagating to the final version. This is fixed with commit https://github.com/ourresearch/software-mentions/commit/da7f761b3b71d27fbfb58ecdacea2d26a5d5b855

caifand commented 4 years ago

I've found annotated mentions from early-stage training articles in the csv files. Training articles are marked in softcite_articles.csv, but some are missing from it because their RDF annotations use schemas that differ from our established coding scheme, even though their detailed annotations went into the other csv files. I suspect those different schemas were used in early training, when our coding scheme hadn't stabilized yet? @jameshowison Just checking with you to see if I am correct.

I've removed all the annotations from training articles from all the csv files now. So, @kermitt2 , sorry that I have to ask for another corpus run :)

(I did this check since I've found there are still some training article annotations in TEI file)

kermitt2 commented 4 years ago

I've regenerated the tei corpus with the updated csv: https://github.com/ourresearch/software-mentions/blob/master/resources/dataset/software/corpus/softcite_corpus.tei.xml

These articles have disappeared:

caifand commented 4 years ago

Hi @kermitt2

I've found in the regenerated TEI that some xml:id values of <rs type="software"> are not unique identifiers. For example, there are 28 software mentions sharing the same xml:id (see this file). Some of these software mentions have the same name, but I checked that they appear at different locations within paragraphs.

I have a list of <rs type="software"> identifiers that I found are not unique. Could you help check it and see if we can correct them? Thanks :)
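
For reference, a list like this can be produced with a short script along the following lines. This is only a sketch: it counts xml:id values across the whole parsed tree, and the identifiers in the sample are made up for illustration:

```python
from collections import Counter
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def duplicate_xml_ids(root):
    """Map each xml:id that occurs more than once to its number of occurrences."""
    counts = Counter(
        el.get(XML_ID) for el in root.iter() if el.get(XML_ID) is not None
    )
    return {xml_id: n for xml_id, n in counts.items() if n > 1}


# Hypothetical sample: two mentions were given the same identifier.
sample = """<p>
<rs type="software" xml:id="doc1-software-0">R</rs>
<rs type="software" xml:id="doc1-software-0">Stata</rs>
<rs type="software" xml:id="doc1-software-1">R</rs>
</p>"""

duplicates = duplicate_xml_ids(ET.fromstring(sample))
print(duplicates)
```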

kermitt2 commented 4 years ago

Thank you @caifand ! I fixed this bug (commit aa802c48130ba9fb46561e58d87d2c66eb7846f8): the numbering was not incremented :/ I also fixed an issue with the xml:id in the case of DOIs: they were invalid XML NCNames.

So the new final https://github.com/ourresearch/software-mentions/blob/master/resources/dataset/software/corpus/softcite_corpus.tei.xml should now be well-formed XML.
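
To illustrate why DOI-based identifiers are a problem: an xml:id must be an NCName, which cannot start with a digit and cannot contain `%`. The sketch below uses a simplified, ASCII-only approximation of the NCName rule (the real production also allows many non-ASCII characters), so it is a rough check and not a full validator. The function name `looks_like_ncname` is hypothetical:

```python
import re

# ASCII-only approximation of an XML NCName: a letter or underscore first,
# then letters, digits, '.', '-' or '_'. No colon is allowed anywhere.
ASCII_NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def looks_like_ncname(value):
    """Return True if `value` matches the simplified ASCII NCName pattern."""
    return ASCII_NCNAME.fullmatch(value) is not None


# Identifiers of the kinds that appear in this thread.
for candidate in [
    "10.1257%2Fjep.27.1.3-software-0",   # starts with a digit and contains '%'
    "_10.1257_2Fjep.27.1.3-software-0",  # underscore-prefixed, '%' replaced
    "PMC2881384-software-29",
]:
    print(candidate, looks_like_ncname(candidate))
```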

caifand commented 4 years ago

@kermitt2 That's great! I see that there are some changes in the xml:id, esp. for <rs> in econ articles. Looks good! But it seems the xml:id values in <rs type="software"> have not been fully aligned with the corresp values in the software attribute tags, esp. in econ articles and <ab> elements.

For example, in the econ article <fileDesc xml:id="_10.1257_2Fjep.27.1.3">, we have <rs type="software" xml:id="_10.1257_2Fjep.27.1.3-software-0"> but <rs corresp="#10.1257%2Fjep.27.1.3-software-0">

In PMC article <fileDesc xml:id="PMC2881384"> we have <rs id="PMC2881384-software-29" type="software"> in <ab> but correspondingly <rs corresp="#software-29" resp="#annotator2" type="version">

It seems to me these cases occur in multiple places but do not affect every article. Let me know if more information would be helpful, since we have so many layers in the corpus now :)

kermitt2 commented 4 years ago

Ah yes, I forgot to modify the corresp in the same way! I'll fix it today.

kermitt2 commented 4 years ago

Well done @caifand for finding these errors (and sorry for missing them on my side!), I fixed the wrong @corresp for the two cases you reported, commit ecd484bbeb013be795a10f4eb38fd3c5c9d06807.

caifand commented 4 years ago

Hi @kermitt2

A few days ago I made some direct edits to the corpus. Those edits fixed errors including: (1) the corresp attribute in <rs type="software"> tags (those are in curated tags); (2) xml:id in <rs type="software"> not corresponding with corresp in the other associated <rs> tags (around the same software mention); (3) an unexpected # preceding the xml:id in <rs type="software"> tags; (4) some <rs> xml:id values in <p> repeated in <ab>; (5) in one case, an unexpected id occurring alongside xml:id in <rs resp="#curator" type="software" xml:id="dd622d1559-software-simple-0">. We have manually fixed this last one several times. I am not sure whether any of these require a pipeline-level change. If you think it's concerning I can go into more detail :)

Besides, some corresp values in <rs type="publisher"> and <rs type="URL"> are missing a dash (-) between the hash and the string literal software. I attached two files below. Could you help check those? url-corresp-missing-dash-2020-07-11.csv publisher-corresp-missing-dash-2020-07-11.csv
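
A rough sketch of the pattern check behind those two files, assuming "the hash" means the alphanumeric document identifier that precedes `software` in the pointer. The function name and the two sample values are hypothetical and are not taken from the attached CSVs:

```python
import re

# Flags pointers where 'software' directly follows a letter or digit,
# i.e. the '-' separator between the document identifier and 'software' is missing.
MISSING_DASH = re.compile(r"[A-Za-z0-9]software")


def corresp_missing_dash(corresp):
    """Return True if the corresp value lacks the dash before 'software'."""
    return MISSING_DASH.search(corresp) is not None


# Hypothetical values modelled on the identifier style used in this thread.
print(corresp_missing_dash("#dd622d1559software-0"))   # dash missing
print(corresp_missing_dash("#dd622d1559-software-0"))  # well formed
```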

Thanks!

kermitt2 commented 4 years ago

Hi @caifand !

I've corrected the missing dash for the cases in the two reported files (some publisher and url identifiers). I had indeed forgotten the dash when writing the identifier in one place.

The corpus file has been regenerated using the corrected serialization. As far as I can tell, all your corrections have been injected back and are in the final assembled TEI XML file, including the cases you described. Maybe you could quickly double check softcite_corpus.tei.xml just to be sure.