howisonlab / softcite-dataset

A gold-standard dataset of software mentions in research publications.

Data addition #665

Closed kermitt2 closed 4 years ago

kermitt2 commented 4 years ago

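Proposal for the addition of data to the repo:

  • under softcite-dataset/tei/, all the TEI files corresponding to the PDFs of the dataset, as converted by Grobid
  • under softcite-dataset/json/, all the JSON files with the entity span information for the annotations
  • under softcite-dataset/final/, the annotated corpus in TEI format

And I could put the Python scripts that create the JSON files from the Grobid TEI XML and the annotated corpus under softcite-dataset/code/corpus/?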

jameshowison commented 4 years ago

Yup. Looks good. Hmmm, maybe we should call the annotated corpus files softcite-data/corpus rather than final.

James Howison

Associate Professor School of Information University of Texas at Austin http://james.howison.name

On Thu, Aug 13, 2020 at 12:28 PM Patrice Lopez notifications@github.com wrote:

Proposal for the addition of data to the repo:

  • under softcite-dataset/tei/, all the TEI files corresponding to the PDFs of the dataset, as converted by Grobid
  • under softcite-dataset/json/, all the JSON files with the entity span information for the annotations
  • under softcite-dataset/final/, the annotated corpus in TEI format

And I could put the Python scripts that create the JSON files from the Grobid TEI XML and the annotated corpus under softcite-dataset/code/corpus/?


kermitt2 commented 4 years ago

Ok! Actually, I think I forgot the data/ level, so:

  • softcite-dataset/data/tei/
  • softcite-dataset/data/json/
  • softcite-dataset/data/corpus/

and softcite-dataset/code/corpus/ for the related Python stuff.
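
A minimal sketch for checking a local checkout against this layout, assuming the four directories are data/tei, data/json, data/corpus and code/corpus relative to the repo root. The names `EXPECTED_DIRS` and `missing_dirs` are hypothetical and not part of the repository.

```python
from pathlib import Path

# Directories agreed on in this thread, relative to the repository root.
EXPECTED_DIRS = ["data/tei", "data/json", "data/corpus", "code/corpus"]


def missing_dirs(repo_root):
    """Return the expected directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED_DIRS if not (root / rel).is_dir()]


if __name__ == "__main__":
    # Assumes the script is run from the root of a softcite-dataset checkout.
    for rel in missing_dirs("."):
        print(f"missing: {rel}")
```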

jameshowison commented 4 years ago

Great!

On Thu, Aug 13, 2020 at 12:34 PM Patrice Lopez notifications@github.com wrote:

Ok! actually I forgot the data/ level maybe, so:

  • softcite-dataset/data/tei/
  • softcite-dataset/data/json/
  • softcite-dataset/data/corpus/

and softcite-dataset/code/corpus/ for the related python stuff.


jameshowison commented 4 years ago

Working on these files. Any idea what the empty files (well, files containing only an empty dict) imply?

SOI-A14570-Howison:json howison$ grep -r \{} .
./PMC3049466.json:{}
./PMC3727320.json:{}
./10.1007%2Fs11166-011-9127-z.json:{}
./PMC4506345.json:{}
./PMC2808573.json:{}
./PMC4266938.json:{}
./PMC5238813.json:{}
./PMC5041470.json:{}
./PMC3284254.json:{}
./PMC3065992.json:{}
./PMC4683424.json:{}
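
The grep above matches any file that contains `{}` anywhere in its text. A stricter check, sketched here under the assumption that each file holds a single JSON document, parses every file and reports only those whose top-level value is an empty dict. The helper name `find_empty_json` is hypothetical and not part of the repository scripts.

```python
import json
from pathlib import Path


def find_empty_json(directory):
    """Return the names of .json files whose top-level value is an empty dict."""
    empty = []
    for path in sorted(Path(directory).glob("*.json")):
        try:
            with path.open(encoding="utf-8") as handle:
                data = json.load(handle)
        except json.JSONDecodeError:
            # Unparseable files are a different problem; skip them here.
            continue
        # An empty list is not equal to {}, so only empty dicts match.
        if data == {}:
            empty.append(path.name)
    return empty


if __name__ == "__main__":
    # Assumes the script is run from inside the json/ directory.
    for name in find_empty_json("."):
        print(name)
```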

kermitt2 commented 4 years ago

The JSON files are still a work in progress and not entirely finished. Thanks for pointing out these files; I think it's a bug in the conversion process, because the corresponding TEI files look good.

(This issue was about the location of the files in the repo, which is why I closed it :) I will post updates on the JSON files in #666.)

jameshowison commented 4 years ago

Great. I'll follow #666.