Closed kermitt2 closed 4 years ago
Yup. Looks good. Hmmm, maybe we should call the annotated corpus files softcite-data/corpus rather than final.
James Howison
Associate Professor School of Information University of Texas at Austin http://james.howison.name
On Thu, Aug 13, 2020 at 12:28 PM Patrice Lopez notifications@github.com wrote:
Proposal for the addition of data to the repo:
- under softcite-dataset/tei/, all the TEI files corresponding to the PDFs of the dataset, as converted by Grobid
- under softcite-dataset/json/, all the JSON files with the entity span information for the annotations
- under softcite-dataset/final/, the annotated corpus in TEI format
And I could put the Python scripts that create the JSON files from the Grobid TEI XML and the annotated corpus under softcite-dataset/code/corpus/ ?
Ok! Actually I forgot the data/ level maybe, so:
- softcite-dataset/data/tei/
- softcite-dataset/data/json/
- softcite-dataset/data/corpus/
and softcite-dataset/code/corpus/ for the related Python stuff.
Great!
Working on these files. Any idea what the empty files (well, an empty dict in each file) imply?
SOI-A14570-Howison:json howison$ grep -r \{} .
./PMC3049466.json:{}
./PMC3727320.json:{}
./10.1007%2Fs11166-011-9127-z.json:{}
./PMC4506345.json:{}
./PMC2808573.json:{}
./PMC4266938.json:{}
./PMC5238813.json:{}
./PMC5041470.json:{}
./PMC3284254.json:{}
./PMC3065992.json:{}
./PMC4683424.json:{}
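The same check can be sketched in Python, which is a bit stricter than the grep above: it parses each file and reports only those whose whole content is an empty dict, rather than any line containing `{}`. This is a minimal sketch, not part of the repo's scripts; the function name `find_empty_json` is hypothetical, and skipping unparseable files is an assumption made for illustration.

```python
import json
from pathlib import Path


def find_empty_json(directory):
    """Return relative paths of .json files whose parsed content is an empty dict.

    Hypothetical helper for illustration; it searches recursively, like grep -r.
    """
    root = Path(directory)
    empty = []
    for path in sorted(root.rglob("*.json")):
        try:
            with path.open(encoding="utf-8") as handle:
                content = json.load(handle)
        except json.JSONDecodeError:
            # Assumption: malformed files are a separate problem, so skip them.
            continue
        # An empty list or other empty value does not count, only {}.
        if content == {}:
            empty.append(path.relative_to(root).as_posix())
    return empty
```

Calling `find_empty_json(".")` from inside the json directory would return the list of affected file names, which could then be compared against the corresponding TEI files.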
The JSON files are still a work in progress, not entirely finished... Thanks for pointing to these files. I think it's a bug in the conversion process, because the TEI files look good.
(This issue was about the location of the files in the repo, which is why I closed it :) I will post updates on the JSON files in #666.)
Great. I'll follow #666.