Closed by jameshowison 5 years ago
Yes, there's a version in the software mention recognizer repo:
https://github.com/Impactstory/software-mentions/tree/master/resources/dataset/software/corpus
but it's in the format expected by GROBID for training the software model only, which means I strip out all the other nice stuff extracted by GROBID, like the title, sections, citations, etc.
We could imagine having a format similar to the (rich) output of GROBID, with the addition of the inline annotations coming from the CSV information. There should not be conflicts, I think (everything will sit under the same XML hierarchy). What do you think?
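To make the idea concrete, here is a minimal sketch of how CSV annotations might be merged into GROBID-style TEI while keeping the rest of the structure. Everything specific in it is an assumption for illustration, not the actual pipeline: the function name `annotate`, a CSV with a single `mention` column, and the use of `<rs type="software">` as the inline element. It only matches the leading text of each `<p>`, so paragraphs that already contain child elements would need more careful handling.

```python
import csv
import io
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Serialize TEI elements without a namespace prefix.
ET.register_namespace("", TEI_NS)


def annotate(tei_xml: str, csv_text: str) -> str:
    """Wrap software mentions in <rs type="software"> inside TEI paragraphs.

    Hypothetical helper: assumes the CSV has a 'mention' column holding the
    exact string to mark up. Titles, section heads, citations, and any other
    elements are left untouched.
    """
    root = ET.fromstring(tei_xml)
    rows = csv.DictReader(io.StringIO(csv_text))
    mentions = [row["mention"] for row in rows]

    for p in root.iter(f"{{{TEI_NS}}}p"):
        for mention in mentions:
            # Only the text before the paragraph's first child is searched.
            text = p.text or ""
            start = text.find(mention)
            if start < 0:
                continue
            rs = ET.Element(f"{{{TEI_NS}}}rs", {"type": "software"})
            rs.text = mention
            # Text following the mention becomes the tail of the new element.
            rs.tail = text[start + len(mention):]
            p.text = text[:start]
            p.insert(0, rs)

    return ET.tostring(root, encoding="unicode")
```

Because the annotation is added as an extra inline element, the surrounding `<div>`, `<head>`, and citation elements stay where GROBID put them, which is the "no conflicts" property mentioned above.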
However, I would also advise releasing the dataset only when it's clean and mature, and tied to a version number, to avoid having various incomplete versions circulating, which might lead to confusion.
Right, yes I was thinking about the rich GROBID format with the added inline annotations. But yes, we need to do cleanup first and establish some versioning. I'm going to get my remaining content analysis team working on the cleanup this week.
I'm getting a few requests for the dataset, and I wonder if we're in a position yet (given the continuing cleanup, etc.) to release the post-GROBID, post-mention-matching input XML for machine learning? @kermitt2 what do you think? Or perhaps that's already in your repo?