howisonlab / software-mentions-dataset-analysis

Analyses of software mentions and dependencies
GNU General Public License v3.0

Make comprehensive tables for mentions #13

Open willbeason opened 2 weeks ago

willbeason commented 2 weeks ago

Here's the normalized form of the software mentions files. Note that bounding boxes for references must be in their own table since their unique key additionally requires the id of the reference within the paper.

I'll need to get clarity on what some of these fields mean.

Bolded entries form a primary key, possibly composite.

PaperSchema

MentionsSchema

PagesSchema

ReferencesSchema

MentionsBoundingBoxes

ReferencesBoundingBoxes

willbeason commented 2 weeks ago

Just realized I forgot: there should be a "source" enum as well that specifies the actual object that processing was run on (LaTeX/TEI/etc.). That's an additional column in the primary key of every table.

jameshowison commented 2 weeks ago

Yeah, the UUID is the article (aka "work") but a single article can have multiple sources, so we'll need both.
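To make the "UUID plus source" key concrete, here is a minimal sketch using Python's built-in `sqlite3`. It is an illustration only, not the project's schema: the table names, the column names (`paper_uuid`, `source`, `reference_id`, `box_index`), and the enum values are all assumptions, since the thread only names LaTeX and TEI as example sources.

```python
import sqlite3
from enum import Enum


class Source(Enum):
    """Hypothetical enum values; the thread only names LaTeX and TEI."""
    LATEX = "latex"
    TEI = "tei"


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One row per (article, source) pair; all column names are assumed.
    CREATE TABLE papers (
        paper_uuid TEXT NOT NULL,
        source     TEXT NOT NULL CHECK (source IN ('latex', 'tei')),
        PRIMARY KEY (paper_uuid, source)
    );
    -- Reference bounding boxes additionally need the id of the
    -- reference within the paper, so they get their own table.
    CREATE TABLE references_bounding_boxes (
        paper_uuid   TEXT NOT NULL,
        source       TEXT NOT NULL,
        reference_id INTEGER NOT NULL,
        box_index    INTEGER NOT NULL,
        PRIMARY KEY (paper_uuid, source, reference_id, box_index),
        FOREIGN KEY (paper_uuid, source)
            REFERENCES papers (paper_uuid, source)
    );
    """
)

# The same article (one UUID) can be stored once per source.
uuid = "00000000-0000-0000-0000-000000000001"
for source in Source:
    conn.execute("INSERT INTO papers VALUES (?, ?)", (uuid, source.value))

paper_rows = conn.execute(
    "SELECT COUNT(*) FROM papers WHERE paper_uuid = ?", (uuid,)
).fetchone()[0]

# Inserting the same (UUID, source) pair again violates the composite key.
try:
    conn.execute("INSERT INTO papers VALUES (?, ?)", (uuid, Source.TEI.value))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

With this layout one article yields one row per source, and a repeated (UUID, source) pair is rejected by the database rather than by application code.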

jameshowison commented 2 weeks ago

Are you able to map everything to this documentation? https://github.com/softcite/software-mentions/blob/master/doc/annotation_schema.md