Open heikojansen opened 2 years ago
That's interesting! We've also discussed using CSL to generate training data in the past; I'd be curious to know how a model trained on such data performs with real-world input.
Obviously you would not want to train a model on 1 billion references, but with such a large resource you could just pick out samples (would also be interesting to see if a model improves after the first couple of thousand references).
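As a rough sketch of the "pick out samples" step: if the dataset is too large to load at once, reservoir sampling draws a fixed-size uniform sample in a single pass. Nothing here is specific to GIANT or AnyStyle; the function name, the `seed` parameter, and the idea of streaming records one by one are assumptions for illustration.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k items drawn uniformly at random from a stream.

    Hypothetical helper: the stream could be lines of a large reference
    file, read lazily so the whole dataset never sits in memory.
    """
    rng = random.Random(seed)
    sample: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # both bounds inclusive
            if j < k:
                sample[j] = item
    return sample


# Stand-in for a very large stream of publication records.
picked = reservoir_sample(range(1_000_000), k=2000)
```

Growing `k` in steps (say 1,000, 2,000, 5,000) and retraining each time would be one way to check whether the model still improves after the first couple of thousand references.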
So the basic idea would be to take a random set of publications from the GIANT dataset and, for each publication, create many citations using a number of different CSL styles. Instead of plain strings, though, these citations would be converted to XML <sequence> elements in which the different parts of the citation are split into child elements that declare the type of information they contain. That XML would then be used as training input.
So the most interesting question is how to generate the "annotated" sequences (annotated by way of XML elements) for different CSL styles.
Is there a list of allowed child element names for the <sequence> elements available?
You can put any element into the sequence: each element will correspond to a label that is known to the model. From what I saw it should be enough to wrap each generated XML reference in a <sequence> and then the whole sample in a <dataset>.
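The wrapping described above could be sketched roughly as follows, using Python's standard `xml.etree.ElementTree`. The label names (`author`, `date`, `title`, `journal`, `volume`, `pages`), the sample references, and the `build_dataset` helper are illustrative assumptions, not a confirmed list of what AnyStyle expects; the labelled segments are assumed to come from whatever step renders a publication in a given CSL style.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# One labelled piece of a rendered citation: (label, text).
Segment = Tuple[str, str]


def build_dataset(references: List[List[Segment]]) -> ET.Element:
    """Wrap each labelled reference in <sequence>, and all of them in <dataset>."""
    dataset = ET.Element("dataset")
    for segments in references:
        sequence = ET.SubElement(dataset, "sequence")
        for label, text in segments:
            # Each child element name acts as the label for its text.
            child = ET.SubElement(sequence, label)
            child.text = text
    return dataset


# The same made-up publication rendered in two made-up styles.
references = [
    [("author", "Doe, J."), ("date", "(2020)."),
     ("title", "An example title."), ("journal", "Journal of Examples,"),
     ("volume", "12,"), ("pages", "45-67.")],
    [("author", "J. Doe,"), ("title", "An example title,"),
     ("journal", "J. Examples,"), ("volume", "vol. 12,"),
     ("pages", "pp. 45-67,"), ("date", "2020.")],
]

xml_text = ET.tostring(build_dataset(references), encoding="unicode")
print(xml_text)
```

Keeping punctuation inside the labelled segments, as done here, is a design choice of the sketch; whether separators belong to the preceding element or should be handled differently would need to be checked against the existing training data.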
This isn't exactly an issue but a question: would you consider it feasible and worthwhile to adopt the data generated here (GIANT: The 1-Billion Annotated Synthetic Bibliographic-Reference-String Dataset for Deep Citation Parsing) as training input for AnyStyle? Just curious whether you see enough potential there.