inukshuk / anystyle

Fast citation reference parsing
https://anystyle.io

Training set creation using data from GIANT project? #198

Open heikojansen opened 2 years ago

heikojansen commented 2 years ago

This isn't exactly an issue but a question: would you consider it feasible and worthwhile to adopt the data generated here, "GIANT: The 1-Billion Annotated Synthetic Bibliographic-Reference-String Dataset for Deep Citation Parsing", as training input for AnyStyle? Just curious if you see enough potential there.

inukshuk commented 2 years ago

That's interesting! We've also discussed using CSL to generate training data in the past; I'd be curious to know how a model trained on such data performs with real-world input.

Obviously you would not want to train a model on 1 billion references, but with such a large resource you could just pick out samples (it would also be interesting to see whether a model still improves after the first couple of thousand references).
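For instance, a single pass over a large export can draw a uniform sample without loading everything into memory. A minimal sketch in Ruby, assuming the GIANT data has been flattened to one reference per line (the file names are hypothetical):

```ruby
# Reservoir sampling (Algorithm R): keep a uniform random sample of N lines
# from a file far too large to fit in memory.
N = 5_000
sample = []

File.foreach('giant-references.txt').with_index do |line, i|
  if i < N
    sample << line            # fill the reservoir first
  else
    j = rand(i + 1)           # each later line replaces a slot with probability N/(i+1)
    sample[j] = line if j < N
  end
end

File.write('sample.txt', sample.join)
```

Training models on growing prefixes of such a sample would also answer the learning-curve question above.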

heikojansen commented 2 years ago

So the basic idea would be to take a random set of publications from the GIANT dataset and, for each publication, create many citations using a number of different CSL styles; except that instead of plain strings, these citations would be emitted as XML sequence elements in which the different parts of the citation are chopped up into child elements declaring the type of information they contain. That XML would then be used as training input. (A rough sketch of the rendering step follows below.)
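The rendering half of that pipeline is already covered by citeproc-ruby. A sketch, assuming the `citeproc-ruby` and `csl-styles` gems are installed; the item data here is made up:

```ruby
require 'citeproc'
require 'csl/styles'

# One CSL-JSON item standing in for a record from the source dataset.
item = {
  'id'              => 'demo',
  'type'            => 'article-journal',
  'author'          => [{ 'family' => 'Doe', 'given' => 'Jane' }],
  'title'           => 'A Synthetic Example',
  'container-title' => 'Journal of Examples',
  'volume'          => '12',
  'page'            => '1-10',
  'issued'          => { 'date-parts' => [[2020]] }
}

# Render the same item in several styles to multiply the training data.
%w[apa chicago-author-date ieee].each do |style|
  cp = CiteProc::Processor.new(style: style, format: 'text')
  cp.import([item])
  puts cp.render(:bibliography, id: 'demo')
end
```

This yields plain strings, though; the harder part, wrapping each rendered field in a labelled child element, is exactly the annotation question below (as far as I understand the paper, the GIANT authors solved it by instrumenting the CSL processor itself).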

So the most interesting question is how to generate the "annotated" (by way of XML elements) sequences for the different CSL styles. Is there a list of the child element names that are allowed inside the sequence elements?

inukshuk commented 2 years ago

You can put any element into the sequence: each element will correspond to a label that is known to the model. From what I saw, it should be enough to wrap each generated XML reference in a `<sequence>` and then the whole sample in a `<dataset>`.
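For example, a generated sample could look like this (the label names are illustrative, loosely following those used in AnyStyle's stock training data; since any element name becomes a label, they are not a fixed schema):

```xml
<dataset>
  <sequence>
    <author>Doe, J.</author>
    <date>(2020).</date>
    <title>A synthetic example.</title>
    <journal>Journal of Examples,</journal>
    <volume>12,</volume>
    <pages>1-10.</pages>
  </sequence>
  <!-- more <sequence> elements, one per generated reference -->
</dataset>
```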